Configure Cloud Object Storage (GCS, S3, Azure) for Data Lake
Ilum allows you to link GCS, S3, WASBS, and HDFS storage to your clusters. Linking storage lets Ilum automatically configure all your jobs to use your cloud data lakes seamlessly, eliminating the need for manual Spark parameter configuration.
Supported Storage Providers
| Provider | Type | Description |
|---|---|---|
| Google Cloud Storage | GCS | Native integration for GCP projects. |
| Amazon S3 | S3 | Standard S3 and S3-compatible storage support. |
| Azure Blob Storage | WASBS/ABFS | Integration for Azure data lakes. |
| HDFS | HDFS | Connect to existing Hadoop Distributed File Systems. |
- Google Cloud Storage (GCS)
- Amazon S3
- Azure Blob Storage

Google Cloud Storage (GCS)
Step 1: Create a GCS Bucket
- Create a Google Cloud Project
  - Open Google Cloud Console and navigate to the project selector / Manage Resources.
  - Click New Project / Create Project.
  - Enter a Project name, and choose an Organization and Location.
- Create a GCS Bucket
  - In the Console, navigate to Cloud Storage → Buckets.
  - Click Create.
  - Enter a globally unique Bucket name (e.g., my-ilum-bucket) and select your Region.

  Note: Remember the bucket name you created; you will need it when adding this storage to Ilum.
- Create a Service Account and JSON Key
  - Navigate to IAM & Admin → Service Accounts.
  - Click Create Service Account, fill in the details, and grant the Storage Admin role.
  - Click the created service account's email, go to the Keys tab, and Create new key (JSON).
  - Save the downloaded JSON file securely.

  Important: Organization Policy Update: In new organizations, creating service account keys might be disabled by default. Contact your administrator if you cannot create keys.
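Before pasting values into Ilum, it can help to sanity-check the downloaded key file. The snippet below is an illustrative helper (not part of Ilum) that verifies a GCS service-account JSON key contains the fields Ilum asks for; the sample key shown is fabricated and non-functional.

```python
import json

# Fields Ilum's GCS authorization form expects to find in the key file
REQUIRED = ("client_email", "private_key", "private_key_id")

def check_gcs_key(raw: str) -> list:
    """Return a list of problems found in a GCS service-account JSON key."""
    key = json.loads(raw)
    problems = [f for f in REQUIRED if f not in key]
    pk = key.get("private_key", "")
    if pk and not pk.startswith("-----BEGIN PRIVATE KEY-----"):
        problems.append("private_key missing BEGIN header")
    return problems

# Example with a fabricated (non-functional) key:
sample = json.dumps({
    "type": "service_account",
    "client_email": "ilum-sa@my-project.iam.gserviceaccount.com",
    "private_key": "-----BEGIN PRIVATE KEY-----\nMIIE...\n-----END PRIVATE KEY-----\n",
    "private_key_id": "abc123",
})
print(check_gcs_key(sample))  # [] (no problems)
```

An empty list means all three fields are present and the private key has the expected header.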
Step 2: Add GCS to Ilum Cluster
- Navigate to Workload → Clusters → Edit → Storage → Add Storage.
- Configure General Settings:

| Parameter | Value Example | Description |
|---|---|---|
| Name | my-gcs-storage | Unique name for this storage config. |
| Type | GCS | Select the GCS provider. |
| Spark Bucket | my-ilum-bucket | Bucket for Spark logs/events. |
| Data Bucket | my-ilum-bucket | Bucket for your data. |
- Configure GCS Authorization: Open your JSON key file and copy the values:

| Parameter | Source Key | Description |
|---|---|---|
| Client Email | client_email | Service account email address. |
| Private Key | private_key | Full key including -----BEGIN.... |
| Private Key ID | private_key_id | Key ID string. |
- Click Submit to save.
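For context, this is roughly the Spark configuration that linking GCS storage saves you from writing by hand. The property names below come from the open-source GCS Hadoop connector, not from Ilum's documentation, and exact keys can vary by connector version; treat this as a sketch of what is configured, not a literal dump.

```
spark.hadoop.fs.gs.auth.service.account.enable=true
spark.hadoop.fs.gs.auth.service.account.email=<client_email>
spark.hadoop.fs.gs.auth.service.account.private.key=<private_key>
spark.hadoop.fs.gs.auth.service.account.private.key.id=<private_key_id>
```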
Amazon S3
The process for adding S3 storage is nearly identical to GCS. You will need to provide your AWS credentials (Access Key and Secret Key) instead of a JSON key file.
- Navigate to Workload → Clusters → Edit → Storage → Add Storage.
- Select S3 as the Type.
- Fill in the required fields:
| Parameter | Description |
|---|---|
| Name | Unique name for this storage config. |
| Access Key | Your AWS Access Key ID. |
| Secret Key | Your AWS Secret Access Key. |
| Region | AWS Region of your bucket (e.g., us-east-1). |
| Endpoint | (Optional) Custom endpoint for S3-compatible storage (e.g., MinIO). |
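As with GCS, these fields correspond to standard Spark/Hadoop S3A properties. The keys below are from the Hadoop S3A connector (an assumption about how Ilum applies them, not documented here); the last two are typically needed only for S3-compatible endpoints such as MinIO.

```
spark.hadoop.fs.s3a.access.key=<access_key>
spark.hadoop.fs.s3a.secret.key=<secret_key>
spark.hadoop.fs.s3a.endpoint=<endpoint>
spark.hadoop.fs.s3a.path.style.access=true
```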
Azure Blob Storage
The process for adding Azure storage is nearly identical to GCS and S3. You will need your Azure Storage Account Name and Access Key.
- Navigate to Workload → Clusters → Edit → Storage → Add Storage.
- Select Azure (or WASBS) as the Type.
- Fill in the required fields:
| Parameter | Description |
|---|---|
| Name | Unique name for this storage config. |
| Account Name | Your Azure Storage Account name. |
| Account Key | Your Azure Storage Account Access Key. |
| Container | Name of the container to use. |
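For WASBS access, the account name and key map to a single Hadoop property. The property name below is the standard Hadoop Azure (WASB) setting; whether Ilum sets exactly this key is an assumption.

```
spark.hadoop.fs.azure.account.key.<account_name>.blob.core.windows.net=<account_key>
```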
Step 3: Verify Connection
To ensure your storage is correctly configured, run a simple Spark job.
- Create a Code Service:
  - Navigate to Workload → Services → New Service +.
  - Select Type: Code, Language: Scala, and your Cluster.
- Execute Test Code: Paste and run the following Scala code:

```scala
// Write test data
val data = Seq(("Alice", 34), ("Bob", 45))
val df = spark.createDataFrame(data).toDF("name", "age")

// Replace with your bucket path (e.g., gs://..., s3a://..., wasbs://...)
val path = "gs://my-ilum-bucket/output/"
df.write.mode("overwrite").format("csv").save(path)

// Read back the data
spark.read.format("csv").load(path).show()
```

- Check Results: If the job completes and displays the data table, your storage connection is active.
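The bucket path in the test code differs by provider. The small helper below is purely illustrative (it is not part of Ilum or Spark) and shows the URI shape each provider's filesystem scheme expects:

```python
def storage_uri(provider: str, bucket: str, path: str, account: str = None) -> str:
    """Build an object-store URI for the given provider's Hadoop filesystem scheme."""
    if provider == "gcs":
        return f"gs://{bucket}/{path}"
    if provider == "s3":
        # s3a:// is the Hadoop S3A scheme used by modern Spark deployments
        return f"s3a://{bucket}/{path}"
    if provider == "azure":
        # WASBS addresses a container within a storage account
        return f"wasbs://{bucket}@{account}.blob.core.windows.net/{path}"
    raise ValueError(f"unknown provider: {provider}")

print(storage_uri("gcs", "my-ilum-bucket", "output/"))
# gs://my-ilum-bucket/output/
```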
Common Issues & FAQ
Why do I get a "Permission Denied" error?
Cause: The service account or user doesn't have permission to access the bucket.
Solution:
- Go to your cloud provider's console (e.g., Google Cloud Console).
- Navigate to the bucket's Permissions tab.
- Grant your service account the Storage Admin or Storage Object Admin role.
Why does it say "Bucket does not exist"?
Cause: The bucket name in your code doesn't match the actual bucket name, or the region is incorrect.
Solution:
- Verify the bucket exists in your cloud console.
- Check that the bucket name in your code matches exactly (names are often case-sensitive).
Why do I get "Invalid credentials"?
Cause: The keys (JSON or Access Keys) were not copied correctly.
Solution:
- Re-open your key file.
- Carefully copy the values again. For GCS, ensure you include the -----BEGIN PRIVATE KEY----- and -----END PRIVATE KEY----- lines.
- Re-save the storage configuration in Ilum.