
Securely Managing Secrets for Spark Jobs

Security is critical when handling sensitive data such as passwords, API keys, and connection strings. Hardcoding credentials in your job code or Docker images is a major security risk. Instead, Ilum leverages native Kubernetes Secrets to inject credentials securely at runtime.

Quick Summary
  • Create: Store sensitive data in Kubernetes Secrets (kubectl create secret generic).
  • Mount: Inject secrets into Spark drivers/executors as Environment Variables (recommended) or Volume Mounts.
  • Verify: Access them in your code (e.g., os.environ.get) without exposing them in logs.

Step 1: Creating Kubernetes Secrets for Spark

First, you need to create a Kubernetes Secret object. This secret must reside in the same namespace where your Ilum job will be executed (typically default or your specific tenant namespace).

Create Secret Command
kubectl create secret generic my-db-creds \
--from-literal=username=admin \
--from-literal=password=SuperSecret123 \
-n default
Namespace Requirement

Secrets are namespaced objects. They must be created in the same namespace where your Spark jobs run. For Ilum, this is typically default. Use the -n <namespace> flag to specify the target namespace.
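Equivalently, the same secret can be declared as a manifest. This is a sketch of the standard Kubernetes Secret format; the stringData field accepts plain-text values, which Kubernetes base64-encodes on write:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: my-db-creds
  namespace: default
type: Opaque
stringData:              # plain-text values; use "data" for base64-encoded values
  username: admin
  password: SuperSecret123
```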

Step 2: Mounting Secrets in Ilum Spark Jobs

You can expose the secret to your Spark driver and executors in two ways: as environment variables or as files via volume mounts.

Method A: Environment Variables (Recommended)

This is the most common method for passing simple keys like database passwords or API tokens.

When submitting a job (via Ilum UI, API, or CLI), add the following Spark configurations to map secret keys to environment variables:

Spark Config
# Valid for Spark on Kubernetes
# Syntax: spark.kubernetes.[driver|executor].secretKeyRef.[ENV_VAR_NAME]=name-of-secret:key-in-secret

spark.kubernetes.driver.secretKeyRef.DB_PASSWORD=my-db-creds:password
spark.kubernetes.executor.secretKeyRef.DB_PASSWORD=my-db-creds:password
spark.kubernetes.driver.secretKeyRef.DB_USER=my-db-creds:username
spark.kubernetes.executor.secretKeyRef.DB_USER=my-db-creds:username
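With the configuration above applied, the driver and executors see DB_USER and DB_PASSWORD as ordinary environment variables. A minimal sketch of reading them in a PySpark script, failing fast when a variable is missing (the JDBC usage in the comment is an illustration, not part of the configs above):

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail fast with a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Required environment variable {name} is not set")
    return value

# Variable names match the secretKeyRef mappings above:
# db_user = require_env("DB_USER")
# db_password = require_env("DB_PASSWORD")
#
# Example use in a JDBC read (URL and table are placeholders):
# df = spark.read.jdbc(url, table="accounts",
#                      properties={"user": db_user, "password": db_password})
```

Failing fast with a descriptive error is preferable to letting a None value surface later as an opaque authentication failure deep inside a connector.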

Method B: Volume Mounts

Use this method when your application expects a file path (e.g., SSL certificates, Keytab files, or configuration files).

Mount the entire secret as a directory. Each key in the secret becomes a file in that directory.

Spark Config
# Mount secret 'my-db-creds' to /etc/secrets/db
spark.kubernetes.driver.secrets.my-db-creds=/etc/secrets/db
spark.kubernetes.executor.secrets.my-db-creds=/etc/secrets/db
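Once mounted, each key in the secret appears as a file under the mount path, so the password key above becomes /etc/secrets/db/password. A sketch of reading such a file from application code:

```python
from pathlib import Path

def read_secret_file(path: str) -> str:
    """Read a mounted secret file, stripping the trailing newline some tools add."""
    return Path(path).read_text().strip()

# Path matches the mount configured above:
# db_password = read_secret_file("/etc/secrets/db/password")
```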

Best Practices: Env Vars vs. Volumes

| Feature | Environment Variables | Volume Mounts |
|---|---|---|
| Best for | Simple strings (API keys, passwords, URLs) | Files (certificates, JSON configs, keytabs) |
| Complexity | Low (standard os.environ access) | Medium (file I/O required) |
| Updates | Requires job restart | Can update live (if the app supports reloading) |
| Security | Visible in a process dump (rare risk) | Written to tmpfs (memory), safer for large secrets |

Step 3: Using Secrets in Airflow DAGs

If you are orchestrating jobs with the Ilum Airflow integration, you can define these configurations directly in your DAGs using the IlumSparkSubmitOperator.

airflow_dag.py
from airflow import DAG
from ilum.airflow.operators import IlumSparkSubmitOperator

# ... (DAG definition)

submit_spark_job = IlumSparkSubmitOperator(
    task_id="secure_spark_job",
    spark_conf={
        "spark.kubernetes.driver.secretKeyRef.DB_PASSWORD": "my-db-creds:password",
        "spark.kubernetes.executor.secretKeyRef.DB_PASSWORD": "my-db-creds:password",
    },
    # ... other configurations
)

Troubleshooting & Verification

Verifying Secret Existence

Before running your job, verify the secret exists and contains the correct data:

Verify Secret
# List secrets in the namespace
kubectl get secrets -n default

# Decode secret values (for debugging only!)
kubectl get secret my-db-creds -o jsonpath='{.data.password}' -n default | base64 --decode

Debugging Running Pods

If your job fails with authentication errors, inspect the running pod to ensure variables are mounted correctly.

Inspect Pod Environment
# 1. Find the driver pod name
kubectl get pods -n default | grep spark-driver

# 2. Check environment variables inside the pod
kubectl exec -it <spark-driver-pod-name> -n default -- env | grep DB_
Security Reminder

Never print secret values to the console or logs in your production code. If you must debug, print only the first few characters or a checksum/hash of the secret.
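A sketch of a log-safe helper along those lines: it prints only the first couple of characters plus a short SHA-256 prefix, enough to tell two secrets apart without revealing either. The fingerprint format is an illustration, not an Ilum convention:

```python
import hashlib

def secret_fingerprint(value: str, visible: int = 2) -> str:
    """Return a log-safe fingerprint: a short visible prefix plus a SHA-256 prefix."""
    digest = hashlib.sha256(value.encode()).hexdigest()[:8]
    return f"{value[:visible]}*** (sha256:{digest})"

# Log the fingerprint, never the raw value:
# print(f"Using DB password {secret_fingerprint(db_password)}")
```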


Frequently Asked Questions (FAQ)

How do I access secrets in PySpark?

The most secure way is to map the Kubernetes secret to an environment variable using the spark.kubernetes.driver.secretKeyRef.[VAR_NAME] configuration. Then, in your PySpark script, use import os and os.environ.get('VAR_NAME') to retrieve the value.

Can I use external secret stores (Vault, AWS Secrets Manager)?

Yes. You can use the Kubernetes Secrets Store CSI Driver to sync secrets from external providers (HashiCorp Vault, AWS, Azure, GCP) into native Kubernetes Secrets. Once synced, Ilum consumes them just like standard Kubernetes secrets.

Why is my secret not visible to the Spark job?

Common reasons include:

  • Namespace Mismatch: The secret exists in default but the job is running in spark-jobs.
  • Typo in Key Name: The key in the secret (e.g., db-pass) doesn't match the config (e.g., db_pass).
  • Service Account Permissions: The Spark Service Account might lack get/list permissions for Secrets (though standard Ilum setups handle this).
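Each of these causes can be checked directly with kubectl. The service-account name spark below is an assumption — substitute whichever account your Ilum deployment actually runs jobs under:

```shell
# 1. Namespace mismatch: confirm the secret exists where the job runs,
#    and list its key names (values are stored base64-encoded):
kubectl get secret my-db-creds -n default -o jsonpath='{.data}'

# 2. Typo in key name: compare the key names printed above against
#    the right-hand side of your secretKeyRef configs.

# 3. Permissions: check whether the job's service account may read Secrets
#    ("spark" is an assumed service-account name):
kubectl auth can-i get secrets -n default \
  --as=system:serviceaccount:default:spark

# base64 round-trip, as used when decoding secret values:
echo -n 'SuperSecret123' | base64            # U3VwZXJTZWNyZXQxMjM=
echo 'U3VwZXJTZWNyZXQxMjM=' | base64 --decode  # SuperSecret123
```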