
Securely Managing Secrets for Spark Jobs

Security is critical when handling sensitive data such as passwords, API keys, and connection strings. Hardcoding credentials in your job code or Docker images is a major security risk. Instead, Ilum leverages native Kubernetes Secrets to inject credentials securely at runtime.

Quick Summary
  • Create: Store sensitive data in Kubernetes Secrets (kubectl create secret generic).
  • Mount: Inject secrets into Spark drivers/executors as Environment Variables (recommended) or Volume Mounts.
  • Verify: Access them in your code (e.g., os.environ.get) without exposing them in logs.

Step 1: Creating Kubernetes Secrets for Spark

First, you need to create a Kubernetes Secret object. This secret must reside in the same namespace where your Ilum job will be executed (typically default or your specific tenant namespace).

Create Secret Command
kubectl create secret generic my-db-creds \
--from-literal=username=admin \
--from-literal=password=SuperSecret123 \
-n default
Namespace Requirement

Secrets are namespaced objects. They must be created in the same namespace where your Spark jobs run. For Ilum, this is typically default. Use the -n <namespace> flag to specify the target namespace.
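Equivalently, the same secret can be declared as a manifest. This is a sketch of the standard Kubernetes Secret format; the stringData field accepts plain-text values, which Kubernetes base64-encodes on write:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: my-db-creds
  namespace: default
type: Opaque
stringData:              # plain-text values; use "data" for base64-encoded values
  username: admin
  password: SuperSecret123
```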

Step 2: Mounting Secrets in Ilum Spark Jobs

You can expose the secret to your Spark driver and executors in two ways: as environment variables or as files via volume mounts.

Method A: Environment Variables (Recommended)

This is the most common method for passing simple keys like database passwords or API tokens.

When submitting a job (via Ilum UI, API, or CLI), add the following Spark configurations to map secret keys to environment variables:

Spark Config
# Valid for Spark on Kubernetes
# Syntax: spark.kubernetes.[driver|executor].secretKeyRef.[ENV_VAR_NAME]=name-of-secret:key-in-secret

spark.kubernetes.driver.secretKeyRef.DB_PASSWORD=my-db-creds:password
spark.kubernetes.executor.secretKeyRef.DB_PASSWORD=my-db-creds:password
spark.kubernetes.driver.secretKeyRef.DB_USER=my-db-creds:username
spark.kubernetes.executor.secretKeyRef.DB_USER=my-db-creds:username
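With the configuration above applied, the driver and executors see DB_USER and DB_PASSWORD as ordinary environment variables. A minimal sketch of reading them in a PySpark script, failing fast when a variable is missing (the JDBC usage in the comment is an illustration, not part of the configs above):

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail fast with a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Required environment variable {name} is not set")
    return value

# Variable names match the secretKeyRef mappings above:
# db_user = require_env("DB_USER")
# db_password = require_env("DB_PASSWORD")
#
# Example use in a JDBC read (URL and table are placeholders):
# df = spark.read.jdbc(url, table="accounts",
#                      properties={"user": db_user, "password": db_password})
```

Failing fast with a descriptive error is preferable to letting a None value surface later as an opaque authentication failure deep inside a connector.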

Method B: Volume Mounts

Use this method when your application expects a file path (e.g., SSL certificates, Keytab files, or configuration files).

Mount the entire secret as a directory. Each key in the secret becomes a file in that directory.

Spark Config
# Mount secret 'my-db-creds' to /etc/secrets/db
spark.kubernetes.driver.secrets.my-db-creds=/etc/secrets/db
spark.kubernetes.executor.secrets.my-db-creds=/etc/secrets/db
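Once mounted, each key in the secret appears as a file under the mount path, so the password key above becomes /etc/secrets/db/password. A sketch of reading such a file from application code:

```python
from pathlib import Path

def read_secret_file(path: str) -> str:
    """Read a mounted secret file, stripping the trailing newline some tools add."""
    return Path(path).read_text().strip()

# Path matches the mount configured above:
# db_password = read_secret_file("/etc/secrets/db/password")
```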

Best Practices: Env Vars vs. Volumes

| Feature | Environment Variables | Volume Mounts |
|---|---|---|
| Best for | Simple strings (API keys, passwords, URLs) | Files (certificates, JSON configs, keytabs) |
| Complexity | Low (standard os.environ access) | Medium (file I/O required) |
| Updates | Requires job restart | Can update live (if the app supports reloading) |
| Security | Visible in a process dump (rare risk) | Written to tmpfs (memory), safer for large secrets |

Step 3: Using Secrets in Airflow DAGs

If you are orchestrating jobs with the Ilum Airflow integration, you can define these configurations directly in your DAGs using the IlumSparkSubmitOperator.

airflow_dag.py
from airflow import DAG
from ilum.airflow.operators import IlumSparkSubmitOperator

# ... (DAG definition)

submit_spark_job = IlumSparkSubmitOperator(
    task_id="secure_spark_job",
    spark_conf={
        "spark.kubernetes.driver.secretKeyRef.DB_PASSWORD": "my-db-creds:password",
        "spark.kubernetes.executor.secretKeyRef.DB_PASSWORD": "my-db-creds:password",
    },
    # ... other configurations
)

Troubleshooting & Verification

Verifying Secret Existence

Before running your job, verify the secret exists and contains the correct data:

Verify Secret
# List secrets in the namespace
kubectl get secrets -n default

# Decode secret values (for debugging only!)
kubectl get secret my-db-creds -o jsonpath='{.data.password}' -n default | base64 --decode

Debugging Running Pods

If your job fails with authentication errors, inspect the running pod to ensure variables are mounted correctly.

Inspect Pod Environment
# 1. Find the driver pod name
kubectl get pods -n default | grep spark-driver

# 2. Check environment variables inside the pod
kubectl exec -it <spark-driver-pod-name> -n default -- env | grep DB_
Security Reminder

Never print secret values to the console or logs in your production code. If you must debug, print only the first few characters or a checksum/hash of the secret.
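A sketch of a log-safe helper along those lines: it prints only the first couple of characters plus a short SHA-256 prefix, enough to tell two secrets apart without revealing either. The fingerprint format is an illustration, not an Ilum convention:

```python
import hashlib

def secret_fingerprint(value: str, visible: int = 2) -> str:
    """Return a log-safe fingerprint: a short visible prefix plus a SHA-256 prefix."""
    digest = hashlib.sha256(value.encode()).hexdigest()[:8]
    return f"{value[:visible]}*** (sha256:{digest})"

# Log the fingerprint, never the raw value:
# print(f"Using DB password {secret_fingerprint(db_password)}")
```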


Frequently Asked Questions (FAQ)

How do I access secrets in PySpark?

The most secure way is to map the Kubernetes secret to an environment variable using the spark.kubernetes.driver.secretKeyRef.[VAR_NAME] configuration. Then, in your PySpark script, use import os and os.environ.get('VAR_NAME') to retrieve the value.

Can I use external secret stores (Vault, AWS Secrets Manager)?

Yes. You can use the Kubernetes Secrets Store CSI Driver to sync secrets from external providers (HashiCorp Vault, AWS, Azure, GCP) into native Kubernetes Secrets. Once synced, Ilum consumes them just like standard Kubernetes secrets.

Why is my secret not visible to the Spark job?

Common reasons include:

  • Namespace Mismatch: The secret exists in default but the job is running in spark-jobs.
  • Typo in Key Name: The key in the secret (e.g., db-pass) doesn't match the config (e.g., db_pass).
  • Service Account Permissions: The Spark Service Account might lack get/list permissions for Secrets (though standard Ilum setups handle this).
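Each of these causes can be checked directly with kubectl. The service-account name spark below is an assumption — substitute whichever account your Ilum deployment actually runs jobs under:

```shell
# 1. Namespace mismatch: confirm the secret exists where the job runs,
#    and list its key names (values are stored base64-encoded):
kubectl get secret my-db-creds -n default -o jsonpath='{.data}'

# 2. Typo in key name: compare the key names printed above against
#    the right-hand side of your secretKeyRef configs.

# 3. Permissions: check whether the job's service account may read Secrets
#    ("spark" is an assumed service-account name):
kubectl auth can-i get secrets -n default \
  --as=system:serviceaccount:default:spark

# base64 round-trip, as used when decoding secret values:
echo -n 'SuperSecret123' | base64            # U3VwZXJTZWNyZXQxMjM=
echo 'U3VwZXJTZWNyZXQxMjM=' | base64 --decode  # SuperSecret123
```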