MLflow Integration
System Architecture
Ilum integrates MLflow as a managed service within the Kubernetes cluster, providing a centralized control plane for MLOps workflows. The architecture consists of three primary components:
- Tracking Server: A centralized REST API server (ilum-mlflow-tracking) handling metadata and artifact requests.
- Backend Store: A PostgreSQL database managing entity metadata (runs, parameters, metrics, tags, and model registry data).
- Artifact Store: An S3-compatible object storage layer (defaulting to MinIO or external S3) for persisting large artifacts such as model binaries, plots, and datasets.
Network Topology
In the Ilum ecosystem, Spark Drivers and Executors interact with MLflow through internal Kubernetes networking.
- Metadata Operations: The Spark Driver communicates directly with the Tracking Server via HTTP/REST.
- Artifact Operations: Both Drivers and Executors interact with the S3 Artifact Store to upload/download model files. Credentials and endpoints are automatically injected into the Spark environment.
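Since credentials and endpoints are injected automatically, a quick sanity check from inside a driver pod is to look for the expected variables. The sketch below assumes conventional MLflow/AWS variable names for the S3 side (MLFLOW_S3_ENDPOINT_URL, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY); only MLFLOW_TRACKING_URI is documented explicitly below.

```python
import os

# Environment variables the platform is expected to inject into Spark pods.
# MLFLOW_TRACKING_URI is documented in this page; the S3-related names
# follow MLflow/AWS conventions and are assumptions here.
EXPECTED_VARS = (
    "MLFLOW_TRACKING_URI",
    "MLFLOW_S3_ENDPOINT_URL",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
)

def missing_vars(env, expected=EXPECTED_VARS):
    """Return the expected variables absent from the given environment."""
    return [name for name in expected if name not in env]

if __name__ == "__main__":
    # Run inside a driver pod to sanity-check the injected configuration
    print(missing_vars(os.environ))
```

An empty list means the pod received the full tracking and artifact configuration.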
Configuration and Deployment
Helm Configuration
MLflow is enabled via the core Ilum Helm chart. The deployment ensures that the tracking server is co-located with the necessary storage configurations.
helm upgrade \
  --set mlflow.enabled=true \
  --reuse-values ilum ilum/ilum
This configuration provisions:
- A Kubernetes Deployment for the MLflow server.
- A ClusterIP Service exposing port 5000.
- Auto-wired secrets for database and S3 access.
Spark Runtime Environment
To utilize MLflow features, Spark workloads must run in an environment containing the MLflow Python client and compatible dependencies. Ilum provides optimized Docker images for this purpose.
Default Image: ilum/spark:3.5.3-mlflow
This image includes:
- mlflow (Python client)
- boto3 (AWS SDK for S3 artifact access)
- psycopg2 (PostgreSQL adapter)
- Java dependencies for S3A file system support
Job Configuration
For custom Spark jobs or interactive sessions, define the image in the Kubernetes specification:
spec:
  containers:
    - name: spark-kubernetes-driver
      image: ilum/spark:3.5.3-mlflow
Experiment Tracking
Ilum pre-configures the environment variables required for tracking. The MLFLOW_TRACKING_URI is set to http://ilum-mlflow-tracking:5000 by default.
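A minimal sketch of how a client can resolve the tracking URI, falling back to the in-cluster default documented above when the variable is unset (the helper name is illustrative):

```python
import os

def resolve_tracking_uri(env=os.environ):
    """Use the injected MLFLOW_TRACKING_URI, falling back to
    Ilum's in-cluster default when the variable is unset."""
    return env.get("MLFLOW_TRACKING_URI", "http://ilum-mlflow-tracking:5000")

# In a job you would typically pass this straight to the client, e.g.:
#   import mlflow
#   mlflow.set_tracking_uri(resolve_tracking_uri())
```

This makes the same code runnable both inside the cluster (where the variable is injected) and locally against a port-forwarded server.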
Spark Autologging
For Spark MLlib workloads, mlflow.spark.autolog() captures datasource information, Spark configurations, and model metrics automatically.
from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler
import mlflow
import mlflow.spark

# Initialize Spark
spark = SparkSession.builder.appName("MLflow_Example").getOrCreate()

# Enable Spark Autologging
mlflow.spark.autolog()

# Prepare Data
data = spark.read.format("libsvm").load("sample_libsvm_data.txt")
train_data, test_data = data.randomSplit([0.7, 0.3])

# Define Experiment
mlflow.set_experiment("/Ilum/SparkExperiments")

with mlflow.start_run(run_name="LinearRegression_v1"):
    # Training
    lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
    model = lr.fit(train_data)

    # Autologging captures:
    # - Parameters: maxIter, regParam, elasticNetParam
    # - Metrics: RMSE, r2, mae
    # - Artifacts: The underlying Spark model

    # Custom Logging
    mlflow.log_param("custom_param", "value")
Distributed Context Considerations
When running distributed training (e.g., Spark MLlib, Horovod, or TorchDistributor):
- Driver Role: The Spark Driver handles communication with the Tracking Server. It aggregates metrics and parameters before sending them to MLflow.
- Executor Role: Executors perform the heavy lifting. Logs generated on executors are not automatically sent to MLflow. Use spark.sparkContext.setLocalProperty to associate Spark jobs with MLflow runs for better traceability in the Spark UI.
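The driver-side aggregation pattern can be sketched without a cluster at hand: executors compute per-partition metrics, and only the driver talks to the tracking server. The helper below is a pure-Python illustration; the commented lines show where Spark and MLflow would plug in.

```python
def aggregate_partition_losses(partition_losses):
    """Reduce per-partition losses (computed on executors) to a single
    driver-side value before logging it to MLflow."""
    if not partition_losses:
        raise ValueError("no partition metrics collected")
    return sum(partition_losses) / len(partition_losses)

# On a real cluster (illustrative; compute_loss is a hypothetical function):
#   losses = df.rdd.mapPartitions(compute_loss).collect()        # executors
#   mlflow.log_metric("loss", aggregate_partition_losses(losses))  # driver only
```

Keeping MLflow calls on the driver avoids executors needing tracking-server connectivity and prevents duplicate metric writes.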
Model Registry and Lifecycle
The MLflow Model Registry provides a centralized repository for managing model versions, stage transitions, and annotations.
Programmatic Model Management
Use the MlflowClient for CI/CD integration and automated pipeline orchestration.
from mlflow.tracking import MlflowClient

client = MlflowClient()
model_name = "production.fraud_detection"

# 1. Register a new model
client.create_registered_model(model_name)

# 2. Create a version from an existing run
run_id = "a1b2c3d4e5f6..."
result = client.create_model_version(
    name=model_name,
    source=f"runs:/{run_id}/model",
    run_id=run_id
)

# 3. Transition to Staging
client.transition_model_version_stage(
    name=model_name,
    version=result.version,
    stage="Staging"
)

# 4. Add Metadata
client.update_model_version(
    name=model_name,
    version=result.version,
    description="Random Forest classifier trained on Q3 data."
)
Batch Inference with Spark
One of the most powerful features of Spark/MLflow integration is the ability to run batch inference on large datasets using spark_udf. This mechanism broadcasts the model to all executors, allowing for parallel scoring.
Implementation using spark_udf
The mlflow.pyfunc.spark_udf function loads the model as a Spark User Defined Function (UDF).
import mlflow.pyfunc
from pyspark.sql.types import ArrayType, FloatType

# Load Production Model
model_name = "production.fraud_detection"
model_stage = "Production"
model_uri = f"models:/{model_name}/{model_stage}"

# Define the UDF
# The result_type depends on your model output (e.g., DoubleType, ArrayType)
predict_udf = mlflow.pyfunc.spark_udf(
    spark,
    model_uri,
    result_type=ArrayType(FloatType())
)

# Load Batch Data
df = spark.table("raw_transactions")

# Apply Inference
# Input columns must match the model's expected signature
predictions = df.withColumn("prediction", predict_udf("feature_1", "feature_2", "feature_3"))

# Write Results
predictions.write.format("delta").mode("append").save("/data/scored_transactions")
Performance Optimization
- Broadcasting: MLflow automatically broadcasts the model artifact to executors. Ensure executor memory overhead is sufficient for loading large models (e.g., Large LLMs or Deep Learning models).
- Vectorization: spark_udf utilizes Arrow for optimized serialization between the JVM and Python processes, significantly improving throughput compared to standard Python UDFs.
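The broadcasting concern above typically translates into executor memory settings. A hedged sketch of the relevant Spark properties follows; the values are illustrative placeholders to tune per workload, not Ilum defaults.

```python
# Illustrative Spark settings for scoring with large broadcast models.
# Values are placeholders, not Ilum defaults.
LARGE_MODEL_CONF = {
    # Off-heap headroom for the Python worker that loads the model
    "spark.executor.memoryOverhead": "2g",
    # Rows per Arrow batch handed to the Python UDF
    "spark.sql.execution.arrow.maxRecordsPerBatch": "5000",
}

def apply_conf(base_conf, overrides=LARGE_MODEL_CONF):
    """Merge overrides into an existing conf dict, e.g. before feeding
    the pairs to SparkSession.builder.config(key, value) calls."""
    merged = dict(base_conf)
    merged.update(overrides)
    return merged
```

If executors fail with out-of-memory errors only during model loading, raising memoryOverhead rather than executor heap is usually the right lever, since the model lives in the Python process.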
Artifact Storage & Access Control
S3 Backend Configuration
Ilum abstracts the S3 configuration. However, for advanced scenarios involving external buckets or cross-region replication, you can override the default storage endpoint.
Spark jobs deployed via Ilum automatically inherit the following configurations:
- spark.hadoop.fs.s3a.endpoint
- spark.hadoop.fs.s3a.access.key
- spark.hadoop.fs.s3a.secret.key
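For the external-bucket scenario, the inherited defaults can be overridden per job. The sketch below uses placeholder values (the endpoint and the path-style flag are assumptions for a non-MinIO backend) and renders them as spark-submit flags:

```python
# Per-job overrides for an external S3 bucket (placeholder values,
# not Ilum defaults).
S3_OVERRIDES = {
    "spark.hadoop.fs.s3a.endpoint": "https://s3.eu-central-1.amazonaws.com",
    "spark.hadoop.fs.s3a.path.style.access": "false",
}

def as_submit_flags(conf):
    """Render overrides as spark-submit --conf arguments."""
    return [f"--conf {key}={value}" for key, value in sorted(conf.items())]
```

Credentials should continue to come from the injected secrets rather than being hard-coded in job configuration.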
Access Governance
- Network Policies: Access to the MLflow tracking server is governed by Kubernetes NetworkPolicies. By default, only pods within the Ilum namespaces can communicate with the tracking service.
- Authorization: While MLflow Open Source does not provide granular RBAC, Ilum restricts access to the MLflow UI and API via the ingress gateway and platform-level authentication.
Troubleshooting
Common Issues
- Connection Refused: Verify the MLFLOW_TRACKING_URI is resolvable from within the Spark Driver pod.
- S3 Access Denied: Ensure the ServiceAccount associated with the Spark job has the necessary IAM roles or S3 policies attached.
- Missing Dependencies: If using a custom image, ensure mlflow and boto3 versions are compatible with the server version running in Ilum.
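The "Connection Refused" check can be scripted from inside the driver pod. The MLflow tracking server exposes a /health endpoint; the function below is a minimal standard-library sketch.

```python
import urllib.error
import urllib.request

def tracking_server_reachable(base_uri, timeout=5):
    """Return True if the MLflow tracking server answers on /health."""
    try:
        with urllib.request.urlopen(f"{base_uri}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

# Example (run from inside the cluster):
#   tracking_server_reachable("http://ilum-mlflow-tracking:5000")
```

If this returns False while the service exists, check NetworkPolicies and DNS resolution before suspecting the MLflow deployment itself.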