
MLflow Integration

System Architecture

Ilum integrates MLflow as a managed service within the Kubernetes cluster, providing a centralized control plane for MLOps workflows. The architecture consists of three primary components:

  1. Tracking Server: A centralized REST API server (ilum-mlflow-tracking) handling metadata and artifact requests.
  2. Backend Store: A PostgreSQL database managing entity metadata (runs, parameters, metrics, tags, and model registry data).
  3. Artifact Store: An S3-compatible object storage layer (defaulting to MinIO or external S3) for persisting large artifacts such as model binaries, plots, and datasets.

Network Topology

In the Ilum ecosystem, Spark Drivers and Executors interact with MLflow through internal Kubernetes networking.

  • Metadata Operations: The Spark Driver communicates directly with the Tracking Server via HTTP/REST.
  • Artifact Operations: Both Drivers and Executors interact with the S3 Artifact Store to upload/download model files. Credentials and endpoints are automatically injected into the Spark environment.
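As a sketch of what that injection looks like from inside a driver or executor pod: the variable names below follow the standard MLflow/boto3 conventions, and the fallback values are placeholders, not Ilum's actual defaults.

```python
import os

# Environment variables MLflow's boto3-based artifact client reads.
# Ilum injects these into the Spark pods; the fallbacks shown are
# placeholders for illustration only.
artifact_env = {
    "MLFLOW_S3_ENDPOINT_URL": os.environ.get("MLFLOW_S3_ENDPOINT_URL", "http://minio:9000"),
    "AWS_ACCESS_KEY_ID": os.environ.get("AWS_ACCESS_KEY_ID", "<injected-by-ilum>"),
    "AWS_SECRET_ACCESS_KEY": os.environ.get("AWS_SECRET_ACCESS_KEY", "<injected-by-ilum>"),
}
```

Because the injection happens on every pod, artifact uploads from executors work without any per-job credential handling.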

Configuration and Deployment

Helm Configuration

MLflow is enabled via the core Ilum Helm chart. The deployment ensures that the tracking server is co-located with the necessary storage configurations.

helm upgrade \
  --set mlflow.enabled=true \
  --reuse-values ilum ilum/ilum

This configuration provisions:

  • A Kubernetes Deployment for the MLflow server.
  • A ClusterIP Service exposing port 5000.
  • Auto-wired secrets for database and S3 access.

Spark Runtime Environment

To utilize MLflow features, Spark workloads must run in an environment containing the MLflow Python client and compatible dependencies. Ilum provides optimized Docker images for this purpose.

Default Image: ilum/spark:3.5.3-mlflow

This image includes:

  • mlflow (Python client)
  • boto3 (AWS SDK for S3 artifact access)
  • psycopg2 (PostgreSQL adapter)
  • Java dependencies for S3A file system support

Job Configuration

For custom Spark jobs or interactive sessions, define the image in the Kubernetes specification:

spec:
  containers:
    - name: spark-kubernetes-driver
      image: ilum/spark:3.5.3-mlflow

Experiment Tracking

Ilum pre-configures the environment variables required for tracking. The MLFLOW_TRACKING_URI is set to http://ilum-mlflow-tracking:5000 by default.
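A minimal way to read (or fall back to) that injected value from inside a job, assuming only the default service name stated above:

```python
import os

# Read the injected tracking URI, falling back to the in-cluster default.
tracking_uri = os.environ.get("MLFLOW_TRACKING_URI", "http://ilum-mlflow-tracking:5000")

# The MLflow client honours the environment variable automatically;
# calling mlflow.set_tracking_uri(tracking_uri) is only needed when
# pointing a job at a different server.
```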

Spark Autologging

For Spark MLlib workloads, mlflow.spark.autolog() captures datasource information, Spark configurations, and model metrics automatically.

from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler
import mlflow
import mlflow.spark

# Initialize Spark
spark = SparkSession.builder.appName("MLflow_Example").getOrCreate()

# Enable Spark Autologging
mlflow.spark.autolog()

# Prepare Data
data = spark.read.format("libsvm").load("sample_libsvm_data.txt")
train_data, test_data = data.randomSplit([0.7, 0.3])

# Define Experiment
mlflow.set_experiment("/Ilum/SparkExperiments")

with mlflow.start_run(run_name="LinearRegression_v1"):
    # Training
    lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
    model = lr.fit(train_data)

    # Autologging captures:
    # - Parameters: maxIter, regParam, elasticNetParam
    # - Metrics: RMSE, r2, mae
    # - Artifacts: The underlying Spark model

    # Custom Logging
    mlflow.log_param("custom_param", "value")

Distributed Context Considerations

When running distributed training (e.g., Spark MLlib, Horovod, or TorchDistributor):

  1. Driver Role: The Spark Driver handles the communication with the Tracking Server. It aggregates metrics and parameters before sending them to MLflow.
  2. Executor Role: Executors perform the heavy lifting. Logs generated on executors are not automatically sent to MLflow. Use spark.sparkContext.setLocalProperty to associate Spark jobs with MLflow runs for better traceability in the Spark UI.
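One way to implement point 2 is a small helper around the local-property call; the helper name below is ours, while "spark.job.description" is a standard Spark local property that the Spark UI displays in its job table.

```python
def tag_spark_jobs_with_run(spark_context, run_id: str) -> str:
    """Label Spark jobs submitted from this thread with an MLflow run id.

    Spark reads the "spark.job.description" local property and shows it
    in the UI's job table, making it easy to match Spark stages to the
    MLflow run that triggered them.
    """
    label = f"mlflow-run={run_id}"
    spark_context.setLocalProperty("spark.job.description", label)
    return label

# Typical usage inside a run (sketch):
# with mlflow.start_run() as run:
#     tag_spark_jobs_with_run(spark.sparkContext, run.info.run_id)
#     model = lr.fit(train_data)
```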

Model Registry and Lifecycle

The MLflow Model Registry provides a centralized repository for managing model versions, stage transitions, and annotations.

Programmatic Model Management

Use the MlflowClient for CI/CD integration and automated pipeline orchestration.

from mlflow.tracking import MlflowClient

client = MlflowClient()
model_name = "production.fraud_detection"

# 1. Register a new model
client.create_registered_model(model_name)

# 2. Create a version from an existing run
run_id = "a1b2c3d4e5f6..."
result = client.create_model_version(
    name=model_name,
    source=f"runs:/{run_id}/model",
    run_id=run_id
)

# 3. Transition to Staging
client.transition_model_version_stage(
    name=model_name,
    version=result.version,
    stage="Staging"
)

# 4. Add Metadata
client.update_model_version(
    name=model_name,
    version=result.version,
    description="Random Forest classifier trained on Q3 data."
)

Batch Inference with Spark

One of the most powerful features of Spark/MLflow integration is the ability to run batch inference on large datasets using spark_udf. This mechanism broadcasts the model to all executors, allowing for parallel scoring.

Implementation using spark_udf

The mlflow.pyfunc.spark_udf function loads the model as a Spark User Defined Function (UDF).

import mlflow.pyfunc
from pyspark.sql.types import ArrayType, FloatType

# Load Production Model
model_name = "production.fraud_detection"
model_stage = "Production"
model_uri = f"models:/{model_name}/{model_stage}"

# Define the UDF
# The result_type depends on your model output (e.g., DoubleType, ArrayType)
predict_udf = mlflow.pyfunc.spark_udf(
    spark,
    model_uri,
    result_type=ArrayType(FloatType())
)

# Load Batch Data
df = spark.table("raw_transactions")

# Apply Inference
# Input columns must match the model's expected signature
predictions = df.withColumn("prediction", predict_udf("feature_1", "feature_2", "feature_3"))

# Write Results
predictions.write.format("delta").mode("append").save("/data/scored_transactions")

Performance Optimization

  • Broadcasting: MLflow automatically broadcasts the model artifact to executors. Ensure executor memory overhead is sufficient for loading large models (e.g., Large LLMs or Deep Learning models).
  • Vectorization: spark_udf utilizes Arrow for optimized serialization between JVM and Python processes, significantly improving throughput compared to standard Python UDFs.
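The two points above map onto a handful of Spark settings; the values below are illustrative starting points for spark_udf scoring, not Ilum defaults or recommendations.

```python
# Runtime settings relevant to spark_udf batch scoring (values illustrative):
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")     # Arrow-based JVM<->Python serialization
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "5000")  # smaller batches for memory-heavy models

# Memory headroom for the model outside the JVM heap is set at submit time:
#   --conf spark.executor.memoryOverhead=4g
```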

Artifact Storage & Access Control

S3 Backend Configuration

Ilum abstracts the S3 configuration. However, for advanced scenarios involving external buckets or cross-region replication, you can override the default storage endpoint.

Spark jobs deployed via Ilum automatically inherit the following configurations:

  • spark.hadoop.fs.s3a.endpoint
  • spark.hadoop.fs.s3a.access.key
  • spark.hadoop.fs.s3a.secret.key
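Overriding the inherited values for an external bucket means setting the same keys explicitly. The endpoint and credentials below are placeholders; note that S3A settings must be in place before the session (and its Hadoop filesystem) is created.

```python
from pyspark.sql import SparkSession

# Sketch: pointing S3A at an external bucket instead of the in-cluster
# object store. All values below are placeholders.
spark = (
    SparkSession.builder
    .appName("external-s3-example")
    .config("spark.hadoop.fs.s3a.endpoint", "https://s3.eu-central-1.amazonaws.com")
    .config("spark.hadoop.fs.s3a.access.key", "<external-access-key>")
    .config("spark.hadoop.fs.s3a.secret.key", "<external-secret-key>")
    .getOrCreate()
)
```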

Access Governance

  • Network Policies: Access to the MLflow tracking server is governed by Kubernetes NetworkPolicies. By default, only pods within the Ilum namespaces can communicate with the tracking service.
  • Authorization: While MLflow Open Source does not provide granular RBAC, Ilum restricts access to the MLflow UI and API via the ingress gateway and platform-level authentication.

Troubleshooting

Common Issues

  1. Connection Refused: Verify the MLFLOW_TRACKING_URI is resolvable from within the Spark Driver pod.
  2. S3 Access Denied: Ensure the ServiceAccount associated with the Spark job has the necessary IAM roles or S3 policies attached.
  3. Missing Dependencies: If using a custom image, ensure the mlflow and boto3 versions are compatible with the server version running in Ilum.
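For the first issue, a quick stdlib-only probe from inside the driver pod can rule out DNS and NetworkPolicy problems; the /health endpoint is served by the MLflow tracking server.

```python
import urllib.request
import urllib.error

def check_tracking_server(uri: str = "http://ilum-mlflow-tracking:5000") -> bool:
    """Return True if the MLflow tracking server answers on /health."""
    try:
        with urllib.request.urlopen(f"{uri}/health", timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

Run it from a spark-shell or notebook inside the cluster; a False result with the default URI points at DNS resolution, the NetworkPolicy, or a disabled MLflow deployment.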