Apache Spark

Apache Spark is the default execution engine for distributed data processing in Ilum. It runs on Kubernetes (with native CRD-based pod orchestration) or Apache Hadoop YARN, and is exposed through batch jobs, interactive sessions, in-app SQL notebooks, and the Apache Kyuubi SQL gateway.

Ilum bundles Apache Spark 4.x by default, with Spark 3.x available for legacy workloads.

When to use Spark

Spark is the right engine for:

Large-scale ETL and data transformation pipelines.
Machine learning workloads using Spark ML or MLlib.
Complex joins and aggregations across large datasets.
Streaming workloads with Spark Structured Streaming.
Workloads that benefit from horizontal scaling across many executors.

For interactive analytics on medium-to-large data, consider Trino. For small-data and local execution, consider DuckDB. For low-latency stream processing, consider Apache Flink.

Execution model

Spark runs as a driver and a configurable number of executors:

Driver pod: One per job. Coordinates execution, holds the Spark session, and tracks task state.
Executor pods: Provisioned dynamically based on workload. Run individual tasks in parallel and hold cached data.

Ilum manages the full pod lifecycle, including image selection, resource limits, dynamic allocation, and cleanup on completion.
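
To make "provisioned dynamically" concrete, the sketch below approximates how Spark's dynamic allocation sizes the executor pool: enough executors to run every pending and running task, clamped to the configured bounds. It is a simplification that ignores `spark.dynamicAllocation.executorAllocationRatio` and the backlog timeouts, and the function name and default values are illustrative rather than part of Spark or Ilum.

```python
import math


def target_executors(pending_tasks: int, running_tasks: int,
                     executor_cores: int = 4, task_cpus: int = 1,
                     min_executors: int = 1, max_executors: int = 20) -> int:
    """Approximate the executor count requested under dynamic allocation."""
    if task_cpus < 1 or executor_cores < task_cpus:
        raise ValueError("executor_cores must be >= task_cpus >= 1")
    # Each executor can run this many tasks at the same time.
    slots_per_executor = executor_cores // task_cpus
    needed = math.ceil((pending_tasks + running_tasks) / slots_per_executor)
    # Clamp to spark.dynamicAllocation.minExecutors / maxExecutors.
    return max(min_executors, min(needed, max_executors))
```

With the defaults shown, 12 outstanding tasks on 4-core executors call for 3 executors, while 100 outstanding tasks are capped at the maximum of 20.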

Workload types

Spark powers four kinds of workloads in Ilum:

Jobs: One-shot batch executions.
Services: Long-running interactive Spark sessions that execute code on demand without per-call initialization overhead.
Schedules: Cron-driven recurring jobs.
Requests: Ad-hoc submissions through the REST API or UI.

All four are managed through the Workloads section of the Ilum UI.

Supported catalogs

Spark connects to all four Ilum catalogs:

Hive Metastore (default)
Project Nessie (Iceberg with Git-style branching)
Unity Catalog (Databricks-compatible governance)
DuckLake (DuckDB-native, primarily used by DuckDB)
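
As an illustration of what a catalog connection looks like at the Spark level, the sketch below builds the settings that Apache Iceberg documents for a Nessie-backed catalog. Ilum may already apply equivalent settings for you; the helper name and the catalog name, URI, and warehouse passed to it are hypothetical placeholders.

```python
def nessie_catalog_conf(name: str, uri: str, warehouse: str, ref: str = "main") -> dict:
    """Spark settings for an Iceberg catalog backed by Project Nessie."""
    prefix = f"spark.sql.catalog.{name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
        f"{prefix}.uri": uri,              # Nessie REST endpoint (placeholder)
        f"{prefix}.ref": ref,              # branch or tag to read and write
        f"{prefix}.warehouse": warehouse,  # storage location for table data
    }
```

The `ref` setting is what exposes Git-style branching: pointing it at a different branch makes the same catalog name resolve to that branch's table state.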

Supported table formats

Spark reads and writes:

Delta Lake: ACID transactions, time travel, schema evolution.
Apache Iceberg: Partition evolution, hidden partitioning.
Apache Hudi: Record-level upserts, incremental processing.
Parquet, ORC, CSV, JSON, Avro: Standard file formats.

The Ilum Tables abstraction lets you read and write Delta, Iceberg, and Hudi tables using the same Spark API.
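
The following sketch uses the plain Spark DataFrame reader and writer (not the Ilum abstraction itself, whose API is not described here) to show how a single code path can serve several formats: only the format name changes. It assumes the matching connector is on the classpath. Hudi writes typically need extra options such as `hoodie.table.name`, and Iceberg tables are more commonly addressed through a catalog than by path. The helper names are illustrative.

```python
from typing import Optional

SUPPORTED_FORMATS = {"delta", "iceberg", "hudi", "parquet", "orc", "csv", "json", "avro"}


def write_table(df, table_format: str, path: str, mode: str = "append",
                options: Optional[dict] = None) -> None:
    """Write a Spark DataFrame to `path` in the given format."""
    if table_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {table_format}")
    # Only the format name (and any format-specific options) differs.
    df.write.format(table_format).mode(mode).options(**(options or {})).save(path)


def read_table(spark, table_format: str, path: str):
    """Read `path` back as a DataFrame using the same format name."""
    if table_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {table_format}")
    return spark.read.format(table_format).load(path)
```

For example, `write_table(df, "delta", path)` and `write_table(df, "parquet", path)` differ only in the format argument.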

Configuration

Spark configuration is managed through Helm values and per-cluster settings:

ilum-core:
  spark:
    enabled: true
  cluster:
    defaults:
      spark.dynamicAllocation.enabled: "true"
      spark.dynamicAllocation.minExecutors: "1"
      spark.dynamicAllocation.maxExecutors: "20"
      spark.dynamicAllocation.executorIdleTimeout: "60s"

Per-cluster overrides are configured in the Workloads > Clusters UI and apply to all Spark jobs targeting that cluster.
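
The layering of defaults and overrides can be pictured as a simple merge. The sketch below assumes that per-cluster overrides take precedence over the Helm defaults, which the text implies but does not state. It then renders the result as `spark-submit` style `--conf` arguments. The function names are illustrative and not part of Ilum.

```python
CLUSTER_DEFAULTS = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "20",
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
}


def effective_conf(defaults: dict, overrides: dict) -> dict:
    """Merge settings; the more specific (override) values win."""
    merged = dict(defaults)
    merged.update(overrides)
    return merged


def to_submit_args(conf: dict) -> list:
    """Render settings as spark-submit `--conf key=value` arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args
```

For example, overriding only `spark.dynamicAllocation.maxExecutors` for one cluster leaves the other three defaults unchanged.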

Spark Connect

Spark Connect provides a client-server architecture for remote Spark execution. Ilum deploys Spark Connect servers as standard jobs and includes a Kubernetes-aware proxy that allows Spark Connect endpoints to be reached across cluster boundaries.
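
From the client side, connecting to a Spark Connect endpoint looks roughly like the sketch below. It assumes PySpark 3.4 or later with the Spark Connect client installed, and an endpoint listening on the default Spark Connect port 15002. The helper names are illustrative, and the host you pass in depends on how your deployment exposes the server.

```python
def connect_url(host: str, port: int = 15002) -> str:
    """Build a Spark Connect URL (15002 is the default server port)."""
    return f"sc://{host}:{port}"


def remote_session(host: str, port: int = 15002):
    """Open a remote SparkSession against a Spark Connect endpoint."""
    # Imported lazily so connect_url() remains usable without PySpark installed.
    from pyspark.sql import SparkSession
    return SparkSession.builder.remote(connect_url(host, port)).getOrCreate()
```

The returned session behaves like a local one for DataFrame and SQL operations, while the work itself runs on the remote driver and executors.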

Refer to Spark Connect for details.

Submitting a Spark job

For a step-by-step walkthrough, refer to Run a simple Spark job.
