Run dbt Core on Spark (Kubernetes)

This guide explains how to set up dbt Core with Apache Spark running on a Kubernetes cluster. Using Ilum as the execution engine, you can run scalable data transformation pipelines directly on your data lake.

You have two primary ways to connect dbt to Spark on Ilum:

Thrift Server vs. Spark Connect

Feature	Method 1: Spark Thrift Server (Legacy)	Method 2: Spark Connect (Modern)
Protocol	JDBC/ODBC (via HiveDriver)	gRPC (via Spark Connect)
Connection Type	`method: thrift`	`method: session`
Architecture	Requires a dedicated Thrift Server pod	Connects directly to Spark Driver
Performance	Higher latency (row-based serialization)	High performance (Arrow-based)
Best For	BI Tools (Tableau, PowerBI), Legacy apps	Data Engineering, Python/dbt pipelines

New to Spark Connect?

For a deep dive into the architecture, check out our Spark Connect on Kubernetes Guide.

Prerequisites

Before starting, ensure your development environment is ready:

Kubernetes Cluster: You need a running K8s cluster (GKE, EKS, AKS, or Minikube).
Tools:
- Helm (for deploying Ilum).
- kubectl (configured to access your cluster).
- Python 3.8+ (for running dbt Core).
Knowledge: Basic understanding of dbt projects and Spark concepts.

How to Configure dbt with Spark on Kubernetes

Choose your preferred connection method:

Method 1: Thrift Server
Method 2: Spark Connect

Step 1: Deploy Spark Thrift Server

Deploy Ilum with the SQL module (acting as a scalable Thrift server) and Hive Metastore enabled:

Helm Install
helm repo add ilum https://charts.ilum.cloud
helm install ilum ilum/ilum \
  --set ilum-hive-metastore.enabled=true \
  --set ilum-core.metastore.enabled=true \
  --set ilum-core.metastore.type=hive \
  --set ilum-sql.enabled=true \
  --set ilum-core.sql.enabled=true

Step 2: Connect to the Thrift Service

1. Identify the service:

Get Service

kubectl get service

Look for the service with "sql-thrift-binary" in its name.

2. Port-forward:

Port Forward
kubectl port-forward svc/ilum-sql-thrift-binary 10009:10009

This makes the Thrift server available at localhost:10009.
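
Before moving on, it can help to confirm that the forwarded port is actually listening. The following is a minimal sketch, not part of Ilum or dbt: the helper name `is_port_open` is hypothetical, and it assumes the port-forward from the previous step is running on localhost:10009.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


if __name__ == "__main__":
    # 10009 is the local port used by the port-forward command above.
    print("Thrift port reachable:", is_port_open("localhost", 10009))
```

A successful TCP connect only shows that something is listening on the port. It does not prove the Thrift server answers queries, so the Beeline test below is still worthwhile.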

3. Test with Beeline (optional):

beeline -u "jdbc:hive2://localhost:10009/default"

Run:

SHOW TABLES;

Expect an empty list or existing tables.

Configuring and Running dbt

1. Clean Environment (if needed):

Uninstall Conflict

pip uninstall dbt-spark pyspark -y

2. Install dbt and dependencies:

Install dbt-spark
pip install pyspark==3.5.8
pip install dbt-core
pip install "dbt-spark[PyHive,session]"
pip install --upgrade thrift

3. Verify installation:

Verify dbt

dbt --version
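
If `dbt --version` reports something unexpected, listing the installed distributions from Python can narrow it down. This is a small sketch using only the standard library; the function name `installed_versions` is hypothetical, and the package list simply mirrors the pip commands above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Distribution names taken from the pip commands above.
    found = installed_versions(["dbt-core", "dbt-spark", "pyspark", "thrift"])
    for name, version in found.items():
        print(f"{name}: {version or 'not installed'}")
```

Run it with the same Python interpreter (or virtual environment) that you use for dbt, otherwise the result describes a different environment.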

Create dbt Project

1. Initialize a dbt project:

Init Project

dbt init ilum_dbt_project

2. Answer the setup prompts:

Setup Prompts
Which database? 1 (spark)
host: localhost
Desired authentication method: 3 (thrift)
port: 10009
schema: default
threads: 1

This creates the ilum_dbt_project directory and a profiles.yml file in ~/.dbt/.

Configure dbt for Ilum

Edit ~/.dbt/profiles.yml to include both Thrift and Spark Connect targets:

~/.dbt/profiles.yml
ilum_dbt_project:
  target: thrift  # Default target
  outputs:
    thrift:
      type: spark
      method: thrift
      host: localhost
      port: 10009
      schema: default
      threads: 1
      connect_retries: 5
      connect_timeout: 60
      connect_args:
        url: "jdbc:hive2://localhost:10009/default;transportMode=binary;hive.server2.transport.mode=binary"
        driver: "org.apache.hive.jdbc.HiveDriver"
        auth: "NONE"

    spark_connect:
      type: spark
      method: session
      host: localhost
      port: 15002
      schema: default
      threads: 1
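
The JDBC URL in `connect_args` follows the pattern `jdbc:hive2://host:port/schema` followed by `;key=value` session parameters. As an illustration only, the sketch below assembles that string from its parts; `hive_jdbc_url` is a hypothetical helper, not a dbt or Hive API.

```python
def hive_jdbc_url(host, port, schema, session_params=None):
    """Assemble a HiveServer2 JDBC URL: jdbc:hive2://host:port/schema;k=v;..."""
    url = f"jdbc:hive2://{host}:{port}/{schema}"
    # Parameters are appended in insertion order, each prefixed with ';'.
    for key, value in (session_params or {}).items():
        url += f";{key}={value}"
    return url


if __name__ == "__main__":
    print(
        hive_jdbc_url(
            "localhost",
            10009,
            "default",
            {"transportMode": "binary", "hive.server2.transport.mode": "binary"},
        )
    )
```

Without session parameters the function returns the same URL that the Beeline test uses, which makes it easy to keep the two in sync if you change the host or port.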

Switch between targets:

Run dbt
# Use Thrift (default)
dbt run

# Use Spark Connect
dbt run --target spark_connect

# Or change the default by editing the top-level "target:" key in ~/.dbt/profiles.yml
# target: spark_connect
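
When dbt is invoked from a script or CI job, the target switch can be made explicit in code. The sketch below only builds the argument list; `dbt_command` is a hypothetical helper, and it uses just the `--target` and `--select` flags that appear in this guide.

```python
import shlex


def dbt_command(subcommand, target=None, select=None):
    """Build a dbt CLI argument list; omitted options fall back to dbt defaults."""
    cmd = ["dbt", subcommand]
    if select:
        cmd += ["--select", select]
    if target:
        cmd += ["--target", target]
    return cmd


if __name__ == "__main__":
    # Print the commands for review. To execute one, pass the list to
    # subprocess.run(cmd, check=True) from inside the dbt project directory.
    for target in (None, "spark_connect"):
        print(shlex.join(dbt_command("run", target=target)))
```

Leaving `target` unset keeps whatever default is configured in profiles.yml, so the same script works for both connection methods.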

Test the connection:
Debug dbt
```
cd ilum_dbt_project
dbt debug
```
Ensure no errors appear, indicating a successful connection to the Thrift server.

Create a Model to Write Data

Create a model:

models/sample_data.sql
{{ config(materialized='table') }}

SELECT
  id,
  name
FROM (
  VALUES
    (1, 'Alice'),
    (2, 'Bob')
) AS t(id, name)

Run the model:
Run sample_data
```
dbt run --select sample_data
```

Create a Model to Read Data

Create a model:

models/read_data.sql
{{ config(materialized='table') }}

SELECT
  id,
  name,
  length(name) AS name_length
FROM {{ ref('sample_data') }}

Run the model:
Run read_data
```
dbt run --select read_data
```

Verify Results

1. Monitor Job in Ilum UI:

Access the Ilum UI (URL provided in your Ilum setup, e.g. port-forward)
Navigate to the Jobs section
Look for the job named ilum-sql-spark-engine

Check job status, logs, and execution details to confirm successful processing

2. Query with Beeline:

beeline -u "jdbc:hive2://localhost:10009/default"

3. Run query:

SELECT * FROM default.read_data;

Expected output:

+----+-------+-------------+
| id | name  | name_length |
+----+-------+-------------+
| 1  | Alice | 5           |
| 2  | Bob   | 3           |
+----+-------+-------------+
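
To see where the `name_length` values come from, the transformation can be mirrored in plain Python. This is only an illustration of the logic, assuming the two sample rows are (1, Alice) and (2, Bob); it does not touch Spark or dbt.

```python
# Rows defined in models/sample_data.sql.
sample_data = [(1, "Alice"), (2, "Bob")]


def read_data(rows):
    """Mirror the read_data model: append the character length of each name."""
    return [(row_id, name, len(name)) for row_id, name in rows]


if __name__ == "__main__":
    for row in read_data(sample_data):
        print(row)
```

If the Beeline result differs from these values, the read_data model was probably built before sample_data was refreshed, and rerunning both models should bring them back in line.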

Spark Connect is the recommended way for modern data engineering teams to run dbt on Kubernetes. It removes the need for a separate intermediate Thrift Server, which reduces cost and complexity.

Step 1: Deploy Spark Connect Job

Log in to the Ilum UI
Navigate to the Workloads → Jobs section
Click the "New Job" button
Configure the job:
- Name: spark-connect-dbt
- Job Type: Select Spark Connect Job
Add Spark Connect dependency (if needed):

Most Spark distributions don't include Spark Connect by default. You'll need to add it as a package dependency.
- Click the Configuration tab
- In the Parameters section, click Add Parameter
- Add the following parameter:
Key Value
spark.jars.packages org.apache.spark:spark-connect_2.12:3.5.8

Note
Replace 2.12 with your Scala version and 3.5.8 with your Spark version to match your environment.
Click Submit

The server starts successfully when you see this in the logs:

Spark Connect server started at: 0:0:0:0:0:0:0:0%0:15002
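
If you collect the job logs as text, the readiness line can be detected programmatically. The sketch below assumes the log line has exactly the format quoted above; the pattern and the `spark_connect_port` helper are illustrative, not an Ilum or Spark API.

```python
import re

# Matches the startup line quoted above and captures the trailing port number.
STARTED = re.compile(r"Spark Connect server started at: \S*:(\d+)")


def spark_connect_port(log_text):
    """Return the port from the first startup line in log_text, or None."""
    match = STARTED.search(log_text)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    line = "Spark Connect server started at: 0:0:0:0:0:0:0:0%0:15002"
    print(spark_connect_port(line))
```

A return value of `None` means the startup line has not appeared yet, so a caller would typically keep polling the logs until a port is returned.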

Connect to the Spark Connect Server

Get the Connection URL

After the job starts, Ilum provides a Spark Connect URL on the job details page.

The URL format is: sc://job-xxxxx-driver-svc:15002

Port-Forward for Local Access

To connect from your local machine, forward the driver pod's port:

Find the driver pod name from the Logs tab in Ilum UI

Example: If the URL is sc://job-20250807-1557-ablr2a52vxd-driver-svc:15002,
the pod name is job-20250807-1557-ablr2a52vxd-driver (remove the -svc suffix)
Port-forward:
Port Forward
```
kubectl port-forward <driver-pod-name> 15002:15002
```
Keep this terminal window open.
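
The rule for deriving the pod name from the Spark Connect URL (drop the `-svc` suffix from the host) is easy to script. This is a minimal sketch under that naming assumption; `driver_pod_and_port` is a hypothetical helper and the URL is the example from this section.

```python
from urllib.parse import urlsplit


def driver_pod_and_port(spark_connect_url):
    """Derive the driver pod name and port from an sc:// service URL."""
    parts = urlsplit(spark_connect_url)
    if parts.scheme != "sc" or not parts.hostname or parts.port is None:
        raise ValueError(f"not a Spark Connect URL: {spark_connect_url!r}")
    host = parts.hostname
    # Assumption from this guide: the service is the pod name plus "-svc".
    pod = host[: -len("-svc")] if host.endswith("-svc") else host
    return pod, parts.port


if __name__ == "__main__":
    pod, port = driver_pod_and_port(
        "sc://job-20250807-1557-ablr2a52vxd-driver-svc:15002"
    )
    print(f"kubectl port-forward {pod} {port}:{port}")
```

The script prints the port-forward command instead of running it, so you can check the pod name against the Ilum UI first.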

Create dbt Project

Initialize a dbt project (if needed):

Init Connect Project

dbt init ilum_dbt_spark_connect_project

Answer the setup prompts:

Setup Prompts
Which database? 1 (spark)
host: localhost
Desired authentication method: 4 (session) # or 3 if you can't see session
port: 15002
schema: default
threads: 1

This creates the ilum_dbt_spark_connect_project directory and updates ~/.dbt/profiles.yml.

Configure dbt for Spark Connect

If you followed the Thrift setup above, your ~/.dbt/profiles.yml already has both targets configured. You can use the same ilum_dbt_project profile.

To use Spark Connect, simply specify the target:

Run Connect
cd ilum_dbt_project  # Use the same project as Thrift
dbt debug --target spark_connect
dbt run --target spark_connect

Or create a separate project (if you prefer isolation):

Edit ~/.dbt/profiles.yml:

~/.dbt/profiles.yml
ilum_dbt_spark_connect_project:
  target: dev
  outputs:
    dev:
      type: spark
      method: session
      host: localhost
      port: 15002
      schema: default
      threads: 1

Test the connection:

Debug Connect
cd ..
cd ilum_dbt_spark_connect_project
dbt debug

Tip

Recommended approach: Use one dbt project with multiple targets (as shown in the Thrift section). This allows you to switch between Thrift and Spark Connect without maintaining separate projects.

You should see successful connection messages.

Create a Model to Write Data

Create a model:

models/sample_data_connect.sql
{{ config(materialized='table') }}

SELECT
  id,
  name
FROM (
  VALUES
    (1, 'Peter'),
    (2, 'John')
) AS t(id, name)

Run the model:

Run sample_data_connect
dbt run --select sample_data_connect --target spark_connect

Create a Model to Read Data

Create a model:

models/read_data_connect.sql
{{ config(materialized='table') }}

SELECT
  id,
  name,
  length(name) AS name_length
FROM {{ ref('sample_data_connect') }}

Run the model:
Run read_data_connect
```
dbt run --select read_data_connect --target spark_connect
```
Note
The --target spark_connect flag ensures dbt uses the Spark Connect configuration instead of the default Thrift target.

Verify Results

Monitor the job in the Ilum UI:
- Access the Ilum UI (URL provided in your Ilum setup, e.g. port-forward).
- Navigate to the Jobs section.
- Look for the job named spark-connect-dbt (the name set in Step 1).
- Check job status, logs, and execution details to confirm successful processing.
Print data in a dbt job: To verify that the data reached the Spark warehouse (for example, spark-warehouse/read_data_connect relative to your project directory), create a dbt macro and run a custom operation that queries the read_data_connect table and prints its contents during the dbt job.

Create a macro file in your dbt project directory:
- macros/print_table.sql:
  macros/print_table.sql
```
{% macro print_table(table_name) %}
  {% set query %}
    SELECT * FROM {{ ref(table_name) }}
  {% endset %}
  {% do log("Printing table contents for " ~ table_name ~ ':', True) %}
  {% set results = run_query(query) %}
  {% if results %}
    {% for row in results %}
      {% do log(row, True) %}
    {% endfor %}
  {% else %}
    {% do log("No data found in " ~ table_name, True) %}
  {% endif %}
{% endmacro %}
```
  Run the macro to print the read_data_connect table after your dbt models have run:
  Run Macro
```
dbt run-operation print_table --args '{"table_name": "read_data_connect"}'
```
  The dbt run-operation command executes the macro, querying the read_data_connect table and logging its contents. Expected output in the dbt logs or console:
```
Printing table contents for read_data_connect:
<agate.Row: (2, 'John', 4)>
<agate.Row: (1, 'Peter', 5)>
```
  Note
  The output appears in the dbt logs or console by default in dbt 1.9.4. For more detailed logs, you can use:
  Debug Macro
  dbt run-operation print_table --args '{"table_name": "read_data_connect"}' --log-level debug

Troubleshooting dbt-spark Connections

Common issues when connecting dbt to Spark on Kubernetes:

Error: "ThriftTransportException: Could not connect to localhost:10009"

Cause: The port forwarding tunnel is down or the Thrift Server pod is not running. Solution:

Check if the Thrift pod is running: kubectl get pods -l app.kubernetes.io/name=ilum-sql
Restart port-forwarding: kubectl port-forward svc/ilum-sql-thrift-binary 10009:10009

Error: "grpc._channel._InactiveRpcError: failed to connect to all addresses"

Cause: Your local dbt client cannot reach the Spark Connect gRPC port (15002). Solution:

Ensure you have port-forwarded the Driver Pod, not the Service (unless using NodePort).
Verify you are using method: session in profiles.yml.

Error: "AnalysisException: Table or view not found"

Cause: Hive Metastore connectivity issue. Solution:

Ensure ilum-core.metastore.enabled=true was set during Helm install.
Check if the schema (database) exists in Spark: spark.sql("SHOW DATABASES").show()
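
For scripted runs, the three cases above can be folded into a small lookup that turns an error message into a first thing to check. This is only a sketch: the fragments and hints are copied from this section, and `hint_for` is a hypothetical helper rather than part of dbt.

```python
# Error fragments and hints taken from the troubleshooting entries above.
HINTS = [
    ("Could not connect to localhost:10009",
     "Check the Thrift pod and restart the port-forward on 10009."),
    ("failed to connect to all addresses",
     "Port-forward the driver pod on 15002 and use method: session."),
    ("Table or view not found",
     "Check that the Hive Metastore is enabled and the schema exists."),
]


def hint_for(error_text):
    """Return the first hint whose fragment occurs in error_text, else None."""
    for fragment, hint in HINTS:
        if fragment in error_text:
            return hint
    return None


if __name__ == "__main__":
    print(hint_for("grpc._channel._InactiveRpcError: failed to connect to all addresses"))
```

Matching on substrings keeps the helper tolerant of the surrounding stack trace, at the cost of missing errors that are worded differently in other dbt-spark versions.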

Orchestration

For production orchestration using Apache Airflow, see the dedicated guide: Orchestrate dbt with Airflow
