Nessie Catalog

Overview

Project Nessie is an open-source transactional data catalog that brings Git-like version control to data lakes. It lets you manage multiple versions of your data using branches, tags, and commits, much as Git manages source code.

With Ilum’s integration, you can leverage Nessie’s version control features directly in your Spark environment. This allows you to branch, tag, and merge data changes safely and efficiently.

Unlike traditional Hive or Glue catalogs, which only track the latest state of each table, Nessie records every change as a commit in a timeline. Each commit represents a consistent snapshot of your data lake. Changes are isolated until committed, so incomplete or in-progress updates are never visible to other users or jobs. Once finalized, changes become visible atomically, which guarantees consistency.

Key Features: Nessie vs. Traditional Catalogs

Feature	Traditional Catalogs	Nessie Catalog
Branching	No	Yes (Git-like)
Isolated Environments	Manual/Complex	Simple, via branches
Commit History & Time Travel	Limited/Per-table	Full catalog history
Multi-table Transactions	No	Yes (atomic commits)
Collaboration & Governance	Minimal	Built-in, audit log

Highlights

Branching: Create multiple isolated branches (e.g., main, dev, staging) without duplicating data. Branches are lightweight pointers to metadata snapshots.
Isolated Environments: Use the same data lake for dev, staging, and prod by isolating changes in branches. No need for separate catalogs or data copies.
Commit History & Time Travel: Nessie maintains a unified commit log. Inspect, audit, or time-travel to any previous state by commit hash or timestamp.
Atomic Multi-table Transactions: Commit changes across multiple tables as a single atomic operation. All succeed or none do.
Collaboration & Governance: Work on separate branches, merge changes, and track who changed what and when. Enables safe experimentation and robust auditability.

Core Concepts

Branches

A branch is an independent line of development for your data catalog. Branches start as copies of existing branches and track changes separately. They are lightweight, referencing the same data files but different metadata. The default branch is usually main.

Tags

A tag is a read-only label pointing to a specific commit. Use tags to mark stable versions or important milestones (e.g., v1.0, release-2025-06). Tags are immutable bookmarks.

Commits

A commit is a set of changes recorded as a single atomic unit. Each commit has a unique ID, timestamp, author, and optional message. The commit log provides full catalog versioning.
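
To make these three concepts concrete, here is a toy in-memory model. It is not Nessie's implementation or API: the names ToyCatalog, Commit, create_branch, create_tag, and commit are hypothetical. It only illustrates that branches and tags are named pointers to commits, that creating a branch copies no data, and that one commit can change several tables at once.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class Commit:
    """One immutable catalog snapshot: table name -> metadata location."""
    commit_id: int
    parent: Optional["Commit"]
    tables: Dict[str, str]
    message: str = ""


@dataclass
class ToyCatalog:
    """Hypothetical illustration only; not the Nessie API."""
    branches: Dict[str, Commit] = field(default_factory=dict)
    tags: Dict[str, Commit] = field(default_factory=dict)
    _next_id: int = 1

    def __post_init__(self) -> None:
        # The default branch starts at an empty root commit.
        self.branches["main"] = self._new_commit(None, {}, "root")

    def _new_commit(self, parent, tables, message) -> Commit:
        commit = Commit(self._next_id, parent, dict(tables), message)
        self._next_id += 1
        return commit

    def create_branch(self, name: str, source: str = "main") -> None:
        # A new branch is just another pointer to the same commit.
        self.branches[name] = self.branches[source]

    def create_tag(self, name: str, source: str = "main") -> None:
        # A tag points at a commit and is not moved afterwards.
        self.tags[name] = self.branches[source]

    def commit(self, branch: str, changes: Dict[str, str], message: str = "") -> Commit:
        # All table changes land together in one new snapshot.
        head = self.branches[branch]
        new_head = self._new_commit(head, {**head.tables, **changes}, message)
        self.branches[branch] = new_head
        return new_head


catalog = ToyCatalog()
catalog.create_branch("dev")
catalog.commit("dev", {"sales": "meta-1", "users": "meta-1"}, "add two tables")
catalog.create_tag("v1.0", source="dev")
```

After this runs, main still points at the empty root commit while dev and the v1.0 tag point at the new two-table snapshot, which is the isolation property described above.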

Using Nessie in Ilum

Warning

Nessie is not enabled by default in Ilum. To enable it, see the production page.

Ilum supports Project Nessie as a catalog for version-controlled data management. When using Ilum notebooks or Spark jobs with Apache Iceberg, Git-like operations (branching, merging, tagging) can be performed directly via SQL.

There are two ways to wire Nessie into a cluster: the chart-managed flow, recommended for Helm deployments, and the manual flow, for self-managed Spark images or clusters configured outside the chart.

Note

Nessie SQL operations (USE BRANCH, CREATE BRANCH, MERGE BRANCH, and the rest of the walkthrough below) are supported on both the Spark 3.5 and Spark 4.x image lines. Use the matching ilum/spark:-nessie image, for example ilum/spark:3.5.8-nessie or ilum/spark:4.1.2-nessie. The Spark 4.x image ships nessie-spark-extensions-4.0_2.13, which is ABI-compatible with the Spark 4.1 line (no native 4.1 build is published yet), mirroring the iceberg-spark-runtime-4.0_2.13 it pairs with.

Chart-managed configuration (recommended)

When Ilum is deployed with the Helm chart, the Nessie catalog is configured declaratively. Enable the metastore and set its type to nessie in the ilum-core values (the example uses the umbrella-chart key; in the standalone helm_core chart the same keys live at the top level):

ilum-core:
  metastore:
    enabled: true
    type: nessie
    nessie:
      address: http://ilum-nessie:19120
      warehouseDir: s3a://ilum-data/nessie_catalog/
      s3Endpoint: http://ilum-objectstorage:9000/
      s3PathStyleAccess: true
      ref: main
      catalogName: nessie_catalog
nessie:
  enabled: true
The Iceberg + Nessie catalog wiring, that is spark.sql.extensions, the SparkCatalog provider, catalog-impl (NessieCatalog), and io-impl (S3FileIO), is generated by ilum-core under the configured catalogName; it does not need to be set in values. Ilum injects it, together with the connection settings above (URI, ref, warehouse, S3 endpoint, path-style, and region) and the catalog's S3 credentials (taken from the cluster's object-storage credentials, never persisted on the metastore), into every Spark submission on a cluster that has the metastore attached. No catalog wiring, extraJavaOptions, or credentials need to be set on the cluster or in the notebook session: a chart-managed Nessie catalog works out of the box.

With metastore.enabled: true, the bundled default cluster is attached to this metastore automatically, so jobs on it resolve the catalog without further setup. The catalog is addressed by metastore.nessie.catalogName (default nessie_catalog); tables are referenced as nessie_catalog.. throughout the SQL examples below.

To attach the metastore to another cluster, open its Edit Cluster tab and select Nessie in the General metastore dropdown:

Catalog Selection

Manual configuration

For a cluster running a self-managed Spark image, or when Nessie is configured outside the chart, the nessie_catalog must be pre-configured in the Spark session.

Make sure the Spark image used in the cluster has the Nessie client installed. In particular, the following are required:

Iceberg Spark Runtime (org.apache.iceberg:iceberg-spark-runtime-_) - Required for Nessie support.
Iceberg AWS Bundle (org.apache.iceberg:iceberg-aws-bundle) - Required for S3 support.
Nessie SQL Extensions (org.projectnessie.nessie-integrations:nessie-spark-extensions-_) - Required for Nessie-specific SQL operations.

Ilum's custom Spark image for Nessie, ilum/spark:-nessie, includes all the required dependencies.

Note

When the catalog URI targets the Nessie server's /api/v2 endpoint, also set spark.sql.catalog..client-api-version=2. The Iceberg-bundled Nessie client otherwise defaults to API v1 and the first catalog call fails with NessieApiCompatibilityException: API version mismatch.
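
As a rough sketch of the manual wiring, the snippet below assembles the Spark properties for an Iceberg catalog backed by Nessie. The helper name nessie_catalog_conf and its parameters are hypothetical. The address, warehouse, and S3 endpoint defaults are borrowed from the chart example earlier on this page, and the /api/v2 path is an assumption; adjust all of them to the actual deployment. The property keys are the commonly used Iceberg and Nessie ones, but they should be checked against the versions bundled in your image.

```python
from typing import Dict


def nessie_catalog_conf(
    catalog_name: str = "nessie_catalog",
    uri: str = "http://ilum-nessie:19120/api/v2",  # assumed endpoint path
    ref: str = "main",
    warehouse: str = "s3a://ilum-data/nessie_catalog/",
    s3_endpoint: str = "http://ilum-objectstorage:9000/",
) -> Dict[str, str]:
    """Build Spark properties for an Iceberg catalog backed by Nessie.

    S3 credentials are intentionally left out of this sketch.
    """
    prefix = f"spark.sql.catalog.{catalog_name}"
    conf = {
        # Iceberg extensions plus the Nessie ones that add branch/tag SQL.
        "spark.sql.extensions": ",".join([
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
            "org.projectnessie.spark.extensions.NessieSparkSessionExtensions",
        ]),
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.uri": uri,
        f"{prefix}.ref": ref,
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.s3.endpoint": s3_endpoint,
        f"{prefix}.s3.path-style-access": "true",
    }
    # Pin the client API version only when the URI targets the v2 endpoint.
    if uri.rstrip("/").endswith("/api/v2"):
        conf[f"{prefix}.client-api-version"] = "2"
    return conf


conf = nessie_catalog_conf()
for key, value in sorted(conf.items()):
    print(f"{key}={value}")
```

Each printed pair can be supplied as a --conf key=value argument to spark-submit or set on the session builder before the session is created.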

Warning

Ilum’s spark-nessie image does not include any Delta table dependencies, so the default cluster configuration for Delta tables must be removed when using this image (on a chart deployment, override these via kubernetes.defaultCluster.config). In particular:

Name	Value
`spark.sql.catalog.spark_catalog`	`org.apache.spark.sql.delta.catalog.DeltaCatalog`
`spark.sql.extensions`	`io.delta.sql.DeltaSparkSessionExtension`

Nessie Walkthrough

Start by creating at least one object on the main branch; this avoids problems when merging into an empty branch:

CREATE TABLE nessie_catalog.users (
    user_id INT,
    user_name VARCHAR(20)
);

Create a Branch

CREATE BRANCH dev IN nessie_catalog FROM main;

To verify the result, list all branches and tags:

LIST REFERENCES IN nessie_catalog;

Work on a Branch

Create a table and insert data in the dev branch with the fully qualified name (@ or @):

CREATE TABLE nessie_catalog.`sales@dev` (
    sale_timestamp CHAR(10),
    sale_amount INT,
    payment_method VARCHAR(20)
);

INSERT INTO nessie_catalog.`sales@dev` VALUES
    ('2025-06-01', 1000, 'Online'),
    ('2025-06-02', 1500, 'InStore'),
    ('2025-06-03', 800, 'Online'),
    ('2025-06-04', 1200, 'Mobile'),
    ('2025-06-05', 950, 'InStore');
SELECT COUNT(*) FROM nessie_catalog.`sales@dev`;
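
When such statements are generated programmatically, the branch-qualified name has to be wrapped in backticks because of the @ character. A minimal sketch follows; the helper name table_on_ref is hypothetical, and its validation rules are an assumption of this example rather than Nessie's own naming rules.

```python
import re

_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def table_on_ref(catalog: str, table: str, ref: str) -> str:
    """Return a Spark SQL identifier such as nessie_catalog.`sales@dev`."""
    for part in (catalog, table):
        if not _IDENT.match(part):
            raise ValueError(f"unsupported identifier: {part!r}")
    # Reject characters that would break the backtick-quoted name.
    if not ref or "`" in ref or "@" in ref:
        raise ValueError(f"unsupported reference name: {ref!r}")
    return f"{catalog}.`{table}@{ref}`"


query = f"SELECT COUNT(*) FROM {table_on_ref('nessie_catalog', 'sales', 'dev')}"
print(query)  # SELECT COUNT(*) FROM nessie_catalog.`sales@dev`
```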

Alternatively, use the USE statement to switch the context to a specific branch:

USE BRANCH dev IN nessie_catalog;
SELECT COUNT(*) FROM nessie_catalog.sales;

Note

Because Ilum’s SQL executor treats each query as a stateless entity, the USE statement only takes effect when all related statements are executed together.
To do this, select the entire query block in the editor and then press Execute.

To show the log of all commits on the branch:

SHOW LOG dev IN nessie_catalog;

Merge Branches

MERGE BRANCH dev INTO main IN nessie_catalog;
SHOW TABLES IN nessie_catalog;

Note

If you see an error of No common ancestor in parents of and , this can mean that the branch you are trying to merge into is empty. This will cause the merge to fail, even if the branch you are trying to merge was correctly created from the parent branch.

Cataloging Nessie Tables in OpenMetadata

Iceberg tables managed in Nessie can be surfaced in OpenMetadata — Ilum's metadata catalog and governance layer — so that branched, committed, version-controlled tables become discoverable, classifiable, and lineage-connected assets alongside the rest of the data platform.

OpenMetadata reads Nessie through its Iceberg-REST endpoint (base URI http://ilum-nessie:19120/iceberg) and catalogs the tables visible on a single reference. The reference is selected by the warehouse (default nessie_catalog → ref main), not by the URI path. Nessie's branch/tag/commit history is flattened: OpenMetadata sees only the current state of that one reference, not the version timeline. The integration is opt-in and enabled with two values:

nessie:
  enabled: true
openmetadataBootstrap:
  services:
    iceberg:
      enabled: true

For the connector details, branch-semantics caveats, object-storage credential note, and full enabling reference, see Iceberg tables via Project Nessie on the OpenMetadata page.

Note

This integration is newly added and opt-in. Its configuration is validated against the chart; the end-to-end result (a Nessie-committed Iceberg table appearing in the OpenMetadata catalog) is pending live verification. Treat it as a preview until confirmed.

Best Practices

Develop in Isolation: Use branches for development or experiments. Promote changes through a hierarchy (e.g., dev → staging → main).
Merge Frequently: Merge changes regularly to minimize conflicts.
Keep Branches Short-lived: Remove feature branches after merging.
Avoid Conflicts: Sync your branch with the latest target branch before merging.
Tag Milestones: Use tags for stable releases or important checkpoints.
Document Changes: Add commit messages for traceability.

Learn More

For advanced SQL operations and the full Nessie Spark SQL reference, see:
👉 Nessie Spark SQL Reference

Nessie with Ilum combines Spark’s power with Git-like data management, enabling robust “data as code” workflows for your lakehouse.
