# Delta Lake

[Delta Lake](https://docs.delta.io/latest/index.html) is an open-source storage format that adds ACID transactions, schema enforcement, and time travel to Parquet on object storage. Nilus reads Delta tables stored in a DataOS Lakehouse depot as a `batch` source: the runtime starts DuckDB with the `delta` extension and exposes the Delta table as a view that the rest of the pipeline extracts from.

The Delta Lake source shares the bulk of its implementation with the Iceberg-on-DataOS Lakehouse sources; only the view-creation step differs (`delta_scan(...)` instead of `iceberg_scan(...)`). For Iceberg variants, see the AWS-backed, Azure-backed, and GCP-backed DataOS Lakehouse source pages. For writing **into** a Lakehouse, see the [AWS-backed DataOS Lakehouse](/concepts/resources/nilus/destinations/dataos-lakehouse/aws-backed.md), [Azure-backed DataOS Lakehouse](/concepts/resources/nilus/destinations/dataos-lakehouse/azure-backed.md), or [GCP-backed DataOS Lakehouse](/concepts/resources/nilus/destinations/dataos-lakehouse/gcp-backed.md) destination pages.

## Requirements

Connectivity and credentials must both be in place before the pipeline can run.

### Connectivity

* The Nilus runtime must reach the storage endpoint hosting the Delta table (S3, ABFSS, or WASBS).
* The depot's connection secret must carry credentials appropriate for the storage backend: an AWS access key and secret key for S3, or an Azure account name and key for ABFSS/WASBS. GCS-backed Delta tables are not currently supported through this connector.
* The DuckDB extensions `httpfs`, `delta`, `parquet`, and the storage-specific extension (`aws` or `azure`) are installed at pipeline startup. No customer action is required.

### Connection model

Nilus reads Delta tables **only via a DataOS Lakehouse depot**. There is no direct `delta://` URI that customers can use. The depot is required because it carries the storage configuration (bucket / container, region, optional endpoint, optional relative-path prefix) that the source connector needs to register its DuckDB secret and view.

| Mode                          | Example                                 | Notes                                                                                        |
| ----------------------------- | --------------------------------------- | -------------------------------------------------------------------------------------------- |
| DataOS depot (only supported) | `dataos://my-deltalake-depot?purpose=r` | The depot resolves to a Delta-on-object-store connection. `purpose=r` indicates read intent. |

### Supported storage backends

| Storage type                          | Status          | Notes                                                                                                                           |
| ------------------------------------- | --------------- | ------------------------------------------------------------------------------------------------------------------------------- |
| Amazon S3 (`s3`)                      | ✅ Supported     | Requires `aws_region`; AWS access/secret key supplied via depot.                                                                |
| Azure Blob Storage (`abfss`, `wasbs`) | ✅ Supported     | Requires `container`; account name/key supplied via depot.                                                                      |
| Google Cloud Storage (`gcs`)          | ❌ Not supported | The Delta source path raises "Unsupported source lakehouse type" on `gcs`. Use the Iceberg + GCP-backed Lakehouse path instead. |

## Source options

| Option                | Required              | Description                                                                                                                                                                                                                                                                                    |
| --------------------- | --------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `source_table`        | Yes                   | Two-part Delta table name in `<collection>.<dataset>` form. The connector splits on `.` and builds the view as `SELECT * FROM delta_scan('<storage-uri>/<prefix>/<collection>/<dataset>')`. The collection corresponds to the namespace; the dataset corresponds to the Delta table directory. |
| `aws_region`          | Conditional (S3 only) | AWS region of the bucket. Required when the depot's connection secret omits `region`.                                                                                                                                                                                                          |
| `aws_endpoint`        | No (S3 only)          | Custom S3 endpoint (e.g. for VPC endpoints or S3-compatible storage).                                                                                                                                                                                                                          |
| `incremental_key`     | No                    | Timestamp or numeric column used to identify newly visible rows for each run. Use the dataset's commit-time column or a monotonic surrogate.                                                                                                                                                   |
| `interval_start`      | No                    | Optional ISO-8601 lower bound for the extraction window.                                                                                                                                                                                                                                       |
| `interval_end`        | No                    | Optional ISO-8601 upper bound for the extraction window.                                                                                                                                                                                                                                       |
| `page_size`           | No                    | Rows per extraction batch.                                                                                                                                                                                                                                                                     |
| `sql_limit`           | No                    | Caps total rows extracted per run. Useful for sampling and validation.                                                                                                                                                                                                                         |
| `sql_exclude_columns` | No                    | Comma-separated column names to skip during extraction.                                                                                                                                                                                                                                        |
| `type_hints`          | No                    | Object map of `column_name: <type>` to override inferred types.                                                                                                                                                                                                                                |
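
As a rough illustration of how several of these options combine, the fragment below sketches a `source` block (placed under `spec`) for a bounded, incremental extraction. The column names (`commit_ts`, `internal_notes`, `debug_payload`), the timestamps, and the `page_size` value are placeholders rather than recommendations, and the sketch assumes the window bounds are evaluated against the `incremental_key` column.

```yaml
source:
  address: dataos://my-deltalake-depot?purpose=r
  options:
    source_table: sales.orders               # <collection>.<dataset>
    incremental_key: commit_ts               # hypothetical timestamp column
    interval_start: "2024-01-01T00:00:00Z"   # ISO-8601 lower bound (placeholder)
    interval_end: "2024-02-01T00:00:00Z"     # ISO-8601 upper bound (placeholder)
    page_size: 50000                         # rows per extraction batch (placeholder)
    sql_exclude_columns: "internal_notes,debug_payload"  # hypothetical columns to skip
```

For a quick validation run, `sql_limit` can be added to the same block to cap the number of rows extracted.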

> **Note** A custom SQL surface is also available: pass `source_table: "query:SELECT ... FROM <collection>.<dataset> WHERE ..."`. The connector enforces a strict subset of SQL: `SELECT` only, no `JOIN`, no CTEs, and the query must reference exactly one table (matching the `<collection>.<dataset>` shape).
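
A minimal sketch of the custom SQL form follows, assuming hypothetical columns `order_id`, `amount`, and `commit_ts` on `sales.orders`. The query stays within the restrictions described in the note: a single `SELECT`, one two-part table reference, no `JOIN`, and no CTE.

```yaml
source:
  address: dataos://my-deltalake-depot?purpose=r
  options:
    # The whole query is one quoted string prefixed with "query:".
    # Column names and the filter are illustrative only.
    source_table: "query:SELECT order_id, amount, commit_ts FROM sales.orders WHERE amount > 100"
```

Quoting the value keeps YAML from misreading the colon after `query`.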

## Sample Nilus configs

Each example below is self-contained and uses the current Nilus pipeline shape.

### Batch, Delta on S3 → Lakehouse

```yaml
name: nilus-delta-s3-batch
version: v1alpha
type: nilus
description: Delta Lake (S3) → analytics Lakehouse incremental snapshot
spec:
  type: batch
  compute: runnable-default
  source:
    address: dataos://my-delta-s3-depot?purpose=r
    options:
      source_table: sales.orders
      incremental_key: commit_ts
      aws_region: us-west-2
  sink:
    address: dataos://analytics-lakehouse
    options:
      dest_table: sales.orders_curated
      incremental_strategy: merge
      loader_file_format: parquet
```

### Batch, Delta on Azure → Lakehouse

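```yaml
name: nilus-delta-azure-batch
version: v1alpha
type: nilus
description: Delta Lake (Azure) → analytics Lakehouse incremental snapshot
spec:
  type: batch
  compute: runnable-default
  source:
    address: dataos://my-delta-azure-depot?purpose=r
    options:
      # No aws_region here: the container and account credentials come from the depot.
      source_table: sales.orders
      incremental_key: commit_ts
  sink:
    address: dataos://analytics-lakehouse
    options:
      dest_table: sales.orders_curated
      incremental_strategy: merge
      loader_file_format: parquet
```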

## Behavior and capabilities

* **Compute model**: the runtime starts a local DuckDB instance, installs the `httpfs`, `delta`, `parquet`, and storage extensions (`aws` for S3, `azure` for ABFSS/WASBS), registers a persistent secret from the depot credentials, and creates a view of the Delta table via `delta_scan(<path>)`. Downstream extraction reads from that view.
* **Object model**: `<collection>.<dataset>` maps to a single Delta table directory laid out as `<bucket>/<prefix>/<collection>/<dataset>/_delta_log/...`. Multi-level namespaces are not supported.
* **Pipeline mode**: `batch` only.
* **Snapshot semantics**: each run reads the current head snapshot of the Delta table. Time-travel reads (`VERSION AS OF` / `TIMESTAMP AS OF`) are not exposed through this connector; use a hand-written query if you need a specific version.
* **File format**: Delta data files are Parquet; the connector reads through DuckDB's `delta_scan` which handles transaction-log replay automatically.
* **Schema evolution**: DuckDB's `delta_scan` applies the latest schema from the Delta log. Renamed or dropped columns appear as they do in the current schema, so downstream pipelines see only the table's current shape.

## Troubleshooting

| Symptom                                                          | Likely cause                                                                                                                                           | Resolution                                                                                                                                 |
| ---------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------ |
| `Unsupported source lakehouse type: 'gcs'.`                      | Trying to read a GCS-backed Delta table.                                                                                                               | Not currently supported. Migrate the dataset to Iceberg + GCS, or route through a Databricks workspace acting as a SQL endpoint.           |
| `aws_region is required for S3 lakehouse.` (S3 backend)          | Depot's connection secret does not carry `region`, and `aws_region` is not set on the pipeline.                                                        | Set `aws_region` in `source.options`, or update the depot's connection block to include `region`.                                          |
| `Container is required for Azure deltalake.` (Azure backend)     | Depot's connection block does not specify a container name.                                                                                            | Update the depot to include `abfss.container: <container-name>` (or `wasbs.container`).                                                    |
| `source_table is required for lakehouse source.`                 | `source_table` omitted from `source.options`.                                                                                                          | Set `source_table: <collection>.<dataset>` with exactly two parts.                                                                         |
| `Failed to create view for table: ...`                           | The Delta table path is wrong (collection / dataset names don't match the layout), or the principal lacks read permissions on the object-store prefix. | Verify the layout (`<storage-root>/<relativePath>/<collection>/<dataset>/_delta_log/...`). Re-check IAM / RBAC on the bucket or container. |
| `query should have two part namespace for table`                 | `source_table: "query:..."` references a table whose name is not `<collection>.<dataset>`.                                                             | Rewrite the SQL so the referenced table has exactly two name parts.                                                                        |
| Run extracts the full table every time despite `incremental_key` | The column is not monotonically increasing in the source, or the pipeline state was reset between runs.                                                | Pick a column that is genuinely monotonic, and check that pipeline state is preserved between runs.                                        |

## Related docs

* **DataOS Lakehouse destinations**: see the [AWS-backed DataOS Lakehouse](/concepts/resources/nilus/destinations/dataos-lakehouse/aws-backed.md), [Azure-backed DataOS Lakehouse](/concepts/resources/nilus/destinations/dataos-lakehouse/azure-backed.md), or [GCP-backed DataOS Lakehouse](/concepts/resources/nilus/destinations/dataos-lakehouse/gcp-backed.md) variants for writing into a Lakehouse.
* [Optimize Sink Datasets](/concepts/resources/nilus/pipeline-optimization/optimize-sink-datasets.md): guidance on `incremental_strategy`, `partition_by`, `cluster_by`, and other sink-side shape settings.

