# Databricks

[Databricks](https://docs.databricks.com/) is a unified analytics platform built on Apache Spark. Nilus reads from Databricks as a **batch source** through the standard Databricks SQL endpoint. Batch extraction uses a SQL warehouse (formerly "SQL endpoint") attached to a Unity Catalog or Hive Metastore, through Databricks's official SQL connector under the SQLAlchemy `databricks://` dialect.

For information on writing **into** Databricks, see the [Databricks destination](/concepts/resources/nilus/destinations/cloud-warehouses/databricks.md) page.

## Requirements

Connectivity and credentials must both be in place before the pipeline can run.

### Connectivity

* The Nilus runtime must reach the Databricks workspace's SQL warehouse endpoint, typically `<workspace>.cloud.databricks.com` (AWS), `adb-<id>.<n>.azuredatabricks.net` (Azure), or `<workspace>.gcp.databricks.com` (GCP) over HTTPS 443.
* The SQL warehouse must be running, or autostart must be enabled; Nilus does not start a stopped warehouse before reading. Warehouse start time directly affects pipeline run duration.
* The personal access token (or OAuth M2M token) used to connect must have `USE CATALOG`, `USE SCHEMA`, and `SELECT` on every table the pipeline reads.

### Required parameters

| Parameter      | Required    | Default   | Description                                                                                                                                    |
| -------------- | ----------- | --------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
| `host`         | Yes         | -         | Databricks workspace hostname (no `https://` prefix, no path).                                                                                 |
| `http_path`    | Yes         | -         | HTTP path of the SQL warehouse, of the form `/sql/1.0/warehouses/<warehouse-id>`. Available from the SQL warehouse's "Connection details" tab. |
| `access_token` | Yes         | -         | A Databricks personal access token (PAT) or OAuth M2M access token with the catalog / schema / table grants listed above.                      |
| `catalog`      | Yes         | -         | Unity Catalog catalog name (`main`, `hive_metastore`, or a custom catalog). For workspaces without Unity Catalog, use `hive_metastore`.        |
| `schema`       | Recommended | `default` | Default schema for unqualified table references. Set explicitly so `source_table` resolution is reproducible.                                  |

### Database-side permissions

```sql
GRANT USE CATALOG ON CATALOG <catalog_name> TO `<principal>`;
GRANT USE SCHEMA ON SCHEMA <catalog_name>.<schema_name> TO `<principal>`;
GRANT SELECT ON SCHEMA <catalog_name>.<schema_name> TO `<principal>`;
```

For non-Unity-Catalog (Hive Metastore) deployments, drop the catalog-level grants and grant `SELECT` on the database directly.

### URI format

```
databricks://token:<access_token>@<host>:443/<schema>?catalog=<catalog>&http_path=<http_path>
```

The username portion is the literal string `token`; the password portion is the access token. `http_path` contains slashes: when it is passed as a query parameter, as most pipeline manifests do, the slashes are accepted as-is and need no URL-encoding. In any other position, URL-encode the value.

{% hint style="info" %}
Databricks connects only through the direct `databricks://` URI; there is no DataOS depot variant for Databricks in Nilus (for both batch and metadata pipelines). Project the access token through `spec.use.projection` and reference it as a `{ENV_VAR}` placeholder in the URI instead of inlining a literal token.
{% endhint %}
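
As an illustration of how the parameters map onto the URI, the fragment below is a sketch for a workspace without Unity Catalog. The hostname and the `DBX_TOKEN` variable name are placeholders, `<warehouse-id>` is left for you to fill in, and the `source.address` key follows the sample config on this page.

```yaml
source:
  # host      = dbc-example.cloud.databricks.com (placeholder hostname)
  # schema    = default
  # catalog   = hive_metastore (no Unity Catalog)
  # http_path = passed unencoded as a query parameter
  address: databricks://token:{DBX_TOKEN}@dbc-example.cloud.databricks.com:443/default?catalog=hive_metastore&http_path=/sql/1.0/warehouses/<warehouse-id>
```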

## Source options

| Option                 | Required | Description                                                                                                                                                                             |
| ---------------------- | -------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `source_table`         | Yes      | Fully qualified table name in `<schema>.<table>` form. Add an explicit catalog prefix (`<catalog>.<schema>.<table>`) only when reading across catalogs (Unity Catalog).                 |
| `incremental_key`      | No       | Timestamp or numeric column used to identify newly visible rows for each run. Delta tables typically expose `_commit_timestamp` (when enabled) or a user-managed `updated_at` column.   |
| `interval_start`       | No       | Optional ISO-8601 lower bound for the extraction window.                                                                                                                                |
| `interval_end`         | No       | Optional ISO-8601 upper bound for the extraction window.                                                                                                                                |
| `page_size`            | No       | Rows per extraction batch (default `50000`).                                                                                                                                            |
| `sql_reflection_level` | No       | `full` (default) or reduced; controls how thoroughly Nilus reflects the source schema before extraction. Useful when the warehouse has thousands of tables and full reflection is slow. |
| `sql_limit`            | No       | Caps total rows extracted per run. Use for sampling and validation.                                                                                                                     |
| `sql_exclude_columns`  | No       | Comma-separated column names to skip during extraction.                                                                                                                                 |
| `type_hints`           | No       | Object map of `column_name: <type>` to override inferred types. Supported types: `text`, `bigint`, `bool`, `timestamp`, `date`, `decimal`, `double`, `binary`, `json`, `time`.          |
| `max_table_nesting`    | No       | String. `"0"` for fully flattened analytics-friendly output (default for SQL sources).                                                                                                  |
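
A sketch of how several of these options might be combined under `source.options`, assuming the keys nest as the table describes. The table name, column names, and values are hypothetical and chosen only for illustration.

```yaml
source:
  options:
    source_table: analytics.orders            # <schema>.<table>
    incremental_key: updated_at               # user-managed timestamp column
    interval_start: "2024-01-01T00:00:00Z"    # ISO-8601 lower bound
    interval_end: "2024-02-01T00:00:00Z"      # ISO-8601 upper bound
    page_size: 20000                          # rows per batch; default is 50000
    sql_exclude_columns: "internal_notes,raw_payload"   # comma-separated
    type_hints:                               # column_name: <type>
      amount: decimal
      is_test: bool
    max_table_nesting: "0"                    # string; fully flattened output
```

For a sampling or validation run, `sql_limit` could be added to the same block to cap the rows extracted.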

## Sample Nilus configs

The example below is self-contained and uses the current Nilus pipeline shape.

### Batch, Unity Catalog table to Lakehouse (direct URI with projected access token)

```yaml
name: nilus-databricks-batch
version: v1alpha
type: nilus
description: Databricks → DataOS Lakehouse incremental snapshot
spec:
  type: batch
  compute: runnable-default
  use:
    projection:
      secrets:
        - id: engineering:databricks-secret
          contextAlias: dbxsecret
      projections:
        envVars:
          - key: DBX_TOKEN
            template: "{{ secrets['dbxsecret'].token | base64_decode }}"
  source:
    address: databricks://token:{DBX_TOKEN}@adb-12345.6.azuredatabricks.net:443/analytics?catalog=main&http_path=/sql/1.0/warehouses/abc123def456
    options:
      source_table: analytics.orders
      incremental_key: updated_at
  sink:
    address: dataos://analytics-lakehouse
    options:
      dest_table: sales.orders_raw
      incremental_strategy: merge
      loader_file_format: parquet
```

## Behavior and capabilities

* **Compute model**: Nilus connects through the Databricks SQLAlchemy dialect and submits parameterized SQL reads against the configured SQL warehouse. Notebook and all-purpose clusters are not valid connection targets; only SQL warehouses are supported.
* **Object model**: Unity Catalog `<catalog>.<schema>.<table>` (three-level) or Hive Metastore `<database>.<table>` (two-level). `source_table` uses the two-part `<schema>.<table>` form by default; prefix with the catalog only when you cross catalogs.
* **Pipeline mode**: this page documents Databricks batch extraction. Databricks also supports a metadata pipeline that catalogs the source without copying rows; see [Databricks (Metadata)](/concepts/resources/nilus/metadata-pipelines/metadata-sources/databricks-metadata.md).
* **Authentication modes**: Databricks personal access tokens (PAT) and OAuth M2M (workspace-managed application principal) tokens. The connector treats both as opaque bearer tokens; rotate them on the standard Databricks cadence (PATs default to 90 days).
* **Custom queries**: supply `source_table: "query:SELECT ... FROM ..."` to extract from a hand-authored SQL query.
* **Photon / serverless**: both are fully supported. The connection model is identical; pipeline durations may improve substantially with Photon-enabled or serverless warehouses for large reads.
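
The two less common `source_table` forms described in this list can be sketched as follows. These are alternatives rather than one config, and the catalog, schema, table, and column names are hypothetical.

```yaml
# Alternative 1: cross-catalog read using the three-part name.
source:
  options:
    source_table: finance.billing.invoices    # <catalog>.<schema>.<table>
---
# Alternative 2: hand-authored query via the `query:` prefix.
source:
  options:
    source_table: "query:SELECT order_id, status, updated_at FROM analytics.orders WHERE status = 'shipped'"
```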

## Troubleshooting

| Symptom                                                           | Likely cause                                                                                                                 | Resolution                                                                                                                                                     |
| ----------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `400 Invalid token` / `401 Unauthorized`                          | Access token expired, revoked, or scoped to a different workspace.                                                           | Rotate the PAT or refresh the OAuth M2M token. Confirm the workspace host matches the token's workspace.                                                       |
| `Cluster is not running and autostart is not enabled`             | The SQL warehouse is stopped and the workspace does not have autostart configured.                                           | Enable "Auto-start on query" on the SQL warehouse (Settings → SQL warehouse → Auto stop / Auto start), or start the warehouse before scheduling the pipeline.  |
| `Table or view not found: <table>`                                | `source_table` references a table in a catalog/schema the access token can't see, or the schema is in a non-default catalog. | Use the three-part `<catalog>.<schema>.<table>` form, or set the `catalog` query parameter to the right catalog. Re-check `USE CATALOG` / `USE SCHEMA` grants. |
| `Insufficient privileges: SELECT denied on table`                 | Token's principal has catalog/schema access but not table-level `SELECT`.                                                    | Grant `SELECT` on the table (or schema-level `GRANT SELECT ON SCHEMA`).                                                                                        |
| Run extracts the full table every time despite `incremental_key`  | The column is not monotonically increasing in the source, or the pipeline state was reset between runs.                      | Pick a column that is genuinely monotonic. For Delta tables, prefer a user-managed `updated_at` column over `_commit_timestamp` unless change data feed is enabled on the table. |
| Warehouse "starting" for several minutes before extraction begins | The SQL warehouse is cold; autostart is enabled, but the warehouse is large and slow to start.                               | Use a dedicated small SQL warehouse (Pro or Serverless, 2X-Small) sized to the ingestion workload.                                                             |
| Reflection step takes minutes before first row is extracted       | The catalog/schema has thousands of tables; SQLAlchemy reflects metadata for the full namespace.                             | Lower `sql_reflection_level`, or move ingest to a dedicated schema with only the tables Nilus needs.                                                           |

## Related docs

* [Databricks](/concepts/resources/nilus/destinations/cloud-warehouses/databricks.md): companion destination connector.
* [Databricks (Metadata)](/concepts/resources/nilus/metadata-pipelines/metadata-sources/databricks-metadata.md): catalog Databricks without copying rows via a `spec.type: metadata` pipeline.
* [Optimize Sink Datasets](/concepts/resources/nilus/pipeline-optimization/optimize-sink-datasets.md): guidance on `incremental_strategy`, `partition_by`, `cluster_by`, and other shape settings.

