# Databricks (Metadata)

[Databricks](https://docs.databricks.com/) is supported as a metadata source. A `spec.type: metadata` pipeline introspects a Databricks workspace through its SQL warehouse and publishes source context into the DataOS metadata catalog **without copying any table rows**. The same `service_type: databricks` works for both classic (Hive Metastore) and Unity Catalog deployments; Nilus reads the catalog topology from the workspace.

For batch row movement out of Databricks, see the [Databricks batch source](/concepts/resources/nilus/batch/batch-sources/databricks.md). For the field-by-field authoring contract, see [Understanding Metadata Pipeline Config](/concepts/resources/nilus/metadata-pipelines/pipeline-config.md).

## Metadata stages

`service_type: databricks` supports the full DAG. The required `mode` field decides how much of it runs: `shallow` runs `metadata` + `lineage`; `deep` adds `profiler`, `classification`, and `usage`. Source inventory (`metadata`) runs first; once it succeeds, the remaining stages run in parallel.

| Stage            | Runs in           | What it lands in the catalog                                                                       |
| ---------------- | ----------------- | -------------------------------------------------------------------------------------------------- |
| `metadata`       | `shallow`, `deep` | Catalogs, schemas, tables, views, and columns across the Unity Catalog or Hive Metastore topology. |
| `lineage`        | `shallow`, `deep` | Asset and column lineage parsed from view definitions and query history.                           |
| `profiler`       | `deep`            | Per-column statistics (row counts, null counts, distinct counts, min/max, basic distributions).    |
| `classification` | `deep`            | Auto-classification tags applied to columns (PII heuristics).                                      |
| `usage`          | `deep`            | Query history and popularity: which datasets are queried, when, and by what kinds of queries.      |

The first successful run establishes the source inventory and lineage; in `deep`, profiles, classification, and usage then deepen it. A frequent `shallow` pipeline keeps inventory and lineage fresh; pair it with a less frequent `deep` pipeline for the full profile.

## Source options

Metadata pipelines accept only the customer-facing `source.options` keys below. Do **not** set `source_table`; Nilus assigns a stage-specific value to each DAG node internally.

| Option               | Required | Used by stages     | Description                                                                                                                              |
| -------------------- | -------- | ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------- |
| `service_type`       | Yes      | all                | Must be `databricks`.                                                                                                                    |
| `database_filter`    | No       | all                | Restrict by catalog name. Object with `includes` / `excludes` arrays of regex patterns.                                                  |
| `schema_filter`      | No       | all                | Restrict by schema name. Same shape as `database_filter`.                                                                                |
| `table_filter`       | No       | all                | Restrict by table / view name. Same shape as `database_filter`.                                                                          |
| `query_log_duration` | No       | `lineage`, `usage` | Days of query history to ingest per run. Defaults to `1`. Used by `lineage` (both modes) and `usage` (`deep` only).                      |
| `result_limit`       | No       | `lineage`, `usage` | Maximum number of query-history rows to fetch per run. Defaults to `10000000`. Used by `lineage` (both modes) and `usage` (`deep` only). |

`mode` (`shallow` or `deep`) is a required `spec` field, not a `source.options` key. See [Understanding Metadata Pipeline Config → Choosing a mode](/concepts/resources/nilus/metadata-pipelines/pipeline-config.md#choosing-a-mode).
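As a minimal sketch of where these keys sit, the fragment below places `mode` directly under `spec` and the three filters under `source.options`. The regex patterns are illustrative placeholders rather than values from a real workspace, and the fragment leaves out `address` and the other fields a complete resource needs.

```yaml
spec:
  type: metadata
  mode: shallow                 # required; a spec field, not a source.options key
  source:
    options:
      service_type: databricks  # must be databricks
      database_filter:          # matches catalog names
        includes: ["main"]
      schema_filter:            # same includes/excludes shape
        includes: ["^gold_"]
        excludes: ["^bronze_tmp_"]
      table_filter:             # same shape; pattern below is hypothetical
        excludes: ["_backup$"]
      query_log_duration: 1     # documented default, in days
      result_limit: 10000000    # documented default, in query-history rows
      # source_table is intentionally absent; Nilus sets it per DAG node
```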

## Required permissions

A metadata pipeline connects through a Databricks SQL warehouse. The principal that owns the access token must be able to read the catalog, schema, and table topology in scope:

* the SQL warehouse must be running, or autostart must be enabled;
* for Unity Catalog, grant `USE CATALOG`, `USE SCHEMA`, and table-level read on the objects in scope;
* for lineage and usage, the principal needs access to the workspace query history surfaces.

```sql
GRANT USE CATALOG ON CATALOG <catalog_name> TO `<principal>`;
GRANT USE SCHEMA ON SCHEMA <catalog_name>.<schema_name> TO `<principal>`;
GRANT SELECT ON SCHEMA <catalog_name>.<schema_name> TO `<principal>`;
```

For Hive Metastore (non-Unity-Catalog) deployments, drop the catalog-level grants and grant read on the database directly.

## Sample Nilus config

Databricks metadata connects through a direct `metadata+databricks://` URI with a projected access token; there is no DataOS depot variant for Databricks:

```yaml
name: databricks-metadata
version: v1alpha
type: nilus
tags: [nilus, metadata]
description: Catalog Databricks metadata, schema, lineage, and query usage
spec:
  type: metadata
  mode: deep
  compute: comet-compute
  schedule:
    crons:
      - "0 */6 * * *"
    concurrencyPolicy: Forbid
  use:
    projection:
      secrets:
        - id: engineering:databricks-secret
          contextAlias: dbxsecret
      projections:
        envVars:
          - key: DBX_TOKEN
            template: "{{ secrets['dbxsecret'].token | base64_decode }}"
  source:
    address: metadata+databricks://token:{DBX_TOKEN}@adb-12345.6.azuredatabricks.net?http_path=/sql/1.0/warehouses/abc123def456&catalog=main&schema=gold
    options:
      service_type: databricks
      database_filter:
        includes: ["main"]
      schema_filter:
        includes: ["^gold_", "^silver_"]
        excludes: ["^bronze_tmp_"]
      query_log_duration: 3
      result_limit: 10000
```

With `mode: deep`, this resource produces a five-node DAG (`metadata` root → `lineage`, `profiler`, `classification`, `usage`). Switch to `mode: shallow` for a two-node `metadata` + `lineage` DAG. For more ready-to-edit examples, see [Metadata Sample Configs](/concepts/resources/nilus/metadata-pipelines/sample-configs.md).
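The pairing suggested under Metadata stages, a frequent `shallow` pipeline alongside a less frequent `deep` one, might be sketched as a companion resource like the one below. It reuses the secret, compute, and address from the sample above. The resource name and the hourly cron are assumptions chosen for illustration, not recommended values.

```yaml
name: databricks-metadata-shallow   # hypothetical name
version: v1alpha
type: nilus
tags: [nilus, metadata]
description: Keep Databricks inventory and lineage fresh between deep runs
spec:
  type: metadata
  mode: shallow                     # metadata + lineage only
  compute: comet-compute
  schedule:
    crons:
      - "0 * * * *"                 # hourly; an assumed cadence
    concurrencyPolicy: Forbid
  use:
    projection:
      secrets:
        - id: engineering:databricks-secret
          contextAlias: dbxsecret
      projections:
        envVars:
          - key: DBX_TOKEN
            template: "{{ secrets['dbxsecret'].token | base64_decode }}"
  source:
    address: metadata+databricks://token:{DBX_TOKEN}@adb-12345.6.azuredatabricks.net?http_path=/sql/1.0/warehouses/abc123def456&catalog=main&schema=gold
    options:
      service_type: databricks
      database_filter:
        includes: ["main"]
      schema_filter:
        includes: ["^gold_", "^silver_"]
        excludes: ["^bronze_tmp_"]
      query_log_duration: 1         # lineage reads query history in both modes
```

Keeping the filters identical across the two resources means the `deep` run profiles the same objects that the `shallow` run inventories.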

## Behavior and capabilities

* **Connection**: Databricks metadata connects only through a direct `metadata+databricks://...` URI with a projected access token; there is no DataOS depot variant for Databricks. The URI carries the workspace host, HTTP path, catalog, and schema. Authentication uses Databricks personal access tokens (PAT) or OAuth M2M.
* **Compute target**: Nilus connects to a SQL warehouse, not a notebook or all-purpose cluster. Use a small dedicated warehouse for metadata runs.
* **Scope discipline**: set `database_filter` / `schema_filter` / `table_filter` in production; an unfiltered sweep over a large workspace can take hours per stage.
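To make the connection bullet concrete, the fragment below annotates the parts of the sample address. The angle-bracket placeholders in the comments are labels added here for illustration; only the `address` line itself comes from the sample config.

```yaml
source:
  # Shape of the URI:
  #   metadata+databricks://token:{DBX_TOKEN}@<workspace-host>?http_path=<http-path>&catalog=<catalog>&schema=<schema>
  #
  #   {DBX_TOKEN}       access token, projected as an environment variable
  #   <workspace-host>  Databricks workspace host
  #   http_path         HTTP path of the SQL warehouse
  #   catalog, schema   catalog and schema carried in the URI
  address: metadata+databricks://token:{DBX_TOKEN}@adb-12345.6.azuredatabricks.net?http_path=/sql/1.0/warehouses/abc123def456&catalog=main&schema=gold
```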

## Troubleshooting

| Symptom                                               | Likely cause                                                       | Resolution                                                                                   |
| ----------------------------------------------------- | ------------------------------------------------------------------ | -------------------------------------------------------------------------------------------- |
| `400 Invalid token` / `401 Unauthorized`              | Access token expired, revoked, or scoped to a different workspace. | Rotate the PAT or refresh the OAuth M2M token; confirm the workspace host matches the token. |
| `Cluster is not running and autostart is not enabled` | The SQL warehouse is stopped.                                      | Turn on auto-start for the SQL warehouse, or start it before the scheduled run.              |
| Inventory lands but lineage/usage are empty           | The principal can read tables but not query history.               | Grant the principal access to workspace query history.                                       |
| A stage runs for hours                                | The extraction scope is unbounded.                                 | Tighten `database_filter` / `schema_filter` / `table_filter`.                                |

## Related docs

* [Databricks](/concepts/resources/nilus/batch/batch-sources/databricks.md): batch row movement out of Databricks.
* [Metadata Sources](/concepts/resources/nilus/metadata-pipelines/metadata-sources.md): all metadata-capable sources and how to scope extraction.
* [Understanding Metadata Pipelines](/concepts/resources/nilus/metadata-pipelines.md): the conceptual model.
* [Understanding Metadata Pipeline Config](/concepts/resources/nilus/metadata-pipelines/pipeline-config.md): the `spec.type: metadata` contract and DAG anatomy.
* [Metadata Sample Configs](/concepts/resources/nilus/metadata-pipelines/sample-configs.md): ready-to-edit YAML.

