# Pipeline Config

> Nilus metadata pipelines are authored as `type: nilus` resources with `spec.type: metadata`.

This page explains the Nilus **metadata** pipeline config shape: a single workflow that extracts catalog metadata, column profiles, classification tags, query lineage, and query usage from a connected data system and lands them in the DataOS metadata catalog. You author **one** Nilus resource per source system; Nilus expands it into a multi-stage DAG behind the scenes. A required `mode` field (`shallow` or `deep`) controls how many of those stages run.

For the conceptual model, see [Understanding Metadata Pipelines](/concepts/resources/nilus/metadata-pipelines.md). For ready-to-edit YAML by source, see [Metadata Sample Configs](/concepts/resources/nilus/metadata-pipelines/sample-configs.md).

{% hint style="info" %}
**Supported sources.** This guide documents metadata pipelines for **Snowflake**, **Databricks**, and **DataOS Lakehouse** (Iceberg). Snowflake and Databricks run the multi-stage DAG; DataOS Lakehouse publishes catalog inventory only. Nilus's metadata framework also recognizes additional relational and warehouse source types at the connection layer, but those are not yet covered here; treat them as roadmap until this guide documents them.
{% endhint %}

<details>

<summary>Example Config YAML</summary>

```yaml
name: snowflake-metadata
version: v1alpha
type: nilus
tags:
  - nilus
  - metadata
description: Catalog Snowflake metadata, schema, lineage, and query usage
spec:
  type: metadata
  mode: deep
  compute: comet-compute
  logLevel: INFO
  resources:
    requests:
      cpu: "200m"
      memory: "512Mi"
  schedule:
    crons:
      - "0 */6 * * *"
    concurrencyPolicy: Forbid
  source:
    address: dataos://snowflake-metadata-depot?purpose=rw
    options:
      service_type: snowflake
      database_filter:
        includes:
          - "PROD_DB"
          - "ANALYTICS_DB"
      schema_filter:
        includes:
          - "^MODEL"
        excludes:
          - "^TMP_"
      table_filter:
        excludes:
          - "^_audit"
      query_log_duration: 3
      result_limit: 10000
```

Because this example sets `mode: deep`, that single resource produces all five workflow stages (`metadata`, `lineage`, `profiler`, `classification`, `usage`). With `mode: shallow` it would produce only `metadata` and `lineage`. Either way you do **not** author a separate resource per stage.

</details>

## Choosing a mode

`mode` is **required** on every metadata pipeline and decides how much of the DAG runs. It lets you separate lighter extraction work from heavier work and run each on its own cadence.

| Mode      | Stages it runs (warehouse sources)                               | Use it when                                                                                                                                                                                   |
| --------- | ---------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `shallow` | `metadata` + `lineage`                                           | You want the source inventory and column-level lineage kept current, without the heavier profiling, classification, and usage scans. Lighter and faster, so it can run on a tighter schedule. |
| `deep`    | `metadata` + `lineage` + `profiler` + `classification` + `usage` | You want the full enrichment (column statistics, sensitive-data tags, and query-usage analytics) in addition to inventory and lineage. Heavier, so it is usually scheduled less frequently.   |

`deep` is a strict superset of `shallow`: it runs the same `metadata` and `lineage` stages and adds `profiler`, `classification`, and `usage`. A common pattern is a frequent `shallow` pipeline for fresh structure and lineage plus a less frequent `deep` pipeline for the full profile.
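
As a minimal sketch of that pattern, the two resources below target the same Snowflake depot at different cadences. The resource names and cron expressions are illustrative assumptions; the depot address and field names come from the example config on this page. They are shown as two YAML documents, although in practice each would likely live in its own file.

```yaml
# Hypothetical pair of resources over one source; names and crons are assumptions.
# Frequent, lightweight run: metadata + lineage only.
name: snowflake-metadata-shallow
version: v1alpha
type: nilus
spec:
  type: metadata
  mode: shallow
  compute: comet-compute
  schedule:
    crons:
      - "0 * * * *"        # hourly (assumed cadence)
    concurrencyPolicy: Forbid
  source:
    address: dataos://snowflake-metadata-depot?purpose=rw
    options:
      service_type: snowflake
      database_filter:
        includes:
          - "PROD_DB"
---
# Less frequent, heavier run: adds profiler, classification, and usage.
name: snowflake-metadata-deep
version: v1alpha
type: nilus
spec:
  type: metadata
  mode: deep
  compute: comet-compute
  schedule:
    crons:
      - "0 2 * * *"        # daily at 02:00 (assumed cadence)
    concurrencyPolicy: Forbid
  source:
    address: dataos://snowflake-metadata-depot?purpose=rw
    options:
      service_type: snowflake
      database_filter:
        includes:
          - "PROD_DB"
```

Because `deep` already includes the `metadata` and `lineage` stages, each deep run also refreshes what the shallow pipeline maintains.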

## How a metadata pipeline expands

When you submit a `spec.type: metadata` resource, Nilus's domain template renders it into a **scheduled workflow** whose DAG shape depends on `mode`. The `metadata` stage always runs first; once it succeeds, the remaining stages run in parallel.

`mode: shallow` (Snowflake / Databricks):

```
{name}-metadata ─── {name}-lineage
```

`mode: deep` (Snowflake / Databricks):

```
                 ┌─ {name}-lineage
                 │
{name}-metadata ─┼─ {name}-profiler
                 │
                 ├─ {name}-classification
                 │
                 └─ {name}-usage
```

Each DAG node sets its own `source_table` internally (`"metadata"`, `"lineage"`, `"profiler"`, `"classification"`, `"usage"`) and receives the relevant slice of your `source.options`. You never set `source_table` yourself on a metadata resource.

{% hint style="info" %}
**Lakehouse exception.** When `service_type: lakehouse`, the template renders **only the `metadata` stage** in both modes; lineage, profiler, classification, and usage are skipped. `mode` is still required by the schema, but it does not change the rendered DAG for Lakehouse. Iceberg metadata pipelines produce the catalog backbone only.
{% endhint %}
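
A Lakehouse resource might therefore look like the sketch below. The resource name and depot name are placeholders, and `mode: shallow` is chosen arbitrarily because either value renders the same single-stage DAG.

```yaml
# Sketch of a Lakehouse metadata resource; the name and depot name are placeholders.
name: lakehouse-metadata
version: v1alpha
type: nilus
spec:
  type: metadata
  mode: shallow            # required by the schema; only the metadata stage renders either way
  compute: comet-compute
  source:
    address: dataos://lakehouse-depot?purpose=rw   # hypothetical depot name
    options:
      service_type: lakehouse
      schema_filter:
        includes:
          - "^GOLD_"
```

`query_log_duration` and `result_limit` are left out here because only the `lineage` and `usage` stages read them, and neither stage runs for Lakehouse.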

### What each stage produces

| Stage            | Runs in           | What it lands in the catalog                                                        |
| ---------------- | ----------------- | ----------------------------------------------------------------------------------- |
| `metadata`       | `shallow`, `deep` | Database, schema, table, and column entities. Foundation for everything else.       |
| `lineage`        | `shallow`, `deep` | Column-level data lineage extracted by parsing query history.                       |
| `profiler`       | `deep`            | Per-column statistics (null counts, distinct counts, min/max, basic distributions). |
| `classification` | `deep`            | Auto-classification tags applied to columns (e.g. PII heuristics).                  |
| `usage`          | `deep`            | Query history and usage frequency from the service's query log.                     |

The customer-facing schedule applies to the whole DAG: one cron, one workflow run, every stage the mode selects.

## Configuration elements

Fields are grouped below by function.

### 1. Pipeline metadata

| Field         | Description                                                                                                         |
| ------------- | ------------------------------------------------------------------------------------------------------------------- |
| `name`        | Unique pipeline identifier. Used as the prefix for every DAG node name (`{name}-metadata`, `{name}-profiler`, ...). |
| `version`     | Use `v1alpha` for the current Nilus resource shape.                                                                 |
| `type`        | Must be `nilus` for Nilus-managed pipelines.                                                                        |
| `tags`        | Optional labels for search, grouping, and operations.                                                               |
| `description` | Optional human-readable summary.                                                                                    |

### 2. Nilus spec

| Field       | Required | Description                                                                                                                                                                                                                             |
| ----------- | -------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `spec.type` | Yes      | Must be `metadata` for catalog / lineage / profiler / classification / usage extraction.                                                                                                                                                |
| `mode`      | Yes      | `shallow` or `deep`. Selects how much of the DAG runs: `shallow` is `metadata` + `lineage`; `deep` adds `profiler`, `classification`, and `usage`. Required on every metadata pipeline.                                                 |
| `compute`   | Yes      | Compute profile used to run each DAG node. The reference template uses `comet-compute`.                                                                                                                                                 |
| `logLevel`  | No       | `DEBUG`, `INFO`, `WARNING`, or `ERROR`. Applies to every stage in the DAG.                                                                                                                                                              |
| `runAsUser` | No       | Optional runtime identity. When omitted, Nilus uses the resource owner.                                                                                                                                                                 |
| `resources` | No       | Optional CPU and memory requests / limits applied to **every** DAG node. Metadata workflows are typically lightweight; `200m` / `512Mi` is a sensible starting point. `deep` runs more stages, so give it more headroom than `shallow`. |
| `schedule`  | No       | Optional cron schedule for the whole workflow. One cron drives every stage the mode selects.                                                                                                                                            |
| `use`       | No       | Optional secret projection rules. Same projections are applied to every DAG node.                                                                                                                                                       |
| `source`    | Yes      | The source system Nilus introspects and the per-stage options.                                                                                                                                                                          |

{% hint style="info" %}
Metadata pipelines do **not** declare a `sink` block. Nilus auto-attaches the catalog sink for every DAG node; the catalog target is configured at the DataOS metadata-service level, not per pipeline.
{% endhint %}

### 3. Schedule

`schedule` controls the entire DAG. Each scheduled run executes every stage the mode selects (or just `metadata` for Lakehouse) end-to-end. Start with a 6-hour cadence; tighten only if downstream consumers need fresher catalog data. A `shallow` pipeline is lighter, so it can run more often than a `deep` one over the same source.

```yaml
spec:
  schedule:
    crons:
      - "0 */6 * * *"
    timezone: UTC
    concurrencyPolicy: Forbid
```

* `crons` is an array, not a single `cron` string.
* If `schedule` is omitted, the workflow is not scheduled and must be triggered manually.
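
Since `crons` is an array, it can presumably carry more than one expression. The sketch below assumes that each entry triggers the same whole-DAG run; that behavior and the cron values are assumptions, not something this page confirms.

```yaml
spec:
  schedule:
    crons:
      - "0 6 * * 1-5"      # 06:00 on weekdays (assumed cadence)
      - "0 18 * * 1-5"     # 18:00 on weekdays (assumed cadence)
    timezone: UTC
    concurrencyPolicy: Forbid
```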

### 4. Source

The `source` block identifies the source system Nilus introspects.

```yaml
spec:
  source:
    address: dataos://snowflake-metadata-depot?purpose=rw
    options:
      service_type: snowflake
      database_filter:
        includes: ["PROD_DB"]
      schema_filter:
        includes: ["^MODEL"]
      table_filter:
        excludes: ["^_audit"]
      query_log_duration: 3
      result_limit: 10000
```

#### `source.options` reference

These are the customer-facing `source.options` keys that the current metadata domain template maps into the rendered workflow. Keep the authored resource to this contract even though lower-level extraction code has additional internal knobs.

| Option               | Required | Used by stages                 | Description                                                                                                                                                                                                                                                                                                                         |
| -------------------- | -------- | ------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `service_type`       | Yes      | all                            | Identifies the source-system kind Nilus is introspecting. Documented today: `snowflake`, `databricks`, `lakehouse` (alias `iceberg`). Drives both the extractor and the catalog entity type that gets registered.                                                                                                                   |
| `database_filter`    | No       | all                            | Restrict by database / project / catalog name. Object with `includes` / `excludes` arrays of regex patterns.                                                                                                                                                                                                                        |
| `schema_filter`      | No       | all                            | Restrict by schema / namespace name. Same shape as `database_filter`.                                                                                                                                                                                                                                                               |
| `table_filter`       | No       | all                            | Restrict by table / view name. Same shape as `database_filter`.                                                                                                                                                                                                                                                                     |
| `query_log_duration` | No       | `lineage`, `usage`             | Days of query history to ingest per run. Defaults to `1`. Used by `lineage` (both modes) and `usage` (`deep` only); ignored by `metadata`, `profiler`, and `classification`.                                                                                                                                                        |
| `result_limit`       | No       | `lineage`, `usage`             | Maximum number of query-history rows to fetch per run. Defaults to `10000000`. Used by `lineage` (both modes) and `usage` (`deep` only); ignored by `metadata`, `profiler`, and `classification`.                                                                                                                                   |
| `threads`            | No       | `profiler`, `usage`, `lineage` | Number of parallel workers Nilus uses to parse query history and publish profile/usage/lineage results. Integer `≥ 1`; omit for serial processing. Raising it cuts wall-clock runtime on large `profiler` and `usage` workflows; scale it with the warehouse's available compute and the volume of objects/queries being processed. |

{% hint style="info" %}
Nilus's internal metadata-extraction engine accepts additional lower-level knobs (`stored_procedure_filter`, `classification_filter`, `parsing_timeout_limit`, `query_parsing_timeout_limit`, `json_schema_sample_size`). These are **not** part of the customer-facing v1alpha Nilus contract today. Do not add them to published examples until the domain template and docs intentionally expose them.
{% endhint %}
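
None of the examples on this page sets `threads`, so here is a hedged sketch of a `deep` source block that combines it with the query-log options. The value `4` is an arbitrary assumption; the remaining values are reused from the example config.

```yaml
spec:
  type: metadata
  mode: deep
  source:
    address: dataos://snowflake-metadata-depot?purpose=rw
    options:
      service_type: snowflake
      table_filter:
        excludes:
          - "^_audit"
      query_log_duration: 3     # days of query history (lineage, usage)
      result_limit: 10000       # cap on query-history rows (lineage, usage)
      threads: 4                # assumed value; parallel workers for profiler, usage, lineage
```

Omitting `threads` keeps processing serial, which is a reasonable baseline before tuning it against the warehouse's available compute.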

#### Filter pattern examples

All three filter options share the same shape: a nested `includes` / `excludes` block whose values are arrays of regex patterns evaluated against the corresponding catalog object name. They are optional but recommended: running an unfiltered metadata workflow against a large warehouse can take hours.

```yaml
schema_filter:
  includes:
    - "^MODEL"
    - "^GOLD_"
  excludes:
    - "^TMP_"
```

The same filter values flow into the catalog and enrichment stages that operate over source objects (`metadata`, `profiler`, `classification`, and `lineage`). The `usage` stage is query-log driven and primarily uses `query_log_duration` / `result_limit`.

#### Direct URI vs depot address

* Use `dataos://<depot>?purpose=rw` when the credentials and host details should come from a DataOS depot. This is the canonical pattern for Snowflake and Lakehouse metadata pipelines. The depot spec handles the connection target (account/host, database, warehouse); the depot's connection secret handles the username, optional role, and credentials. Databricks has no DataOS depot variant, so it always uses the direct URI form below.
* Use a direct connector URI with the `metadata+` prefix (`metadata+snowflake://...`, `metadata+databricks://...`, `metadata+lakehouse://...`) when the service is not depot-backed (and always for Databricks). Project secrets through `spec.use.projection` and reference them as `{ENV_VAR}` placeholders in `source.address`.

#### Source-specific requirements

Metadata pipelines need broader read permissions than row-copy pipelines because they inspect source metadata and, for some sources, query history:

| Source           | Requirements to plan for                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Snowflake        | Grant access to the warehouse, database, schemas, tables/views in scope, and account-usage surfaces such as query history, tag references, procedures, and functions. If lineage for policy-tagged views is incomplete, the source role may also need the relevant policy/tag privileges. See [Snowflake (Metadata) → Required permissions](/concepts/resources/nilus/metadata-pipelines/metadata-sources/snowflake-metadata.md#required-permissions) for the detailed grant pattern. |
| Databricks       | Grant the token principal access to the workspace SQL warehouse plus the catalog/schema/table metadata in scope. For Unity Catalog, plan for `USE CATALOG`, `USE SCHEMA`, and table-level read privileges.                                                                                                                                                                                                                                                                            |
| DataOS Lakehouse | Use a DataOS Lakehouse depot. The current metadata path publishes catalog inventory only; profiler, classification, lineage, and usage stages are skipped.                                                                                                                                                                                                                                                                                                                            |

### 5. Secrets and projections

For direct connector URIs, project credentials under `spec.use.projection` and reference them as `{ENV_VAR}` placeholders in `source.address`. The same projection wiring is applied to every DAG node:

```yaml
spec:
  use:
    projection:
      secrets:
        - id: engineering:snowflake-secret
          contextAlias: snowsecret
      projections:
        envVars:
          - key: SNOWFLAKE_USER
            template: "{{ secrets['snowsecret'].user | base64_decode }}"
          - key: SNOWFLAKE_PASSWORD
            template: "{{ secrets['snowsecret'].password | base64_decode }}"
  source:
    address: metadata+snowflake://{SNOWFLAKE_USER}:{SNOWFLAKE_PASSWORD}@account/db?warehouse=metadata_wh&role=metadata_role
    options:
      service_type: snowflake
```

When using `dataos://<depot>?purpose=rw`, the depot supplies the credentials automatically and `spec.use.projection` is usually unnecessary.

### 6. Execution rendering

When you submit the resource, the Nilus domain template renders it as follows:

* For `service_type: snowflake | databricks` with `mode: shallow` → a `workflow` resource with a 2-node DAG. `metadata` is the root; `lineage` declares `dependencies: [{name}-metadata]` and runs after it.
* For `service_type: snowflake | databricks` with `mode: deep` → a `workflow` resource with a 5-node DAG. `metadata` is the root; `lineage`, `profiler`, `classification`, and `usage` declare `dependencies: [{name}-metadata]` and run in parallel after the root completes.
* For `service_type: lakehouse` → a `workflow` resource with a 1-node DAG containing just the `metadata` stage, regardless of `mode`.
* Every node uses the same `compute`, `resources`, `runAsUser`, `logLevel`, and `use.projection` from your spec. Every node receives a stack-level catalog sink injected by Nilus; you do not author, see, or override it on the customer resource.
* Each node's internal `stackSpec.source.options.source_table` is hardcoded by the template to that node's stage name.
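
To make the rules above concrete, the sketch below approximates what a `mode: shallow` Snowflake resource named `snowflake-metadata` could expand into. The overall layout (including the `workflow.dag` path) is an assumption for illustration, not the template's actual output; the schedule, the injected catalog sink, and other rendered fields are left out.

```yaml
# Illustrative approximation only: the layout of the rendered resource is assumed,
# not copied from the Nilus domain template.
name: snowflake-metadata
type: workflow
workflow:
  dag:
    - name: snowflake-metadata-metadata      # root node: {name}-metadata
      spec:
        compute: comet-compute
        stackSpec:
          source:
            address: dataos://snowflake-metadata-depot?purpose=rw
            options:
              service_type: snowflake
              source_table: "metadata"       # hardcoded by the template
    - name: snowflake-metadata-lineage       # {name}-lineage
      dependencies:
        - snowflake-metadata-metadata        # runs after the root succeeds
      spec:
        compute: comet-compute
        stackSpec:
          source:
            address: dataos://snowflake-metadata-depot?purpose=rw
            options:
              service_type: snowflake
              source_table: "lineage"        # hardcoded by the template
              query_log_duration: 3
              result_limit: 10000
```

In `deep` mode the same idea would extend to three more nodes (`profiler`, `classification`, `usage`), each depending only on the root.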

The Nilus resource you authored does not show the expanded workflow. To inspect it, use `kubectl get workflow {name}` (or the DataOS resource browser), which shows the materialized resource with all DAG nodes. The Nilus resource remains your single source of truth.

## Validation notes

* Use `type: nilus` with `spec.type: metadata` for catalog / lineage / profiler / classification / usage extraction.
* `mode` is **required** and must be `shallow` or `deep`. `shallow` runs `metadata` + `lineage`; `deep` adds `profiler`, `classification`, and `usage`. A metadata resource without `mode` is rejected at schema validation.
* `service_type` is required under `source.options`. This guide documents `snowflake`, `databricks`, and `lakehouse` (alias `iceberg`).
* You do **not** set `source_table` on a metadata resource; Nilus hardcodes one value per DAG node.
* Keep the authored `source.options` block to the seven customer-facing options in the table above (`service_type`, three filters, `threads`, `query_log_duration`, `result_limit`). Lower-level extraction knobs are internal until the Nilus domain template exposes them.
* For `service_type: lakehouse`, only the `metadata` stage runs in both modes. `mode` is still required, but lineage / profiler / classification / usage are skipped at template-render time.
* Always set `database_filter` / `schema_filter` / `table_filter` on production deployments. Unbounded scopes against a large warehouse can take hours per stage, and you pay that cost on every scheduled run.
* Use `spec.use.projection` only when the source URI is direct. Use `dataos://...purpose=rw` when the connection should come from a depot.
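
Pulling these notes together, a resource that carries only the required fields plus one recommended filter might look like the following sketch. The values are reused from the example config on this page, and the depot-backed address means no `spec.use.projection` block is needed.

```yaml
name: snowflake-metadata
version: v1alpha
type: nilus
spec:
  type: metadata          # required
  mode: shallow           # required: shallow or deep
  compute: comet-compute  # required
  source:                 # required
    address: dataos://snowflake-metadata-depot?purpose=rw
    options:
      service_type: snowflake   # required
      schema_filter:            # optional, but recommended for production
        includes:
          - "^MODEL"
```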

## Related docs

* [Metadata Sample Configs](/concepts/resources/nilus/metadata-pipelines/sample-configs.md)
* [Secrets and Projections](/concepts/resources/nilus/concepts/secrets-and-projections.md)
* [Understanding Batch Pipeline Config](/concepts/resources/nilus/batch/pipeline-config.md)
* [Understanding CDC Pipeline Config](/concepts/resources/nilus/cdc/service-config.md)
* [Understanding Stream Pipeline Config](/concepts/resources/nilus/stream.md)

