
# Metadata Pipelines

Moving data is only one part of making a platform usable. Teams also need to know what sources exist, which databases and schemas they contain, how columns are described, where lineage comes from, and whether query activity shows that an asset is actually being used.

Metadata pipelines answer that need for supported source systems. Instead of copying table rows into a destination, a metadata pipeline introspects a source and publishes catalog context into the DataOS metadata catalog.

For the field-by-field authoring contract, see [Understanding Metadata Pipeline Config](/concepts/resources/nilus/metadata-pipelines/pipeline-config.md). That page is the single reference for `source.options` and the YAML shape; this page explains the conceptual model only.

## Overview

Nilus models catalog extraction as `type: nilus` with `spec.type: metadata`. You author one Nilus resource per source system, and Nilus expands that resource into a workflow that runs one or more metadata extraction stages behind the scenes.

A required `mode` field groups those stages so you can trigger lighter and heavier work on different cadences:

* **`shallow`** runs source inventory (`metadata`) and lineage. It keeps the catalog structure and lineage current with a small footprint, so it can run frequently.
* **`deep`** runs everything `shallow` does and adds column profiling, classification, and usage analytics. It is the full enrichment pass and is usually scheduled less often.

For Snowflake and Databricks, `shallow` renders `metadata` + `lineage`, and `deep` adds `profiler`, `classification`, and `usage`. For DataOS Lakehouse, the rendered workflow runs only the `metadata` stage in either mode, because Lakehouse metadata support is catalog-inventory only today.
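The mode-to-stage mapping described above can be sketched in Python. This is a minimal illustration of the rule as stated on this page, not Nilus code: the function name `stages_for` and the source keys `snowflake`, `databricks`, and `lakehouse` are hypothetical labels chosen for the example.

```python
# Stage names as they appear on this page.
SHALLOW_STAGES = ["metadata", "lineage"]
DEEP_STAGES = SHALLOW_STAGES + ["profiler", "classification", "usage"]

# Hypothetical source keys, used only by this sketch.
WAREHOUSE_SOURCES = {"snowflake", "databricks"}
INVENTORY_ONLY_SOURCES = {"lakehouse"}


def stages_for(source: str, mode: str) -> list[str]:
    """Return the stages this page says are rendered for a source and mode."""
    if mode not in ("shallow", "deep"):
        raise ValueError(f"mode must be 'shallow' or 'deep', got {mode!r}")
    if source in INVENTORY_ONLY_SOURCES:
        # Lakehouse runs only the inventory stage in either mode.
        return ["metadata"]
    if source in WAREHOUSE_SOURCES:
        return list(DEEP_STAGES if mode == "deep" else SHALLOW_STAGES)
    raise ValueError(f"unsupported source: {source!r}")


if __name__ == "__main__":
    for source in ("snowflake", "lakehouse"):
        for mode in ("shallow", "deep"):
            print(source, mode, stages_for(source, mode))
```

Keeping `deep` defined as `shallow` plus extra stages mirrors the description above: a `deep` run never skips inventory or lineage.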

A metadata pipeline turns source context into catalog entities:

* The source connection becomes a browsable source entry.
* Databases, schemas, and namespaces become the hierarchy.
* Tables and views become datasets.
* Columns and descriptions become dataset detail.
* Owners, tags, profiles, lineage, usage, and classifications attach as enrichment where the source exposes them.

The first successful run usually establishes the source inventory; later runs deepen it with the enrichment signals the source supports. A common pattern is a frequent `shallow` pipeline paired with a less frequent `deep` pipeline over the same source.

## Core Capabilities

Metadata extraction is organized into the following stages:

### Source Inventory

The `metadata` stage is the foundation. It discovers the source service, database/catalog hierarchy, schemas or namespaces, tables, views, columns, and source-provided descriptions where available.

Every other stage depends on this foundation. If the source inventory is wrong or too broad, profiling, lineage, usage, and classification results become harder to validate.

### Lineage

Lineage explains which assets feed which, down to the column level, by parsing view definitions and query history. It runs in both `shallow` and `deep`, so even a lightweight pipeline keeps lineage current alongside the source inventory.

### Profiling And Classification

Profiling adds statistics such as row counts, null counts, distinct counts, min/max values, and basic distributions. Classification samples supported columns and applies generated sensitive-data tags when the classification path is available. Both run only in `deep`.

These stages are enrichment layers. They are useful after the source hierarchy is correct, but they should not be treated as a substitute for validating source scope and permissions.

### Usage

Usage is query-log driven for warehouse-style sources. It uses source query history to explain which assets are actively queried. Usage runs only in `deep`, and it can take longer than inventory because Nilus reads and processes query activity across a lookback window. The catalog can show source inventory and lineage before usage finishes.

### Lakehouse Metadata

DataOS Lakehouse metadata pipelines publish catalog inventory only. They can surface Lakehouse-backed datasets and column context, but they do not publish profiler, classification, lineage, or usage through this metadata path.

## Flow

1. Nilus resolves the source connection from a DataOS depot or a direct `metadata+...` URI, depending on the source. Databricks metadata uses the direct URI path only.
2. The metadata workflow runs the source inventory stage first.
3. For supported warehouse sources, the stages selected by `mode` run after inventory has succeeded: lineage in `shallow`; lineage plus profiler, classification, and usage in `deep`.
4. Nilus publishes the resulting source, dataset, column, profile, lineage, usage, and classification context into the DataOS metadata catalog.
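The ordering in the steps above, for a warehouse source, can be sketched as follows. This is an illustrative model only: `run_metadata_workflow` and the `run_stage` callback are hypothetical names, and running the follow-up stages one after another and continuing past a failed enrichment stage are assumptions of the sketch, not documented Nilus behavior.

```python
from typing import Callable


def run_metadata_workflow(mode: str, run_stage: Callable[[str], bool]) -> list[str]:
    """Run inventory first, then the stages selected by mode.

    `run_stage` is a hypothetical callback that runs one stage and returns
    True on success. The return value lists the stages that completed.
    """
    if mode not in ("shallow", "deep"):
        raise ValueError(f"mode must be 'shallow' or 'deep', got {mode!r}")

    completed: list[str] = []

    # Source inventory is the foundation; nothing else runs if it fails.
    if not run_stage("metadata"):
        return completed
    completed.append("metadata")

    followups = ["lineage"]
    if mode == "deep":
        followups += ["profiler", "classification", "usage"]

    # Assumption: a failed enrichment stage does not block the others.
    for stage in followups:
        if run_stage(stage):
            completed.append(stage)
    return completed


if __name__ == "__main__":
    print(run_metadata_workflow("deep", lambda stage: True))
```

The early return after a failed inventory stage reflects the dependency described under Source Inventory: enrichment results are only meaningful against a correct hierarchy.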

## Constraints

* `mode` (`shallow` or `deep`) is required on every metadata pipeline. The value you choose controls which stages run, not whether the pipeline is valid.
* Metadata pipelines need source credentials with metadata-read permissions, not just data-read permissions.
* Warehouse sources that publish lineage or usage also need access to query-history surfaces.
* Production deployments should use `database_filter`, `schema_filter`, and `table_filter` to keep the extraction scope bounded.
* Keep the source connection stable. Changing depot names or source identity can create a new source entry instead of updating the existing one.
* Use the supported `source.options` on the config page. Do not set `source_table` on the user-authored metadata resource; Nilus assigns stage-specific values internally.
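Some of the constraints above lend themselves to a pre-flight check before a resource is applied. The sketch below is one way that could look, assuming the relevant settings have been collected into a flat dictionary. That flat shape and the function name `check_metadata_config` are hypothetical; the real nesting of these fields is defined on the config page.

```python
SCOPE_FILTERS = ("database_filter", "schema_filter", "table_filter")
VALID_MODES = ("shallow", "deep")


def check_metadata_config(config: dict) -> tuple[list[str], list[str]]:
    """Return (errors, warnings) for a flat, hypothetical view of the config."""
    errors: list[str] = []
    warnings: list[str] = []

    # `mode` is required and must be one of the two documented values.
    if config.get("mode") not in VALID_MODES:
        errors.append("mode is required and must be 'shallow' or 'deep'")

    # `source_table` is assigned internally per stage and must not be user-set.
    if "source_table" in config:
        errors.append("do not set source_table on a metadata resource")

    # Filters are a production recommendation, so this is only a warning.
    if not any(config.get(name) for name in SCOPE_FILTERS):
        warnings.append("no database/schema/table filter set; scope is unbounded")

    return errors, warnings


if __name__ == "__main__":
    errors, warnings = check_metadata_config({"mode": "deep"})
    print("errors:", errors)
    print("warnings:", warnings)
```

Permission-related constraints, such as metadata-read access and query-history access, cannot be checked statically this way and still need to be verified against the source.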

## Related Docs

* [Metadata Sources](/concepts/resources/nilus/metadata-pipelines/metadata-sources.md)
* [Understanding Metadata Pipeline Config](/concepts/resources/nilus/metadata-pipelines/pipeline-config.md)
* [Metadata Sample Configs](/concepts/resources/nilus/metadata-pipelines/sample-configs.md)
* [Snowflake (Metadata)](/concepts/resources/nilus/metadata-pipelines/metadata-sources/snowflake-metadata.md)
* [Databricks (Metadata)](/concepts/resources/nilus/metadata-pipelines/metadata-sources/databricks-metadata.md)
* [DataOS Lakehouse (Metadata)](/concepts/resources/nilus/metadata-pipelines/metadata-sources/dataos-lakehouse-metadata.md)

