
# Scan metadata

Use the Nilus **metadata scan workflow** to collect technical and business context from connected systems and register it in DataOS for discovery. A Nilus metadata workflow supports the following **metadata scan types**:

| Metadata scan type   | What gets extracted                                                |
| -------------------- | ------------------------------------------------------------------ |
| **Catalog metadata** | Databases, schemas, tables, and columns                            |
| **Column profiles**  | Null counts, distinct counts, min/max values, basic distributions  |
| **Classification**   | Auto-generated classification tags (e.g., possible PII indicators) |
| **Lineage**          | Column-level lineage from query history                            |
| **Usage**            | Query history and usage frequency from query log                   |

> Metadata scan capabilities depend on the connected source system. See the Concepts section for more details.

After metadata is extracted, users can discover and inspect datasets in DataOS through metadata-aware experiences such as Datasets, Products, lineage, and profiling views.

***

## When to use this workflow

Use this workflow when:

* Metadata from a source system is not yet available for discovery in DataOS.

***

## Before you start

Make sure you have:

* Access to the DataOS tenant where metadata should be registered.
* A source connection available through a DataOS Depot or an approved connection method.
* Permission to read metadata from the source system.
* Permission to read query history if lineage and usage are required.
* A compute profile available for the workflow.
* A clear scope for extraction, such as database, schema, or table filters.
* A schedule decision, such as a manual run or a scheduled refresh.

***

## Recommended user flow

Use this flow when setting up metadata extraction:

```
Choose source -> Define scope -> Configure workflow -> Run workflow -> Monitor run -> Verify metadata in DataOS
```

{% stepper %}
{% step %}

### Choose the source

Start by choosing the source system from which metadata should be extracted.

The source you choose determines what connection details and metadata capabilities are available.
{% endstep %}

{% step %}

### Confirm the connection

Metadata extraction needs a connection to the source system.

Use a DataOS Depot when one is available. A Depot keeps connection details and credentials outside the workflow definition, so the workflow only needs to reference the source by its Depot address.

Example source address pattern:

```yaml
source:
  address: dataos://<depot-name>?purpose=rw
```

Use a direct connector URI only when the service is not backed by a Depot and your team has approved that pattern.
{% endstep %}

{% step %}

### Define the extraction scope

Do not scan everything unless you intentionally need to.

Define the scope using filters so the workflow extracts metadata only from relevant databases, schemas, and tables.

| Filter            | Use it to                                                      |
| ----------------- | -------------------------------------------------------------- |
| `database_filter` | Limit extraction to selected databases, projects, or catalogs. |
| `schema_filter`   | Limit extraction to selected schemas or namespaces.            |
| `table_filter`    | Include or exclude specific tables or views.                   |

Example:

```yaml
source:
  options:
    service_type: snowflake
    database_filter:
      includes:
        - "SP_TEST_DB"
        - "ANALYTICS_DB"
    schema_filter:
      includes:
        - "^MODEL"
      excludes:
        - "^TMP_"
    table_filter:
      excludes:
        - "^_audit"
```

Use filters, especially in production. Large unfiltered warehouses can take a long time to scan.
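
Before running a scan, it can help to check locally which names a set of patterns would select. The following is a minimal sketch, not the Nilus implementation. It assumes that filter entries are regular expressions searched against the object name, that an `excludes` match overrides an `includes` match, and that matching is case-sensitive; none of these details are confirmed on this page. The helper `is_selected` and the sample schema names are hypothetical.

```python
import re

def is_selected(name, includes=None, excludes=None):
    """Return True if `name` passes the include/exclude patterns.

    Assumptions (not confirmed by the Nilus docs): patterns are regular
    expressions, an exclude match wins, and matching is case-sensitive.
    """
    if includes and not any(re.search(p, name) for p in includes):
        return False
    if excludes and any(re.search(p, name) for p in excludes):
        return False
    return True

# Hypothetical schema names, checked against the schema_filter shown above.
schemas = ["MODEL_SALES", "TMP_MODEL", "GOLD_FINANCE", "STAGING"]
selected = [
    s for s in schemas
    if is_selected(s, includes=["^MODEL"], excludes=["^TMP_"])
]
print(selected)  # ['MODEL_SALES']
```

Because `^` anchors a pattern to the start of the name, `^_audit` would exclude `_audit_log` but not `orders_audit`. Snowflake stores unquoted identifiers in uppercase, so if matching is case-sensitive, a lowercase pattern such as `^_audit` may not match `_AUDIT_LOG`; confirm this against a small test scan.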
{% endstep %}

{% step %}

### Decide query history coverage

If you want lineage and usage, decide how much query history to collect.

| Option               | Purpose                                                |
| -------------------- | ------------------------------------------------------ |
| `query_log_duration` | Number of days of query history to ingest per run.     |
| `result_limit`       | Maximum number of query-history rows to fetch per run. |

Example:

```yaml
query_log_duration: 3
result_limit: 10000
```

Use a small window first, then expand it only if you need deeper lineage or usage history.
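
To reason about how the window interacts with a schedule, the arithmetic can be sketched as follows. This assumes the window is counted back in whole days from the start of each run, which this page does not state explicitly. The function names are hypothetical, and how Nilus handles history that consecutive runs both cover is not described here.

```python
from datetime import datetime, timedelta, timezone

def query_log_window(run_time, query_log_duration):
    """Return (start, end) of the query-history window for one run.

    Assumption: the window is `query_log_duration` days back from run time.
    """
    return run_time - timedelta(days=query_log_duration), run_time

def overlap_hours(query_log_duration, hours_between_runs):
    """Hours of query history that two consecutive runs would both cover."""
    return max(0, query_log_duration * 24 - hours_between_runs)

run_time = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
start, end = query_log_window(run_time, query_log_duration=3)
print(start.isoformat(), "->", end.isoformat())

# A 3-day window refreshed every 6 hours re-reads most of the same history.
print(overlap_hours(3, 6))  # 66
```

Under these assumptions, a short window combined with a frequent schedule still covers recent history, which is one reason to start small.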
{% endstep %}

{% step %}

### Choose the refresh schedule

Decide whether to refresh metadata manually or on a schedule.

Use a schedule when metadata must stay current for discovery, profiling, lineage, or usage analysis.

Example scheduled refresh:

```yaml
schedule:
  crons:
    - "0 */6 * * *"
  timezone: UTC
  concurrencyPolicy: Forbid
```

If no schedule is defined, run the workflow manually when needed.
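
In standard cron syntax, `0 */6 * * *` means minute 0 of every sixth hour. The short sketch below expands the hour field to show the resulting run times in the configured timezone. It handles only the `*/N` step form used in this example and is not a general cron parser; `hours_for_step` is a hypothetical helper.

```python
def hours_for_step(step, first=0, last=23):
    """Expand a cron hour field of the form */step into the matching hours."""
    return list(range(first, last + 1, step))

run_hours = hours_for_step(6)
print([f"{h:02d}:00" for h in run_hours])  # ['00:00', '06:00', '12:00', '18:00']
```

With this schedule the workflow would start four times a day. If a scan of your scoped source takes longer than six hours, consider a wider interval.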
{% endstep %}

{% step %}

### Create the workflow definition

Create one metadata workflow definition (a `.yaml` file) for each source service you want to catalog.
{% endstep %}

{% step %}

### Run the workflow

Apply the workflow definition using the DataOS CLI.

```bash
dataos-ctl resource apply -f ./snowflake-metadata.yaml -w <your-workspace>
```

{% endstep %}

{% step %}

### Monitor the run

After triggering the workflow, monitor the run status.

```bash
dataos-ctl resource get -t nilus -n snowflake-metadata -w <your-workspace>
```

{% endstep %}

{% step %}

### Verify metadata in DataOS

After the workflow completes, verify the extracted metadata in DataOS.

Go to:

```
DataOS Home -> Datasets
```

Then check:

| Area               | What to verify                                                               |
| ------------------ | ---------------------------------------------------------------------------- |
| Source list        | The connected source appears.                                                |
| Dataset list       | Tables and views from the scoped source are visible.                         |
| Overview           | Columns, tags, descriptions, and update information appear.                  |
| Lineage            | Table-level or column-level lineage appears where query history supports it. |
| Queries            | Query activity appears when usage extraction is enabled and supported.       |
| Productized status | Datasets show whether they are already used in a Data Product.               |

If metadata is missing, review the configured scope, connection, service type, and permissions.

<figure><img src="/files/AntvYmvsBFh1y2UrW7hD" alt=""><figcaption></figcaption></figure>
{% endstep %}
{% endstepper %}

***

## Examples using Snowflake as a source

#### Using a Depot

```yaml
name: snowflake-metadata
version: v1alpha
type: nilus
tags: [nilus, metadata]
spec:
  type: metadata
  compute: comet-compute
  schedule:
    crons:
      - "0 */6 * * *"
    concurrencyPolicy: Forbid
  source:
    address: dataos://snowflake-metadata-depot?purpose=rw
    options:
      service_type: snowflake
      database_filter:
        includes: ["SP_TEST_DB", "ANALYTICS_DB"]
      schema_filter:
        includes: ["^MODEL", "^GOLD_"]
        excludes: ["^TMP_"]
      table_filter:
        excludes: ["^_audit"]
      query_log_duration: 3
      result_limit: 10000
```

#### Using direct URI

```yaml
name: snowflake-metadata-direct
version: v1alpha
type: nilus
tags: [nilus, metadata]
spec:
  type: metadata
  compute: comet-compute
  schedule:
    crons:
      - "0 */6 * * *"
    concurrencyPolicy: Forbid
  use:
    projection:
      secrets:
        - id: engineering:snowflake-secret
          contextAlias: snowsecret
      projections:
        envVars:
          - key: SF_USER
            template: "{{ secrets['snowsecret'].user | base64_decode }}"
          - key: SF_PASSWORD
            template: "{{ secrets['snowsecret'].password | base64_decode }}"
  source:
    address: metadata+snowflake://{SF_USER}:{SF_PASSWORD}@xy12345.snowflakecomputing.com/PROD_DB?warehouse=METADATA_WH&role=METADATA_RO
    options:
      service_type: snowflake
      database_filter:
        includes: ["SP_TEST_DB"]
      schema_filter:
        includes: ["^MODEL"]
      query_log_duration: 3
      result_limit: 10000
```

{% hint style="info" %}
For direct connector URIs, project credentials under `spec.use.projection` and reference them as `{ENV_VAR}` placeholders in `source.address`.
{% endhint %}
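
A common mistake with this pattern is a placeholder in `source.address` that has no matching projected environment variable. The sketch below is a local pre-apply check, not part of Nilus or the DataOS CLI. It assumes placeholders use the single-brace `{NAME}` form shown above, and the helper `placeholders` is hypothetical.

```python
from string import Formatter

def placeholders(address):
    """Return the set of {NAME} placeholders found in a connector URI."""
    return {field for _, field, _, _ in Formatter().parse(address) if field}

address = (
    "metadata+snowflake://{SF_USER}:{SF_PASSWORD}"
    "@xy12345.snowflakecomputing.com/PROD_DB"
    "?warehouse=METADATA_WH&role=METADATA_RO"
)

# Keys listed under spec.use.projection.projections.envVars in the example.
projected = {"SF_USER", "SF_PASSWORD"}

missing = placeholders(address) - projected
print(sorted(missing))  # [] means every placeholder has a projected variable
```

Running a check like this before applying the definition can catch a renamed or misspelled key early.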

For more information about configuration elements, see the [Nilus Metadata Pipelines](/concepts/resources/nilus/metadata-pipelines.md) section.

