
# Data quality

Data Quality (DQ) rule packs are sets of non-blocking validation rules that monitor your data over time. Unlike assertions, which stop execution when they fail, DQ checks let models run while surfacing warnings and building a historical picture of your data's health.

***

## When to use DQ checks vs assertions

| Use DQ checks for                        | Use Assertions instead for                   |
| ---------------------------------------- | -------------------------------------------- |
| Monitoring quality trends over time      | Critical rules that must block bad data      |
| Anomaly detection (sudden drops, spikes) | Simple NULL/uniqueness checks on key columns |
| Non-critical warnings                    | Inline business rule enforcement             |
| Cross-table consistency monitoring       | Any rule where failure should stop execution |

***

## The three-layer quality strategy

```
┌─────────────────────────────────────────┐
│  ASSERTIONS (Critical: Blocks models)   │
│  • Primary keys must be unique          │
│  • Revenue must be non-negative         │
└─────────────────────────────────────────┘
              ↓
┌─────────────────────────────────────────┐
│  DQ CHECKS (Monitoring: Non-Blocking)   │
│  • Row count within expected range      │
│  • Anomaly detection on metrics         │
│  • Cross-table consistency              │
└─────────────────────────────────────────┘
              ↓
┌─────────────────────────────────────────┐
│  PROFILES (Observation: Metrics)        │
│  • Track null percentages               │
│  • Monitor column distributions         │
│  • Detect data drift                    │
└─────────────────────────────────────────┘
```

DQ checks and profiles are configured together in YAML files that declare `kind: dq`. Assertions are defined separately, in the `MODEL()` block or in `.sql` files, so assertions and DQ rule packs can coexist for the same model.

***

## File structure

One file per model, placed in `dq/`:

```
dq/
├── fct_daily_sales.yml
├── fct_weekly_sales.yml
├── dim_customer_profile.yml
├── dim_product_profile.yml
├── rfm_customer_segmentation.yml
├── sales_funnel_analysis.yml
└── referential_integrity.yml
```

***

## Worked example: DQ pack for daily sales

`dq/fct_daily_sales.yml` from `orders-analytics` monitors the `silver.fct_daily_sales` table after each run:

```yaml
kind: dq
name: fct_daily_sales_dq
depends_on: silver.fct_daily_sales

profiles:
  - order_date
  - region_name
  - category
  - total_revenue
  - shipment_rate

rules:
  - row_count >= 20:
      name: minimum_daily_sales_rows
      dimension: completeness
      description: Daily sales should contain at least 20 rows from the seed data
  - missing_count(order_date) = 0:
      name: no_missing_order_date
      dimension: completeness
      description: Order date is required
  - missing_count(customer_id) = 0:
      name: no_missing_customer_id
      dimension: completeness
      description: Customer id is required
  - invalid_count(total_revenue) = 0:
      valid min: 0
      name: total_revenue_non_negative
      dimension: validity
      description: Revenue must be non-negative
  - invalid_count(shipment_rate) = 0:
      valid min: 0
      valid max: 1
      name: shipment_rate_between_zero_and_one
      dimension: validity
      description: Shipment rate must be between 0 and 1
```

What this DQ pack does:

* Profiles five columns. Vulcan collects null counts, distinct counts, min, max, and distribution stats for each on every run. These are stored and viewable over time.
* Checks that at least 20 rows exist after each run.
* Checks that `order_date` and `customer_id` are never null.
* Uses `invalid_count` with a `valid min`/`valid max` range to check that `total_revenue` is non-negative and `shipment_rate` is between 0 and 1.

These rules do not block the model if they fail. They surface as warnings and build a historical record of data quality over time.

***

## Basic syntax

```yaml
kind: dq
name: <name>_dq
depends_on: <schema.table_name>

rules:
  - missing_count(column_name) = 0:
      name: <rule_name>
      dimension: completeness
      description: "<what this checks>"
```

**Required fields:**

* `kind: dq`: declares this as a DQ rule pack.
* `name`: unique identifier for the pack.
* `depends_on`: the fully-qualified model this pack validates.
* `rules`: list of rules (shorthand or full form).

***

## Rule forms

**Shorthand**: the expression alone, with no metadata.

```yaml
rules:
  - missing_count(user_id) = 0
  - duplicate_count(order_id) = 0
  - row_count > 1000
```

**Full form**: the expression as a YAML key, with metadata nested beneath it.

```yaml
rules:
  - missing_count(email) = 0:
      name: no_missing_emails
      dimension: completeness
      description: "All customers must have an email"
      severity: error
      tags: [critical, daily]
      owner: data-team
```
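
Because `rules` is a list, a single pack could plausibly combine both forms. The sketch below assumes that mixing shorthand and full-form entries in one list is accepted; this page describes the list as "shorthand or full form" but does not show a mixed list. The pack name and model name are hypothetical.

```yaml
kind: dq
name: customers_dq                # hypothetical pack name
depends_on: analytics.customers   # hypothetical model

rules:
  # Shorthand: the expression only, no metadata attached
  - row_count > 1000
  # Full form: the expression becomes a key, metadata is nested under it
  - missing_count(email) = 0:
      name: no_missing_emails
      dimension: completeness
      description: "All customers must have an email"
```

Shorthand keeps low-stakes checks terse. The full form is worth the extra lines when a rule needs a name, a dimension, or a description.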

***

## Built-in rule types

### Missing data

```yaml
rules:
  - missing_count(email) = 0:
      dimension: completeness
  - missing_percent(phone) < 5:
      dimension: completeness
```

### Row count

```yaml
rules:
  - row_count > 1000:
      dimension: completeness
  - row_count between 5000 and 15000:
      dimension: completeness
```

### Uniqueness

```yaml
rules:
  - duplicate_count(email) = 0:
      dimension: uniqueness
  - duplicate_count(customer_id, order_date) = 0:
      dimension: uniqueness
      description: "One order per customer per day"
```

### Custom SQL (failed rows)

Write any SQL that returns the invalid rows. The rule fails if the query returns any rows:

```yaml
rules:
  - failed rows:
      name: invalid_emails
      dimension: validity
      fail query: |
        SELECT user_id, email
        FROM analytics.users
        WHERE email NOT LIKE '%@%'
      samples limit: 10
```
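
The same pattern could express a cross-table consistency check. The sketch below is illustrative only and is not the contents of the `referential_integrity.yml` file listed earlier. It assumes a `silver.dim_customer_profile` table with a `customer_id` column; the pack and rule names are also hypothetical.

```yaml
kind: dq
name: referential_integrity_dq        # hypothetical pack name
depends_on: silver.fct_daily_sales

rules:
  - failed rows:
      name: sales_customer_exists     # hypothetical rule name
      dimension: consistency
      fail query: |
        -- Anti-join: returns sales rows with no matching customer row.
        -- silver.dim_customer_profile and its customer_id column are assumed.
        SELECT s.customer_id, s.order_date
        FROM silver.fct_daily_sales AS s
        LEFT JOIN silver.dim_customer_profile AS c
          ON s.customer_id = c.customer_id
        WHERE c.customer_id IS NULL
      samples limit: 10
```

Sales rows whose `customer_id` is NULL also match nothing in the join, so this query returns them as well. Pairing it with a `missing_count(customer_id) = 0` rule lets the two failure modes be reported separately.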

### Numeric aggregations

```yaml
rules:
  - avg(revenue) between 100 and 10000:
      dimension: accuracy
  - min(price) >= 0:
      dimension: validity
```

### Anomaly detection and change monitoring

Vulcan also supports anomaly detection rules (which learn a baseline from previous runs and flag deviations) and change monitoring rules (which detect sudden drops or spikes between runs). `orders-analytics` does not use these patterns. For the full reference, see [Data Quality](/concepts/resources/vulcan/components/data-quality.md) in the Vulcan book.

***

## Data quality dimensions

Use `dimension:` to classify what aspect of quality a rule measures:

| Dimension      | What it measures                         |
| -------------- | ---------------------------------------- |
| `completeness` | No missing required data                 |
| `validity`     | Data conforms to format or syntax rules  |
| `accuracy`     | Data matches expected ranges or patterns |
| `consistency`  | Data agrees across sources               |
| `uniqueness`   | No duplicate records                     |
| `timeliness`   | Data is fresh and up-to-date             |
| `conformity`   | Follows defined standards                |
| `coverage`     | All expected records are present         |

***

## Profiling

Profiles automatically collect statistical metrics about your columns over time: null count, distinct count, distribution, min, max, average. They observe and track rather than validate.

Enable profiling alongside rules in the same file:

```yaml
kind: dq
name: orders_dq
depends_on: analytics.orders

profiles:
  - revenue
  - order_count
  - customer_tier

rules:
  - row_count > 1000:
      dimension: completeness
```

Profiles are stored in the `_check_profiles` table. Query them to understand what is normal for your data, then use that to set informed rule thresholds.
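
As a sketch of that workflow, suppose the stored profiles showed the average of `revenue` sitting between roughly 200 and 800 across recent runs. These numbers are invented for illustration and do not come from any real dataset. A rule could then bracket that range with some headroom:

```yaml
kind: dq
name: orders_dq
depends_on: analytics.orders

profiles:
  - revenue

rules:
  # Illustrative thresholds: set just outside the range the profiles
  # were assumed to show, so that typical runs pass.
  - avg(revenue) between 150 and 900:
      name: avg_revenue_in_expected_range   # hypothetical rule name
      dimension: accuracy
```

Bounds that sit slightly outside the observed range reduce noisy warnings while still flagging a large shift.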

***

## Filtering

Vulcan supports pack-level and per-rule SQL filters to focus rules on a subset of rows. `orders-analytics` does not use filters in its DQ files. For the full reference, see [Data Quality](/concepts/resources/vulcan/components/data-quality.md) in the Vulcan book.

***

## Running DQ checks

DQ checks run automatically when models execute via `vulcan plan` or `vulcan run`. To run them on demand:

```bash
# Run all checks
vulcan check

# Run checks for a specific model
vulcan check --select analytics.daily_sales
```

