> For the complete documentation index, see [llms.txt](https://v2.dataos.info/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://v2.dataos.info/build/get-started/build-your-first-data-product.md).

# Build your first Data Product

### The challenge

A retail team sells in 4 regions: US-West, US-East, EU-West, and APAC. Every week, leadership asks:

* How much did we sell?
* Which regions are ahead?
* Can we trust the numbers?

The data exists, but it is split across 3 files: customers, orders, and line items. Different people pull different exports, formulas do not match, and nobody is sure which total is correct.

### What you will build

You build a Data Product: a governed business asset on top of the source data. A single unit where:

* **CSV files are defined as seed inputs** and materialized in the chosen engine.
* **Rules are enforced** to catch bad or inconsistent rows.
* **Business terms are defined once** (region, revenue, order count) instead of being redefined in every spreadsheet.
* **One metric is published,** so everyone uses the same definition of *daily revenue by region*.

Dashboards, reports, and AI assistants then read from that layer instead of rebuilding the logic from scratch.

Use **Vulcan** to define, validate, and test the artifacts of a Data Product locally before publishing them to DataOS. You work inside a Vulcan project scaffold with static CSV inputs to build the **Orders Analytics Data Product**: transformation models, quality gates (blocking assertions and warning checks), tests, semantic definitions, and business metrics. Vulcan runs locally, so you can plan, run, and verify the project before moving it to DataOS.

Follow the steps in order.

***

### Expected outcome

By the end of this guide, you will have a minimum working Data Product built from raw customer, order, and order-item inputs. These inputs are defined as `SEED` models and transformed into a reusable enriched view that combines customer and order details with order-line data.

You also add quality checks through model assertions and post-transformation checks, define a semantic layer with dimensions, measures, segments, and joins across the 3 entities, and create a daily revenue metric by customer region. Once deployed, the Data Product is queryable through the API surfaces Vulcan exposes, such as REST, GraphQL, and the MySQL wire protocol.

***

### Before you start

* [ ] Terminal open at the project root: `cd my-first-dp`
* [ ] Text editor to create the files required

***

## Part A: Environment setup

{% stepper %}
{% step %}

### Follow the steps for [LDK setup](/build/readme/ldk-setup.md).

Install Vulcan, configure the engine, initialize the project, and understand the generated structure.

Also, complete the Postgres engine-specific configuration during Vulcan installation.
{% endstep %}

{% step %}

### Initialize the Vulcan project

From the project root, run:

```bash
vulcan init
```

Vulcan prompts you for a few choices. For this example, select:

| Prompt               | Select     | Why                                                           |
| -------------------- | ---------- | ------------------------------------------------------------- |
| **Project type**     | `DEFAULT`  | Standard data-product layout (models, seeds, checks, config)  |
| **Execution engine** | `postgres` | Matches the `warehouse` / `statestore` containers from Step 2 |

<details>

<summary><strong><code>vulcan init</code> creates</strong></summary>

```
my-vulcan-project/
├── config.yaml
├── usage.yaml
├── audits/
├── dq/
│   └── full_model.yml
├── macros/
│   └── __init__.py
├── models/
│   ├── full_model.sql
│   ├── incremental_model.sql
│   ├── seed_model.sql
│   ├── metrics/
│   │   └── event_activity.yml
│   └── semantics/
│       └── incremental_model.yml
├── seeds/
│   └── seed_data.csv
└── tests/
    └── test_full_model.yaml
```

</details>

**Success:** `config.yaml` exists, and you have `models/`, `seeds/`, and `dq/` directories ready.
{% endstep %}

{% step %}

### Configure `config.yaml`

Open `config.yaml` that `vulcan init` created (or that already exists in this repo). Set `name: workspace` so models use the `workspace.*` schema. Edit your config.yaml with the values given below.

<details>

<summary><strong><code>config.yaml</code></strong></summary>

```yaml
name: workspace
tenant: default
description: Order and revenue analytics data product.
gateways:
  postgres:
    connection:
      type: postgres
      host: warehouse
      port: 5432
      database: warehouse
      user: vulcan
      password: vulcan
    state_connection:
      type: postgres
      host: statestore
      port: 5432
      database: statestore
      user: vulcan
      password: vulcan

default_gateway: postgres

model_defaults:
  dialect: postgres
  start: 2025-01-01
  cron: '@daily'

linter:
  enabled: true
  rules:
    - ambiguousorinvalidcolumn
    - invalidselectstarexpansion
    - noambiguousprojections
```

</details>

**Check:**

* [ ] `host` is `warehouse` / `statestore`
* [ ] Save the file before continuing.
  {% endstep %}

{% step %}

### Verify Vulcan can connect

```bash
vulcan info
```

<details>

<summary><strong>Expected output</strong></summary>

```shellscript
my-first-dp % vulcan info
Models: 3
Macros: 0
Data warehouse connection succeeded
State backend connection succeeded
```

</details>

**Success:** No `could not translate host name "statestore"` error.

**Do not continue** until this works.
{% endstep %}
{% endstepper %}

***

## Part B: Build the data product

{% file src="/files/ZA5qYZ41qaxdrpPCDl7j" %}

Download the sample files and follow the steps given below.

{% stepper %}
{% step %}

### Add seed data (CSV files)

Place three CSV files under `seeds/`. They are the **source of truth** for this product.

{% tabs %}
{% tab title="raw\_customers.csv" %}
Path: `seeds/raw_customers.csv`

Customer identity and region lookup.

| Column        | Type    | Example                                 |
| ------------- | ------- | --------------------------------------- |
| `customer_id` | integer | `1`                                     |
| `name`        | text    | `Alice Johnson`                         |
| `email`       | text    | `alice.johnson@example.com`             |
| `region`      | text    | `US-West`, `US-East`, `EU-West`, `APAC` |
| {% endtab %}  |         |                                         |

{% tab title="raw\_orders.csv" %}
Path: `seeds/raw_orders.csv`

Order header data linked to customers.

| Column        | Type    | Example                       |
| ------------- | ------- | ----------------------------- |
| `order_id`    | integer | `1001`                        |
| `customer_id` | integer | `1` (must exist in customers) |
| `order_date`  | date    | `2025-01-05`                  |
| {% endtab %}  |         |                               |

{% tab title="raw\_order\_items.csv" %}
Path: `seeds/raw_order_items.csv`

Line-level order details used for quantity and revenue.

| Column        | Type    | Example                       |
| ------------- | ------- | ----------------------------- |
| `order_id`    | integer | `1001` (must exist in orders) |
| `item_id`     | integer | `1`                           |
| `product_id`  | text    | `P001`                        |
| `quantity`    | integer | `2`                           |
| `unit_price`  | decimal | `29.99`                       |
| {% endtab %}  |         |                               |
| {% endtabs %} |         |                               |

**Success:** All three files exist with a header row and at least one data row. Keep dates in **January 2025** so they align with `model_defaults.start`.
{% endstep %}

{% step %}

### Add SEED models

SEED models load CSVs into the warehouse. Create **one `.sql` file per table** under `models/`.

{% tabs %}
{% tab title="raw\_customers.sql" %}
Path: `models/raw_customers.sql`

```sql
MODEL (
  name workspace.raw_customers,
  kind SEED (
    path '../seeds/raw_customers.csv'
  ),
  columns (
    customer_id INTEGER,
    name TEXT,
    email TEXT,
    region TEXT
  ),
  grain customer_id,
  assertions (
    not_null(columns := (customer_id, name, email, region)),
  ),
);
```

{% endtab %}

{% tab title="raw\_orders.sql" %}
Path: `models/raw_orders.sql`

```sql
MODEL (
  name workspace.raw_orders,
  kind SEED (
    path '../seeds/raw_orders.csv'
  ),
  columns (
    order_id INTEGER,
    customer_id INTEGER,
    order_date DATE
  ),
  grain order_id,
  references customer_id,
  assertions (
    not_null(columns := (order_id, customer_id, order_date)),
  ),
);
```

`references customer_id` links orders to the customer grain for joins and semantics.
{% endtab %}

{% tab title="raw\_order\_items.sql" %}
Path: `models/raw_order_items.sql`

```sql
MODEL (
  name workspace.raw_order_items,
  kind SEED (
    path '../seeds/raw_order_items.csv'
  ),
  columns (
    order_id INTEGER,
    item_id INTEGER,
    product_id TEXT,
    quantity INTEGER,
    unit_price DOUBLE PRECISION
  ),
  grain (order_id, item_id),
  references order_id,
  assertions (
    not_null(columns := (order_id, item_id, product_id, quantity, unit_price)),
    forall(criteria := (quantity > 0)),
    forall(criteria := (unit_price >= 0)),
  ),
);
```

{% endtab %}
{% endtabs %}

**Assertions** block bad rows when the model materializes (nulls, non-positive quantities, negative prices).

Assertions are declared inside the `MODEL(...)` definition and block execution if they fail. They are your first line of defence.

**Success:** Three files under `models/`. Run a quick lint preview:

```bash
vulcan plan
```

You should see three models to add.
{% endstep %}

{% step %}

### Transformations

Create `models/order_lines_enriched.sql` as a `VIEW` model. It joins the 3 seed tables into one analytics-ready dataset with customer, order, and line-level revenue details:

<details>

<summary><strong><code>models/order_lines_enriched.sql</code></strong></summary>

```sql
MODEL (
  name workspace.order_lines_enriched,
  kind VIEW,
  grain (order_id, item_id),
  assertions (
    not_null(columns := (
      order_id,
      item_id,
      customer_id,
      order_date,
      region,
      line_revenue
    )),
    forall(criteria := (line_revenue >= 0)),
  ),
);

SELECT
  i.order_id,
  i.item_id,
  i.product_id,
  i.quantity,
  i.unit_price,
  (i.quantity * i.unit_price)::DOUBLE PRECISION AS line_revenue,
  o.customer_id,
  o.order_date,
  c.name AS customer_name,
  c.email AS customer_email,
  c.region
FROM workspace.raw_order_items AS i
INNER JOIN workspace.raw_orders AS o
  ON i.order_id = o.order_id
INNER JOIN workspace.raw_customers AS c
  ON o.customer_id = c.customer_id;
```

</details>

**Success:** File exists under `models/`. `vulcan plan` should show `workspace.order_lines_enriched` as a new VIEW (depends on the three seeds).
{% endstep %}

{% step %}

### Add quality checks

Checks are non-blocking and observable. They run after model execution and flag business-rule violations. Create **one file per model** under `dq/`.

<details>

<summary><strong><code>dq/raw_customers.yml</code></strong></summary>

```yaml
checks:
  workspace.raw_customers:
    completeness:
      - missing_count(customer_id) = 0:
          name: customer_id_not_null
          attributes:
            description: "Customer ID must not be null"
      - missing_count(name) = 0:
          name: name_not_null
          attributes:
            description: "Customer name must not be null"
      - missing_count(email) = 0:
          name: email_not_null
          attributes:
            description: "Customer email must not be null"
      - row_count > 0:
          name: has_rows
          attributes:
            description: "Customers seed must have at least one row"
    validity:
      - failed rows:
          name: positive_customer_id
          fail query: |
            SELECT customer_id, name
            FROM workspace.raw_customers
            WHERE customer_id <= 0
          samples limit: 10
          attributes:
            description: "Customer ID must be positive"
      - failed rows:
          name: unique_customer_id
          fail query: |
            SELECT customer_id, COUNT(*) AS duplicate_count
            FROM workspace.raw_customers
            GROUP BY customer_id
            HAVING COUNT(*) > 1
          samples limit: 10
          attributes:
            description: "Customer ID must be unique"
```

</details>

<details>

<summary><strong><code>dq/raw_orders.yml</code></strong></summary>

```yaml
checks:
  workspace.raw_orders:
    completeness:
      - missing_count(order_id) = 0:
          name: order_id_not_null
          attributes:
            description: "Order ID must not be null"
      - missing_count(customer_id) = 0:
          name: customer_id_not_null
          attributes:
            description: "Order customer ID must not be null"
      - missing_count(order_date) = 0:
          name: order_date_not_null
          attributes:
            description: "Order date must not be null"
      - row_count > 0:
          name: has_rows
          attributes:
            description: "Orders seed must have at least one row"
    validity:
      - failed rows:
          name: positive_order_id
          fail query: |
            SELECT order_id, customer_id
            FROM workspace.raw_orders
            WHERE order_id <= 0
          samples limit: 10
          attributes:
            description: "Order ID must be positive"
      - failed rows:
          name: unique_order_id
          fail query: |
            SELECT order_id, COUNT(*) AS duplicate_count
            FROM workspace.raw_orders
            GROUP BY order_id
            HAVING COUNT(*) > 1
          samples limit: 10
          attributes:
            description: "Order ID must be unique"
      - failed rows:
          name: customer_exists
          fail query: |
            SELECT o.order_id, o.customer_id
            FROM workspace.raw_orders AS o
            LEFT JOIN workspace.raw_customers AS c
              ON o.customer_id = c.customer_id
            WHERE c.customer_id IS NULL
          samples limit: 10
          attributes:
            description: "Every order must reference an existing customer"
```

</details>

<details>

<summary><strong><code>dq/raw_order_items.yml</code></strong></summary>

```yaml
checks:
  workspace.raw_order_items:
    completeness:
      - missing_count(order_id) = 0:
          name: order_id_not_null
          attributes:
            description: "Order ID must not be null"
      - missing_count(item_id) = 0:
          name: item_id_not_null
          attributes:
            description: "Item ID must not be null"
      - missing_count(quantity) = 0:
          name: quantity_not_null
          attributes:
            description: "Quantity must not be null"
      - row_count > 0:
          name: has_rows
          attributes:
            description: "Order items seed must have at least one row"
    validity:
      - failed rows:
          name: positive_quantity
          fail query: |
            SELECT order_id, item_id, quantity
            FROM workspace.raw_order_items
            WHERE quantity <= 0
          samples limit: 10
          attributes:
            description: "Quantity must be greater than zero"
      - failed rows:
          name: non_negative_unit_price
          fail query: |
            SELECT order_id, item_id, unit_price
            FROM workspace.raw_order_items
            WHERE unit_price < 0
          samples limit: 10
          attributes:
            description: "Unit price must not be negative"
      - failed rows:
          name: unique_order_item
          fail query: |
            SELECT order_id, item_id, COUNT(*) AS duplicate_count
            FROM workspace.raw_order_items
            GROUP BY order_id, item_id
            HAVING COUNT(*) > 1
          samples limit: 10
          attributes:
            description: "Order ID and item ID combination must be unique"
      - failed rows:
          name: order_exists
          fail query: |
            SELECT i.order_id, i.item_id
            FROM workspace.raw_order_items AS i
            LEFT JOIN workspace.raw_orders AS o
              ON i.order_id = o.order_id
            WHERE o.order_id IS NULL
          samples limit: 10
          attributes:
            description: "Every order item must reference an existing order"
```

</details>

**Success:** Three YAML files under `dq/`. `vulcan plan` should list new checks without validation errors.
{% endstep %}

{% step %}

### Add semantic models

The semantic layer maps physical tables to business concepts. It powers the REST, GraphQL, and MySQL wire-protocol APIs automatically; you write no API code.

Semantic models define **dimensions**, **measures**, **segments**, and **relationships** (`joins`). Create **one file per entity** under `models/semantics/`.

<details>

<summary><strong>Relationship map</strong></summary>

```
raw_customers  ←── raw_orders  ←── raw_order_items
   (1)              (many:1)         (many:1)
```

</details>

<details>

<summary><strong><code>semantics/raw_customers.yml</code> (dimension)</strong></summary>

```yaml
kind: semantic
name: raw_customers
depends_on: workspace.raw_customers
description: >
  Customer dimension for slicing orders and line items by identity and region.

dimensions:
  - customer_id
  - name
  - email
  - region

measures:
  - name: customer_count
    type: count_distinct
    expression: "{raw_customers.customer_id}"
    description: Distinct customers in the slice.

segments:
  - name: us_west
    expression: "{raw_customers.region} = 'US-West'"
    description: Customers in the US-West region.
  - name: us_east
    expression: "{raw_customers.region} = 'US-East'"
    description: Customers in the US-East region.
  - name: eu_west
    expression: "{raw_customers.region} = 'EU-West'"
    description: Customers in the EU-West region.
  - name: apac
    expression: "{raw_customers.region} = 'APAC'"
    description: Customers in the APAC region.
```

</details>

<details>

<summary><strong><code>semantics/raw_orders.yml</code> (order headers)</strong></summary>

```yaml
kind: semantic
name: raw_orders
depends_on: workspace.raw_orders
description: >
  Order header fact for order volume and order-date analysis by customer.

dimensions:
  - order_id
  - customer_id
  - order_date

joins:
  - name: raw_customers
    type: many_to_one
    expression: "{raw_orders.customer_id} = {raw_customers.customer_id}"

measures:
  - name: total_orders
    type: count_distinct
    expression: "{raw_orders.order_id}"
    description: Distinct orders in the slice.

  - name: total_customers_with_orders
    type: count_distinct
    expression: "{raw_orders.customer_id}"
    description: Distinct customers who placed at least one order in the slice.

segments:
  - name: january_2025
    expression: "{raw_orders.order_date} >= DATE '2025-01-01' AND {raw_orders.order_date} < DATE '2025-02-01'"
    description: Orders placed in January 2025.
  - name: mid_month_orders
    expression: "{raw_orders.order_date} >= DATE '2025-01-10' AND {raw_orders.order_date} < DATE '2025-01-20'"
    description: Orders placed between Jan 10 and Jan 19, 2025.
```

</details>

<details>

<summary><strong><code>semantics/raw_order_items.yml</code> (line-level fact)</strong></summary>

```yaml
kind: semantic
name: raw_order_items
depends_on: workspace.raw_order_items
description: >
  Order line fact for revenue, quantity, and product mix analysis.

dimensions:
  - order_id
  - item_id
  - product_id
  - quantity
  - unit_price

joins:
  - name: raw_orders
    type: many_to_one
    expression: "{raw_order_items.order_id} = {raw_orders.order_id}"
  - name: raw_customers
    type: many_to_one
    expression: "{raw_orders.customer_id} = {raw_customers.customer_id}"

measures:
  - name: line_count
    type: count
    expression: "{raw_order_items.item_id}"
    description: Number of order lines in the slice.

  - name: total_quantity
    type: sum
    expression: "{raw_order_items.quantity}"
    description: Total units sold across order lines.

  - name: total_line_revenue
    type: sum
    expression: "{raw_order_items.quantity} * {raw_order_items.unit_price}"
    description: Gross line revenue (quantity times unit price).

  - name: avg_unit_price
    type: avg
    expression: "{raw_order_items.unit_price}"
    description: Average unit price across lines.

  - name: distinct_products
    type: count_distinct
    expression: "{raw_order_items.product_id}"
    description: Distinct products sold in the slice.

  - name: distinct_orders
    type: count_distinct
    expression: "{raw_order_items.order_id}"
    description: Distinct orders with at least one line in the slice.

segments:
  - name: premium_lines
    expression: "{raw_order_items.unit_price} >= 50"
    description: Order lines priced at 50 or above.
  - name: bulk_quantity
    expression: "{raw_order_items.quantity} >= 3"
    description: Order lines with quantity of three or more.
  - name: product_p001
    expression: "{raw_order_items.product_id} = 'P001'"
    description: Lines for product P001.
```

</details>

**Success:** Three files under `semantics/`. `depends_on` must match model names (`workspace.raw_*`). Join `name` values must match other semantic `name` fields (`raw_customers`, `raw_orders`).
{% endstep %}

{% step %}

### Add a metric

Metrics combine a **measure**, a **time column** (`ts`), and **granularity**. Create:

<details>

<summary><strong><code>models/metrics/daily_revenue_by_region.yml</code></strong></summary>

```yaml
kind: metric
name: daily_revenue_by_region
measure: raw_order_items.total_line_revenue
ts: raw_orders.order_date
granularity: day

dimensions:
  - raw_customers.region

description: Daily gross revenue from order lines, broken out by customer region.
```

</details>

**Rules:**

* `measure` comes from semantic `raw_order_items.total_line_revenue`
* `ts` is `raw_orders.order_date` (must differ from the measure field)
* `granularity: day` buckets revenue by calendar day
* `dimensions` uses joined semantic `raw_customers.region`

**Success:** File exists under `metrics/`. `vulcan plan` validates the metric against semantics.
{% endstep %}
{% endstepper %}

## Part C: Test and validate

{% stepper %}
{% step %}

### Run your plan

```bash
vulcan plan
```

Vulcan validates your models and computes what needs to be materialized in the source engine.

**Review the plan.** You should see:

| Category      | Added                                                                                                           |
| ------------- | --------------------------------------------------------------------------------------------------------------- |
| **Models**    | `workspace.raw_customers`, `workspace.raw_orders`, `workspace.raw_order_items , workspace.order_lines_enriched` |
| **Checks**    | Completeness and validity for each model                                                                        |
| **Semantics** | `raw_customers`, `raw_orders`, `raw_order_items`                                                                |
| **Metrics**   | `daily_revenue_by_region`                                                                                       |

When Vulcan asks whether to apply the virtual update, type `y`:

```
Apply - Virtual Update [y/n]:
```

**Success:** Plan finishes with **no** `Error:` line.

| Error                                           | Fix                                                     |
| ----------------------------------------------- | ------------------------------------------------------- |
| `could not translate host name "statestore"`    | Repeat Steps 2–3                                        |
| `Start date ... can't be greater than end date` | Set `start: 2025-01-01` in `config.yaml`                |
| Missing CSV / wrong path                        | Check `path '../seeds/...'` in SEED models              |
| Semantic / metric validation error              | Fix `depends_on`, join names, or `{model.column}` typos |

**You do not need `vulcan run`** for seed-only models.

Confirm deployment:

```bash
vulcan plan
```

No pending model changes and no errors indicate that the models have been successfully materialized in the source engine.
{% endstep %}

{% step %}

### Run data-quality checks

```bash
vulcan audit
```

**Success:** All checks on `workspace.raw_customers`, `workspace.raw_orders`, and `workspace.raw_order_items` pass.

**If a check fails:** Read the sample rows and fix `seeds/` or the `fail query`, then repeat the step.
{% endstep %}
{% endstepper %}

***

### Project layout

<details>

<summary><strong>Project layout</strong></summary>

```
my-first-dp/
├── config.yaml
├── seeds/
│   ├── raw_customers.csv
│   ├── raw_orders.csv
│   └── raw_order_items.csv
├── models/
│   ├── raw_customers.sql
│   ├── raw_orders.sql
│   └── raw_order_items.sql
│   └── order_lines_enriched.sql   # VIEW
├── dq/
│   ├── raw_customers.yml
│   ├── raw_orders.yml
│   └── raw_order_items.yml
├── semantics/
│   ├── raw_customers.yml
│   ├── raw_orders.yml
│   └── raw_order_items.yml
├── metrics/
│   └── daily_revenue_by_region.yml
└── BUILD_DATA_PRODUCT.md
```

</details>

***

### Next Steps

* **Deployment:** You have your models ready and tested. Refer to the deployment steps to move from local Docker to a DataOS instance.


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter, and the optional `goal` query parameter:

```
GET https://v2.dataos.info/build/get-started/build-your-first-data-product.md?ask=<question>&goal=<endgoal>
```

`ask` is the immediate question: it should be specific, self-contained, and written in natural language.
`goal` is optional and describes the broader end goal you are ultimately trying to accomplish on behalf of the user. GitBook uses it to tailor the answer towards what is most useful for that goal.

The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
