> For the complete documentation index, see [llms.txt](https://v2.dataos.info/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://v2.dataos.info/build/get-started/build-journey-with-ai.md).

# Build Journey with AI

Building a data product normally means reading docs, writing SQL by hand, figuring out YAML syntax, and keeping dozens of files consistent. The Data Product Model Context Protocol (MCP) turns this into a conversation. You describe what you want. The AI writes the files, catches mistakes, and explains every step.

This page walks through that conversation end to end, using the `orders-analytics` data product as the example: a governed e-commerce analytics product for a sales operations team on PostgreSQL.

{% hint style="info" %}
The chat examples below are illustrative. They show how the MCP can guide the workflow, not a fixed script. Real conversations vary based on the user's context, requirements, and selected model. AI chats are non-deterministic, so the wording, sequence, and depth can differ from one run to another.
{% endhint %}

***

## Prerequisites

Setting up the build journey has two parts: **scaffold the skills, docs, and CLI wheel into your project**, then **connect the Data Product MCP** to your AI agent or tool.

You need:

* Your DataOS instance URL, such as `pacific-051426.dataos.cloud`.
* A DataOS API token. Generate one from **Home → Generate API tokens** in your DataOS instance.
* Node.js ≥ 16 (to run the scaffolding command below).
* Vulcan CLI installed in a **Python 3.10** virtual environment, from the wheel the scaffolding command provides (see Step 1).

### Step 1: Scaffold the skills, docs, and CLI wheel

Run the scaffolding command in your project directory, passing the engine you are building for:

```bash
npx dataproduct-builder-skills <engine>   # e.g. snowflake, postgres, bigquery
# or run it with no argument to pick the engine interactively
```

This drops everything the agent and CLI need into your project:

```
.cursor/skills/
  design-data-product/         ← guides the design conversation (Parts 1–3)
  build-data-product-workflow/ ← turns the spec into a deployed product (Parts 4–6)
docs/
  vulcan-book/                 ← Vulcan CLI & framework reference the agent reads
  vulcan-examples/<engine>/    ← real working examples for your engine
  dataos-philosophy/           ← DataOS core concepts
  vulcan-*.whl                 ← the Vulcan CLI wheel you install next
```

Then install the CLI **from the scaffolded wheel** into a Python 3.10 virtual environment (include your engine as an extra):

```bash
python3.10 -m venv .venv
source .venv/bin/activate
pip install "docs/vulcan-*.whl[<engine>]"   # e.g. docs/vulcan-*.whl[snowflake]
vulcan --version
```

{% hint style="warning" %}
**Python 3.10 is required.** The wheel uses 3.10-only syntax and will crash on 3.9 with `TypeError: unsupported operand type(s) for |`. For full environment setup (Docker, engine connection options), see [LDK Setup](/build/readme/ldk-setup.md).
{% endhint %}

### Step 2: Connect the Data Product MCP

{% tabs %}
{% tab title="Cursor (this guide)" %}
This walkthrough uses Cursor. Follow these steps to add the Data Product MCP.

{% stepper %}
{% step %}

#### Open the MCP configuration

From the Data Product MCP page in DataOS, select **Cursor**. The modal shows a JSON configuration block. Replace `<API_TOKEN>` with your DataOS API token and `<instance-url>` with your DataOS instance URL.

```json
{
  "mcpServers": {
    "DataProduct-MCP": {
      "url": "https://<instance-url>/dataproduct-mcp/api/v1",
      "headers": {
        "apikey": "<API_TOKEN>"
      }
    }
  }
}
```

{% endstep %}

{% step %}

#### Open Cursor from DataOS

Click **Open in Cursor**. Approve the browser prompt to continue the MCP setup in Cursor.
{% endstep %}

{% step %}

#### Install the MCP server

Cursor opens **Tools & MCPs** with the server fields populated. Confirm the following values, then click **Install**.

| Field         | Value                                           |
| ------------- | ----------------------------------------------- |
| Name          | `DataProduct-MCP`                               |
| Type          | `streamableHttp`                                |
| URL           | `https://<instance-url>/dataproduct-mcp/api/v1` |
| Header key    | `apikey`                                        |
| Header value  | Your DataOS API token                           |
| {% endstep %} |                                                 |

{% step %}

#### Enable the server

Under **Installed MCP Servers**, turn on **DataProduct-MCP**.
{% endstep %}
{% endstepper %}

**Verify:** Open a new Cursor chat and ask `What data products are available to consume?`. If the connection works, Cursor returns Data Products from your DataOS instance.

For the complete Cursor setup guide with screenshots, see [Connect Cursor](/consume/consume-with-ai/connect-clients/cursor.md).
{% endtab %}

{% tab title="Claude" %}
Follow the Claude-specific setup guide: [Connect Claude](/consume/consume-with-ai/connect-clients/claude.md).
{% endtab %}

{% tab title="Copilot in VS Code" %}
Follow the Copilot in VS Code setup guide: Connect [Copilot in VS Code.](/consume/consume-with-ai/connect-clients/copilot-vs-code.md)
{% endtab %}

{% tab title="Codex Desktop" %}
Follow the Codex Desktop setup guide: [Connect Codex Desktop](/consume/consume-with-ai/connect-clients/codex-desktop.md).
{% endtab %}

{% tab title="Other frameworks" %}
For agentic frameworks such as LangChain, Vercel AI SDK, and AgentScope, see [Connect an agentic framework.](/consume/consume-with-ai/connect-clients/custom.md)
{% endtab %}
{% endtabs %}

***

## What you will build

`orders-analytics` is a Vulcan data product that turns raw order, customer, product, shipment, warehouse, supplier, and region data into trusted business metrics.

|                  |                                                                                                      |
| ---------------- | ---------------------------------------------------------------------------------------------------- |
| **Engine**       | PostgreSQL                                                                                           |
| **Architecture** | Medallion: Bronze, Silver, Gold, Semantic Layer, Metrics                                             |
| **Metrics**      | Daily revenue, weekly revenue, customer lifetime value, RFM value by segment, fulfillment conversion |
| **Total files**  | 43 governed files across SQL models, semantic YAMLs, DQ rules, assertions, and tests                 |

## The journey

| Phase    | Goal                     | What the MCP does                                   | What you provide                                          |
| -------- | ------------------------ | --------------------------------------------------- | --------------------------------------------------------- |
| Discover | Find what already exists | Searches the catalog and table index                | A topic or domain                                         |
| Design   | Frame the problem        | Asks business questions; works right to left        | Pain points, consumers, key questions                     |
| Plan     | Lock the spec            | 3 batches of Q\&A build `data-product-plan.md`      | Answers about entities, joins, measures, grain, freshness |
| Build    | Generate the files       | Writes all SQL, YAML, DQ, assertion, and test files | Confirmation to proceed                                   |
| Validate | Confirm it works         | Guides you through vulcan commands                  | Review of command results                                 |

***

## Part 1: Discover

Before building anything, check what already exists in the data product catalog. You might find something to reference directly rather than rebuild. Even if you don't, you get a clear picture of the raw tables available before committing to any design.

The MCP searches the data product catalog and the table index.

### Step 1: search the data product catalog

> **You:**\
> Before I start building, can you check if there's anything already in the data product catalog related to orders or e-commerce analytics? No point rebuilding something that already exists.

> **Agent:**\
> Searched the catalog for "orders", "e-commerce", and "sales analytics".\
> No existing governed data products found matching this domain. The catalog is clear. Moving on to find the raw source tables.

{% hint style="info" %}
If the catalog returns a match: say, a `customer-intelligence` data product that already exposes customer segments: evaluate its metrics, quality status, and lineage before deciding. If it covers your need as-is, reference it as `EXTERNAL`. If you need to modify it, build your own model on top.
{% endhint %}

### Step 2: find the raw source tables

The MCP searches the table index for your data source. This confirms what columns exist before you commit to any model design.

{% hint style="warning" %}
Data Discovery Disclaimer: Table search and profiling are experimental features. Results may be incomplete or reflect stale data. Treat all table recommendations and profile statistics as directional signals and confirm before proceeding.
{% endhint %}

> **Agent:**\
> Found 9 raw tables in your PostgreSQL source schema `public`:
>
> All tables are owned by the raw source system.

| Table             | Grain               | Key Columns                                                        | Purpose                |
| ----------------- | ------------------- | ------------------------------------------------------------------ | ---------------------- |
| orders\_ext       | order\_id           | order\_id, customer\_id, order\_date, warehouse\_id, order\_status | Order transactions     |
| order\_items\_ext | order\_id, item\_id | order\_id, product\_id, quantity, unit\_price                      | Line-item sales detail |
| customers\_ext    | customer\_id        | customer\_id, region\_id, name, email, signup\_date                | Customer master        |
| products\_ext     | product\_id         | product\_id, supplier\_id, name, category, price                   | Product catalog        |
| regions\_ext      | region\_id          | region\_id, region\_name                                           | Geography              |
| shipments\_ext    | shipment\_id        | shipment\_id, order\_id, shipped\_date, carrier                    | Fulfillment events     |
| warehouses\_ext   | warehouse\_id       | warehouse\_id, region\_id, name                                    | Warehouse locations    |
| suppliers\_ext    | supplier\_id        | supplier\_id, region\_id, name                                     | Supplier master        |
| dim\_dates\_ext   | dt                  | dt, year, month, day\_of\_week                                     | Calendar dimension     |

> **You:**\
> Those all look right. We also maintain one seed file ourselves: `seeds/order_status_lookup.csv`. It maps raw status codes to business labels and flags which ones count as revenue-bearing.

Plan: everything in `*_ext` stays `EXTERNAL`. The status CSV is a `SEED`. Everything we build on top is owned by us.

### Discovery outcome

| Source                    | Owned By          | Vulcan Kind | Used For                           |
| ------------------------- | ----------------- | ----------- | ---------------------------------- |
| orders\_ext               | Raw source system | EXTERNAL    | Orders Bronze model                |
| order\_items\_ext         | Raw source system | EXTERNAL    | Order Items Bronze model           |
| customers\_ext            | Raw source system | EXTERNAL    | Customers Bronze model             |
| products\_ext             | Raw source system | EXTERNAL    | Products Bronze model              |
| regions\_ext              | Raw source system | EXTERNAL    | Regions Bronze model               |
| shipments\_ext            | Raw source system | EXTERNAL    | Shipments Bronze model             |
| warehouses\_ext           | Raw source system | EXTERNAL    | Warehouses Bronze model            |
| suppliers\_ext            | Raw source system | EXTERNAL    | Suppliers Bronze model             |
| dim\_dates\_ext           | Raw source system | EXTERNAL    | Dates Bronze model                 |
| order\_status\_lookup.csv | This data product | SEED        | Static order status classification |

***

## Part 2: Design

### The right-to-left approach

The MCP starts at the business question and works backwards to the data. Traditional design goes from "here are my tables, what can I build?" The MCP goes from "what decision needs to be made?" to "which source columns are required?"

```
Business Question
  -> Required Metric
    -> Required Measures + Dimensions
      -> Required Grain
        -> Required Joins
          -> Required Source Columns
```

Every model, every join, and every column ends up in the design because it was required by a specific business question.

### Opening the conversation

> **You:**\
> I want to build a data product.

> **Agent:**\
> Great. Let's figure out exactly what to build. I'll start with a few business questions before we touch any tables or schemas.
>
> First: what problem are you trying to solve? What's broken or missing today that this data product is supposed to fix?

### Q1: pain points and gaps

> **Agent asks:**\
> What pain points or gaps exist today that this data product is meant to address?

> **You:**\
> Honestly, every team is running their own numbers and they never match. Sales has one revenue figure, product has another. We have no agreed-on way to figure out which customers are slipping away. Everyone's just doing it differently in their own spreadsheets. And if someone wants to know where orders are dropping off between placement and shipment by region, they have to join five tables by hand every single time. It's slow, it's inconsistent, and nobody really trusts the output.

### Q2: use case statement

> **Agent asks:**\
> In one sentence, what does this data product do and what decisions does it support?

> **You:**\
> It takes our order, customer, product, and shipment data and gives us one trusted place to measure revenue, see which products are moving, understand customer behavior, and track how well we're fulfilling orders. So teams stop rebuilding the same joins every week and actually trust the numbers they use to make decisions.

### Q3: consumers

> **Agent asks:**\
> Who are the primary consumers: which teams, roles, or tools will use it, and what will they do with it?

> **You:**\
> Sales ops for daily and weekly revenue by region. The product team for category and product performance: best sellers, slow movers. Marketing for customer segments: who's a champion, who's at risk, who should get a win-back campaign. Fulfillment ops for order-to-shipment rate. And we want our AI assistant to be able to answer plain-English questions against this without someone writing SQL every time.

### Q4: key questions and metrics

> **Agent asks:**\
> What are the key questions this data product should answer, and what numbers or trends matter most?

> **You:**\
> Which regions are driving the most revenue this week? Are Electronics outperforming Home Goods? Who are our top customers and who's about to churn? What percentage of orders actually get shipped in each region? How's average order value trending? And for the retention team: which customers are Champions, which are At Risk, and what's the recommended action for each segment?

{% hint style="success" %}
After Q1 to Q4, the MCP creates `data-product-plan.md` with Section 1 (Business Context) filled in. Sections 2 to 16 are marked "To be defined." This file grows with every subsequent batch and becomes the confirmed spec the AI reads before writing any file.
{% endhint %}

***

## Part 3: Plan

With the business context confirmed, the MCP moves into structured planning. Three batches of questions lock the full spec before any file is generated.

### Batch 1: data and structure

#### Q5: engine and data source

> **Agent asks:**\
> Where is your data located? And which engine does this data product use?

> **You:**\
> It's all in PostgreSQL. We're using hosted serverless Postgres. The tables are already in the public schema as external tables, which we confirmed in Discover. Engine is Postgres.

#### Entity inference

> **Agent:**\
> Based on what you've described, the core entities seem to be: **Orders, Order Items, Customers, Products, Regions, Shipments, Warehouses,** and **Suppliers**. Does that sound right? Anything to add or remove?

> **You:**\
> Yes, that's right. Add two more: a **Dates** dimension table for calendar attributes, and an **Order Status** lookup. We have a seed CSV for that. It maps raw status codes to labels and marks which ones are revenue-bearing.

#### Table and model kind classification

> **Agent:**\
> Matching confirmed entities to tables from Discover. Here is my classification for each source:
>
> Does this look right? Any ownership or kind to change?

| Source                                               | Owned By          | Kind     | Reason                                                       |
| ---------------------------------------------------- | ----------------- | -------- | ------------------------------------------------------------ |
| orders\_ext, customers\_ext ... (all 9 ext tables)   | Raw source system | EXTERNAL | Read-only; this DP does not own or transform them            |
| order\_status\_lookup.csv                            | This data product | SEED     | We maintain this lookup; load as a seed model                |
| bronze.\*, silver.\*, gold.\* (models we will build) | This data product | FULL     | We own the transformation; FULL rebuilds cleanly on each run |

> **You:**\
> That's correct. The raw tables aren't ours so EXTERNAL makes sense. The status CSV we own and maintain, so SEED is right. Everything we build on top is FULL.

#### Join recommendations

> **Agent:**\
> Based on the fetched column lists, here are the suggested joins:
>
> Do these joins look correct? Anything missing or wrong?

| Join                    | Left Table          | Right Table       | Key           |
| ----------------------- | ------------------- | ----------------- | ------------- |
| Orders to Customers     | bronze.orders       | bronze.customers  | customer\_id  |
| Customers to Regions    | bronze.customers    | bronze.regions    | region\_id    |
| Order Items to Orders   | bronze.order\_items | bronze.orders     | order\_id     |
| Order Items to Products | bronze.order\_items | bronze.products   | product\_id   |
| Products to Suppliers   | bronze.products     | bronze.suppliers  | supplier\_id  |
| Orders to Warehouses    | bronze.orders       | bronze.warehouses | warehouse\_id |
| Warehouses to Regions   | bronze.warehouses   | bronze.regions    | region\_id    |
| Shipments to Orders     | bronze.shipments    | bronze.orders     | order\_id     |

> **You:**\
> All correct. One important note for revenue: always compute it as `order_items.unit_price × quantity` at the line-item level. Don't use any pre-aggregated column from the orders table. The authoritative revenue number comes from the items, not the order header.

#### Q6: population filters

> **Agent asks:**\
> What business rules filter which records are included before or during the join? If none, say "none."

> **You:**\
> Exclude cancelled orders from everything: revenue, order volume, RFM history, funnel metrics. The filter is `order_status <> 'Cancelled'`. Apply it in every model that touches revenue or customer spend. For shipments: use a LEFT JOIN so orders with no shipment still appear in order counts with a shipment rate of zero. Don't use an INNER JOIN or you'll drop unshipped orders from the denominator.

#### Measures, dimensions, and metrics

> **Agent:**\
> Based on your key questions, here are the recommended measures, dimensions, and metrics:
>
> RFM scoring uses static thresholds, not percentiles:

| Score | Recency (days since last order) | Frequency (order count) | Monetary (total revenue) |
| ----- | ------------------------------- | ----------------------- | ------------------------ |
| 5     | 30 days or fewer                | 4 or more orders        | $1,000 or more           |
| 4     | 60 days or fewer                | 3 or more orders        | $750 or more             |
| 3     | 90 days or fewer                | 2 or more orders        | $500 or more             |
| 2     | 180 days or fewer               | 1 or more orders        | More than $0             |
| 1     | More than 180 days              | 0 orders                | $0                       |

| Measure                              | Formula                                               | Aggregation    | Filter                                        |
| ------------------------------------ | ----------------------------------------------------- | -------------- | --------------------------------------------- |
| total\_daily\_revenue                | SUM(quantity x unit\_price)                           | SUM            | Non-cancelled orders                          |
| total\_daily\_orders                 | COUNT(DISTINCT order\_id)                             | COUNT DISTINCT | Non-cancelled orders                          |
| average\_order\_value                | total\_revenue / order\_count                         | Derived ratio  | Safe divide, 0 when denominator is 0          |
| average\_shipment\_rate              | shipped\_orders / total\_orders                       | Derived ratio  | Shipped = distinct orders with a shipment row |
| avg\_customer\_lifetime\_value       | AVG(customer total\_revenue)                          | AVG            | Per customer by segment                       |
| total\_rfm\_value                    | SUM(monetary\_value)                                  | SUM            | By RFM segment                                |
| at\_risk\_count                      | COUNT DISTINCT customer\_id WHERE segment = 'At Risk' | COUNT DISTINCT | At Risk segment only                          |
| avg\_order\_to\_shipment\_conversion | AVG(order\_to\_shipment\_rate)                        | AVG            | By region                                     |

> **You:**\
> Those measures look right. Two clarifications: for shipment rate, count shipped **orders**, not shipment rows. One order can have multiple shipments and we don't want to double-count. And yes, RFM scoring is fixed thresholds. Those buckets are the business's agreed-upon definitions, not something computed dynamically.

#### Grain definition

> **Agent:**\
> Based on the confirmed measures and dimensions, here is what one row represents in each model:

| Model                            | Grain                                         |
| -------------------------------- | --------------------------------------------- |
| silver.fct\_daily\_sales         | One order\_date x region x customer x product |
| silver.fct\_weekly\_sales        | One ISO week x region x category              |
| silver.dim\_customer\_profile    | One customer, lifetime purchase snapshot      |
| silver.dim\_product\_profile     | One product, catalog performance snapshot     |
| gold.rfm\_customer\_segmentation | One customer, current RFM score and segment   |
| gold.sales\_funnel\_analysis     | One funnel\_date x region                     |

> **You:**\
> Yes, that's right for all of them. The daily sales grain at customer x product level gives us the slice-and-dice flexibility we need for both product and customer analysis. The funnel snapshot by region is fine for the demo; in production we'd want a date-partitioned version.

#### Modeling approach

> **Agent:**\
> Based on your data structure (multiple operational sources, shared join logic across several gold models, and layered cleaning), I recommend **Medallion Architecture**. A single star schema wouldn't cleanly support both RFM segmentation and funnel analysis sharing the same Bronze joins.
>
> Which would you like to go with, Medallion or Star Schema?

| Layer              | Purpose                                                                              |
| ------------------ | ------------------------------------------------------------------------------------ |
| Bronze             | Source-aligned models reading from EXTERNAL tables. No business logic.               |
| Silver             | Cleaned and joined fact tables and dimension profiles. Shared join logic lives here. |
| Gold               | Business aggregates: RFM scoring and sales funnel conversion.                        |
| Semantic + Metrics | Queryable surface for AI agents, dashboards, and analysts.                           |

> **You:**\
> Medallion. The Bronze and Silver layers are shared inputs for both the RFM and funnel gold models. Duplicating all that join logic in a flat star schema would be messy. Medallion is the right call.

***

### Batch 2: delivery and freshness

#### Q7: consumption pattern

> **Agent asks:**\
> Where will this data product be consumed: AI agents, BI dashboards, API, ad-hoc queries, or embedded in an app?

> **You:**\
> Mostly AI agents through the Data Product MCP. We want people to ask plain-English questions and get answers without writing SQL. We also have BI dashboards pulling from the semantic layer, and analysts doing ad-hoc queries through Vulcan directly.

#### Q8: refresh cadence

> **Agent asks:**\
> How often does the data need to be refreshed: real-time, hourly, daily, weekly?

> **You:**\
> Every 15 minutes for this demo, just to show the pipeline actually runs. In production I'd move most models to daily. Sales ops checks numbers in the morning; they don't need sub-hourly. We can revisit if fulfillment ops wants more frequent updates later.

#### Q9: backfill

> **Agent asks:**\
> Do you need historical data loaded from the start, and if so, how far back?

> **You:**\
> Yes, one year. Our data generator creates 365 days of synthetic order history. Project start date is `2025-01-01`. For a real production deployment we'd backfill from the earliest order in the source system.

#### Assumptions review

Before moving to Build, the MCP lists every assumption it made during planning.

> **Agent:**\
> Here are the assumptions I applied. Please confirm or correct each one:
>
> * Cancelled orders are excluded from all revenue, CLV, RFM, and funnel metrics (`order_status <> 'Cancelled'`)
> * RFM scoring uses fixed business-rule thresholds, not dynamic NTILE percentiles
> * Sales funnel snapshot uses the current registered customer base aggregated by region, not a historical registration event stream
> * Revenue is computed as `unit_price x quantity` from order\_items, not from any pre-aggregated column on orders
> * 15-minute refresh is for demonstration; not a production SLA

> **You:**\
> All correct. The one I'd flag for future production work is the funnel snapshot. A proper cohort-based funnel would need the customer registration date as the anchor, not a current snapshot. But for this demo, snapshot by region is exactly what we want. Proceed.

{% hint style="success" %}
After Batch 2, `data-product-plan.md` is fully populated across all 16 sections. The MCP reads this file before writing any model, semantic YAML, or DQ rule. Every generated file is grounded in what you confirmed here.
{% endhint %}

***

### The spec: data-product-plan.md

This file is created after the Design phase and updated after each planning batch. It is the confirmed spec the MCP reads before generating any file. Because a human reviews and approves every section, the generated SQL, YAML, and DQ rules are grounded in confirmed facts, not column names the AI invented or business rules it assumed.

The file grows in three stages:

| Stage                    | Sections filled                                                                            |
| ------------------------ | ------------------------------------------------------------------------------------------ |
| After Design (Q1 to Q4)  | Section 1: Business Context                                                                |
| After Batch 1 (Q5 to Q6) | Sections 2 to 13: Data sources, entities, joins, measures, grain, model architecture       |
| After Batch 2 (Q7 to Q9) | Sections 14 to 16: Consumption, freshness, quality rules, AI context, validation checklist |

<details>

<summary>View data-product-plan.md</summary>

```
# Data Product Plan: orders-analytics
## Status: Validated  |  Created: 2025-01-01

## 1. Business Context
- Problem:    Teams compute revenue differently; no shared RFM logic;
              analysts join tables manually every time
- Use Case:   One governed place for revenue, product performance,
              customer retention, and fulfillment metrics on PostgreSQL
- Consumers:  Sales ops, merchandising, marketing, fulfillment ops,
              AI agents (MCP), BI dashboards
- Key Questions: Revenue by region/category, product velocity,
              RFM segments and recommended actions, order-to-shipment rate

## 2. Data Sources  |  Engine: postgres
Source                    Kind      Key Columns
orders_ext                EXTERNAL  order_id, customer_id, order_date, order_status
order_items_ext           EXTERNAL  order_id, product_id, quantity, unit_price
customers_ext             EXTERNAL  customer_id, region_id, name, email, signup_date
products_ext              EXTERNAL  product_id, supplier_id, name, category, price
regions_ext               EXTERNAL  region_id, region_name
shipments_ext             EXTERNAL  shipment_id, order_id, shipped_date
warehouses_ext            EXTERNAL  warehouse_id, region_id, name
suppliers_ext             EXTERNAL  supplier_id, region_id, name
dim_dates_ext             EXTERNAL  dt, year, month, day_of_week
order_status_lookup.csv   SEED      status_code, status_label, is_revenue_bearing

## 8. Grain
fct_daily_sales      -> order_date x region_id x customer_id x product_id
fct_weekly_sales     -> week x region_id x category
dim_customer_profile -> customer_id
dim_product_profile  -> product_id
rfm                  -> customer_id
sales_funnel         -> funnel_date x region_id

## 10. Consumption & Freshness
- Pattern:  AI agents (MCP), BI dashboards, ad-hoc analyst queries
- Cadence:  */15 * * * * (demo); daily in production
- Backfill: 2025-01-01 (365 days of synthetic history)

## 13. Model Architecture
Layer     Model                             Kind
EXTERNAL  public.*_ext (9 tables)           EXTERNAL
SEED      order_status_lookup               SEED
Bronze    bronze.orders ... bronze.dim_dates  FULL (x10)
Silver    fct_daily_sales, fct_weekly_sales,
          dim_customer_profile, dim_product_profile  FULL (x4)
Gold      rfm_customer_segmentation,
          sales_funnel_analysis             FULL (x2)
Semantic  daily_sales, weekly_sales,
          customer_profile, product_profile,
          rfm, sales_funnel                 semantic (x6)
Metrics   daily_sales_performance,
          weekly_revenue_trends,
          customer_lifetime_value,
          rfm_value_by_segment,
          fulfillment_conversion            metric (x5)

## 14. YAML Contract
name: orders-analytics
version: 0.1.2
engine: postgres
goal: Trusted e-commerce analytics for sales, product, retention, fulfillment
consumers: [sales-ops, merchandising, marketing, fulfillment-ops, ai-agents]
freshness:
  cadence: "*/15 * * * *"
  backfill: "2025-01-01"
consumption:
  pattern: ai-agents, bi-dashboard, ad-hoc
```

</details>

***

## Project initialization

Before the MCP generates any file, initialize your Vulcan project in a local directory. This is a one-time step that creates the folder structure and a default `config.yaml`.

{% stepper %}
{% step %}

#### Check the Vulcan CLI is available

```bash
vulcan --version
```

If the command fails, the CLI is not installed. Install it from the wheel scaffolded in [Step 1](#step-1-scaffold-the-skills-docs-and-cli-wheel) (`pip install "docs/vulcan-*.whl[<engine>]"` inside your Python 3.10 venv), or see [LDK Setup](/build/readme/ldk-setup.md) for full environment setup.
{% endstep %}

{% step %}

#### Initialize the project

Run `vulcan init` with the engine name from the spec. For `orders-analytics`:

```bash
vulcan init postgres
```

This creates the full project scaffold: `config.yaml`, `models/`, `models/semantics/`, `dq/`, `tests/`, `audits/`, `use_cases.yaml`, and `limitations.yaml`.
{% endstep %}

{% step %}

#### Configure the connection

Open `config.yaml` and fill in the warehouse connection settings. For the PostgreSQL instance used in this example:

```yaml
host: <your-pg-host>
port: 5432
database: warehouse
username: <your-username>
password: <your-password>
```

{% hint style="warning" %}
Confirm your connection before the MCP starts generating files. A missing or wrong connection will cause `vulcan evaluate` to fail on the first model.
{% endhint %}
{% endstep %}
{% endstepper %}

{% hint style="success" %}
Once `vulcan --version` works and `config.yaml` is configured, tell the MCP to proceed. It reads the spec and generates the full 43-file blueprint.
{% endhint %}

***

## Part 4: Build

With the spec confirmed, the MCP generates the full file manifest, then writes each file layer by layer. Every file traces back to a confirmed decision in `data-product-plan.md`.

The full 43-file manifest is in the [Appendix](#appendix-full-file-inventory). Files are generated in dependency order: External declarations, Seeds, Bronze, Silver, Gold, Semantics, Metrics, Quality.

### The build loop

For every component group the MCP generates, it instructs you to run 2 commands before moving on. This catches errors early, before they compound across later files.

**After each SQL model is written to disk:**

```bash
vulcan evaluate <model_name> --limit 10
```

This runs the model against the dev database and returns up to 10 rows. Check that column types, row counts, and aggregation values look correct. For example, after the silver model is generated:

```bash
vulcan evaluate silver.fct_daily_sales --limit 10
```

**After each component group (Bronze, Silver, Gold, Semantic, Metrics, Quality):**

```bash
vulcan plan dev --auto-apply
```

This materializes all models in the dev environment and runs blocking assertions. If it fails, the MCP reads the error message, analyzes the issue, applies the fix, and instructs you to re-run.

{% hint style="warning" %}
Always use `--auto-apply`. Without it, `vulcan plan` blocks waiting for interactive input and the agent cannot proceed.
{% endhint %}

### Bronze layer

Bronze models read directly from EXTERNAL tables. No business logic: just column selection and type casting. Each model has a single grain and is tagged for DQ coverage.

<details>

<summary>Bronze model: bronze.orders</summary>

```sql
MODEL (
  name   bronze.orders,
  kind   FULL,
  cron   '*/15 * * * *',
  owner  'johndoetmdcio',
  grains [order_id],
  description 'Source-aligned orders from orders_ext.'
);

SELECT
  order_id,
  customer_id,
  order_date,
  warehouse_id,
  order_status
FROM "warehouse"."public"."orders_ext";
```

</details>

### Silver layer

Silver models join Bronze sources and apply the shared business logic. The `@revenue_order_filter` macro (which excludes cancelled orders) is applied here. Silver is where the Bronze joins live so Gold models don't repeat them.

<details>

<summary>Silver model: silver.fct_daily_sales</summary>

```sql
MODEL (
  name   silver.fct_daily_sales,
  kind   FULL,
  grains [order_date, region_id, customer_id, product_id],
  description 'Daily sales fact by date, region, customer, and product.'
);

WITH order_metrics AS (
  SELECT
    o.order_date::DATE         AS order_date,
    c.region_id,
    r.region_name,
    o.customer_id,
    oi.product_id,
    p.category,
    COUNT(DISTINCT o.order_id)        AS total_orders,
    SUM(oi.quantity)                  AS total_items_sold,
    SUM(oi.quantity * oi.unit_price)  AS total_revenue
  FROM bronze.orders      AS o
  JOIN bronze.customers   AS c  ON o.customer_id = c.customer_id
  JOIN bronze.regions     AS r  ON c.region_id   = r.region_id
  JOIN bronze.order_items AS oi ON o.order_id    = oi.order_id
  JOIN bronze.products    AS p  ON oi.product_id = p.product_id
  WHERE @revenue_order_filter(o.order_status)   -- resolves to: order_status <> 'Cancelled'
  GROUP BY 1, 2, 3, 4, 5, 6
),
shipment_metrics AS (
  -- counts distinct shipped orders per grain; separate CTE to avoid row duplication
  ...
)
SELECT
  om.*,
  @safe_ratio(om.total_revenue, om.total_orders, 2)                AS avg_order_value,
  COALESCE(sm.total_shipments, 0)                                  AS total_shipments,
  @safe_ratio(COALESCE(sm.total_shipments, 0), om.total_orders, 4) AS shipment_rate
FROM order_metrics AS om
LEFT JOIN shipment_metrics AS sm    -- LEFT JOIN keeps unshipped orders in the denominator
  ON  om.order_date  = sm.order_date
  AND om.region_id   = sm.region_id
  AND om.customer_id = sm.customer_id
  AND om.product_id  = sm.product_id;
```

</details>

### Gold layer

Gold models contain the complex business logic built on Silver. RFM scoring uses the fixed thresholds confirmed in Plan. Segment labels and recommended actions are CASE-statement rules.

<details>

<summary>Gold model: gold.rfm_customer_segmentation</summary>

```sql
MODEL (
  name   gold.rfm_customer_segmentation,
  kind   FULL,
  grains [customer_id],
  description 'RFM scores and recommended retention actions per customer.',
  assertions (
    unique_values(columns := (customer_id)),
    not_null(columns := (customer_id, customer_name, email, rfm_segment)),
    accepted_values(column := rfm_segment,
      is_in := ('Champions', 'Loyal Customers', 'Potential Loyalists',
                'Recent Customers', 'Promising', 'Customers Needing Attention',
                'At Risk', 'Hibernating', 'Lost')),
    rfm_score_consistency()
  )
);

WITH rfm_base AS (
  SELECT
    customer_id, customer_name, email, region_name,
    last_order_date,
    days_since_last_order  AS recency_days,
    total_orders           AS frequency_orders,
    total_revenue          AS monetary_value,
    CASE
      WHEN days_since_last_order <= 30  THEN 5
      WHEN days_since_last_order <= 60  THEN 4
      WHEN days_since_last_order <= 90  THEN 3
      WHEN days_since_last_order <= 180 THEN 2
      ELSE 1
    END AS recency_score,
    CASE
      WHEN total_orders >= 4 THEN 5
      WHEN total_orders >= 3 THEN 4
      WHEN total_orders >= 2 THEN 3
      WHEN total_orders >= 1 THEN 2
      ELSE 1
    END AS frequency_score,
    CASE
      WHEN total_revenue >= 1000 THEN 5
      WHEN total_revenue >= 750  THEN 4
      WHEN total_revenue >= 500  THEN 3
      WHEN total_revenue > 0     THEN 2
      ELSE 1
    END AS monetary_score
  FROM silver.dim_customer_profile
)
SELECT
  *,
  recency_score::TEXT || frequency_score::TEXT || monetary_score::TEXT AS rfm_score,
  CASE
    WHEN recency_score >= 4 AND frequency_score >= 4 AND monetary_score >= 4 THEN 'Champions'
    WHEN recency_score >= 3 AND frequency_score >= 4                         THEN 'Loyal Customers'
    WHEN recency_score >= 4 AND frequency_score >= 2                         THEN 'Potential Loyalists'
    WHEN recency_score >= 4                                                  THEN 'Recent Customers'
    WHEN recency_score >= 3 AND monetary_score >= 2                          THEN 'Promising'
    WHEN recency_score = 3                                                   THEN 'Customers Needing Attention'
    WHEN recency_score <= 2 AND frequency_score >= 3 AND monetary_score >= 3 THEN 'At Risk'
    WHEN recency_score <= 2 AND frequency_score <= 2                         THEN 'Hibernating'
    ELSE 'Lost'
  END AS rfm_segment,
  CASE
    WHEN recency_score >= 4 AND frequency_score >= 4 AND monetary_score >= 4 THEN 'Reward with VIP offers and early access'
    WHEN recency_score <= 2 AND frequency_score >= 3 AND monetary_score >= 3 THEN 'Run high-priority win-back campaign'
    WHEN recency_score <= 2 AND frequency_score <= 2                         THEN 'Use low-cost reactivation'
    -- recommended_action for all 9 segments follows same pattern
  END AS recommended_action
FROM rfm_base;
```

</details>

### Semantic layer

Semantic YAMLs expose Silver and Gold models as queryable objects. Each model declares measures, dimensions, segments, joins, and `ai_context` so AI agents and BI tools know how to interpret and query it.

<details>

<summary>Semantic model: daily_sales.yml</summary>

```yaml
kind: semantic
name: daily_sales
depends_on: silver.fct_daily_sales
description: Daily sales performance by customer, product, category, and region.

ai_context:
  instructions: >
    Daily sales semantic model for revenue, order volume, item volume,
    average order value, and shipment-rate analysis. One row = one
    order_date x region x customer x product combination.
  synonyms:
    - daily revenue
    - sales performance
    - order volume
    - shipment rate
  examples:
    - description: daily revenue by region
      query: |
        SELECT daily_sales.region_name,
               MEASURE(daily_sales.total_daily_revenue)
        FROM daily_sales GROUP BY 1;

dimensions:
  - order_date
  - region_name
  - customer_id
  - category

measures:
  - name: total_daily_revenue
    type: sum
    expression: "{daily_sales.total_revenue}"
    description: Total daily gross revenue (non-cancelled orders)

  - name: average_order_value
    type: avg
    expression: "{daily_sales.avg_order_value}"
    description: Average order value, revenue divided by order count

  - name: average_shipment_rate
    type: avg
    expression: "{daily_sales.shipment_rate}"
    description: Avg shipment rate, shipped orders divided by total orders

segments:
  - name: high_revenue_days
    expression: "{daily_sales.total_revenue} >= 500"
    description: Daily rows with total revenue at or above 500
```

</details>

### Metrics

Metric YAMLs bind a measure to a time dimension, creating a tracked business time series.

<details>

<summary>Metric: daily_sales_performance.yml</summary>

```yaml
kind: metric
name: daily_sales_performance
measure: daily_sales.total_daily_revenue
ts: daily_sales.order_date
granularity: day

dimensions:
  - name: region
    ref: daily_sales.region_name
  - name: category
    ref: daily_sales.category

segments:
  - daily_sales.high_revenue_days

description: Daily revenue trends by region and category.
tags:
  - sales
  - revenue
  - daily
```

</details>

### Quality: DQ rules, assertions, and tests

**DQ rules** (`dq/`) are non-blocking monitors. They run after each model and surface issues without stopping the run.

<details>

<summary>DQ rule: fct_daily_sales_dq.yml</summary>

```yaml
kind: dq
name: fct_daily_sales_dq
depends_on: silver.fct_daily_sales

rules:
  - name: no_null_order_date
    type: not_null
    column: order_date
    blocking: true

  - name: revenue_non_negative
    type: range
    column: total_revenue
    min: 0
    blocking: false

  - name: shipment_rate_in_bounds
    type: range
    column: shipment_rate
    min: 0
    max: 1
    blocking: false

  - name: freshness
    type: freshness
    column: order_date
    warn_after: 1d
    error_after: 2d
```

</details>

**Assertions** (`audits/`) are blocking SQL checks. If they fail, the run stops.

<details>

<summary>Assertion: daily_sales_metric_consistency.sql</summary>

```sql
-- audits/daily_sales_metric_consistency.sql
-- Asserts that computed revenue matches sum of quantity x unit_price at the grain.
-- Returns 0 rows on pass.
SELECT
  f.order_date,
  f.region_id,
  f.customer_id,
  f.product_id,
  f.total_revenue       AS model_revenue,
  source.expected_revenue
FROM silver.fct_daily_sales AS f
JOIN (
  SELECT
    o.order_date::DATE  AS order_date,
    c.region_id,
    o.customer_id,
    oi.product_id,
    SUM(oi.quantity * oi.unit_price) AS expected_revenue
  FROM bronze.order_items AS oi
  JOIN bronze.orders    AS o  ON oi.order_id   = o.order_id
  JOIN bronze.customers AS c  ON o.customer_id = c.customer_id
  WHERE o.order_status <> 'Cancelled'
  GROUP BY 1, 2, 3, 4
) AS source USING (order_date, region_id, customer_id, product_id)
WHERE ABS(f.total_revenue - source.expected_revenue) > 0.01;
```

</details>

**Unit tests** (`tests/`) validate specific model behavior against expected outputs.

<details>

<summary>Unit tests: test_rfm_customer_segmentation.yaml</summary>

```yaml
# tests/test_rfm_customer_segmentation.yaml
- input:
    customer_id: 1
    days_since_last_order: 25
    total_orders: 5
    total_revenue: 1200
  expected:
    recency_score: 5
    frequency_score: 5
    monetary_score: 5
    rfm_segment: Champions

- input:
    customer_id: 2
    days_since_last_order: 200
    total_orders: 4
    total_revenue: 800
  expected:
    recency_score: 1
    frequency_score: 5
    monetary_score: 4
    rfm_segment: At Risk
```

</details>

***

## Part 5: Validate

Run the standard Vulcan command sequence to confirm everything resolves and executes correctly.

### vulcan info

Resolves your project configuration, validates the model DAG, and lists all models. Run this first. Syntax errors in YAML or broken `depends_on` references show up here before any SQL runs.

<details>

<summary>vulcan info: sample output</summary>

```
Project: orders-analytics  v0.1.2
Engine:  postgres (dataos://postgresDepot)
Models:  43 resolved
  External (9):   orders_ext, customers_ext, products_ext,
                  order_items_ext, shipments_ext, regions_ext,
                  warehouses_ext, suppliers_ext, dim_dates_ext
  Seeds    (1):   order_status_lookup
  Bronze  (10):   orders, customers, products, order_items, shipments,
                  regions, warehouses, suppliers, dim_dates
  Silver   (4):   fct_daily_sales, fct_weekly_sales,
                  dim_customer_profile, dim_product_profile
  Gold     (2):   rfm_customer_segmentation, sales_funnel_analysis
  Semantic (6):   daily_sales, weekly_sales, customer_profile,
                  product_profile, rfm, sales_funnel
  Metrics  (5):   daily_sales_performance, weekly_revenue_trends,
                  customer_lifetime_value, rfm_value_by_segment,
                  fulfillment_conversion
Linter: 0 errors, 0 warnings
DAG:    No cycles detected
```

</details>

{% hint style="info" %}
What to look for: all 43 models resolved, linter at 0 errors, no cycles. If a model is missing, the YAML has a syntax error or `depends_on` points to a nonexistent model.
{% endhint %}

### vulcan plan dev --auto-apply

Executes the full model DAG in the dev environment, materializes all Bronze, Silver, and Gold tables in dependency order, and runs all blocking assertions. This is the full-project confirmation run: by this point each component group has already passed individually during the build loop.

```bash
vulcan plan dev --auto-apply
```

Confirm execution order is Bronze before Silver, Silver before Gold, then Assertions. EXTERNAL models should not appear in the plan since they are read-only references.

<details>

<summary>vulcan plan dev --auto-apply: sample output</summary>

```shellscript
[run] Starting orders-analytics

[bronze] orders ...................... OK  (312ms)
[bronze] customers ................... OK  (198ms)
[bronze] products .................... OK  (156ms)
[bronze] order_items ................. OK  (421ms)
[bronze] shipments ................... OK  (187ms)
[bronze] regions, warehouses, suppliers, dim_dates ... OK
[seed]   order_status_lookup ......... OK   (89ms)
[silver] fct_daily_sales ............. OK  (1.2s)
[silver] fct_weekly_sales ............ OK  (0.8s)
[silver] dim_customer_profile ........ OK  (0.7s)
[silver] dim_product_profile ......... OK  (0.5s)
[gold]   rfm_customer_segmentation ... OK  (1.1s)
[gold]   sales_funnel_analysis ....... OK  (0.9s)
[assert] unique_values(customer_id) on rfm ......... PASS
[assert] not_null on fct_daily_sales ............... PASS
[assert] rfm_score_consistency ..................... PASS
[assert] daily_sales_metric_consistency ............ PASS
[assert] order_status_lookup_consistency ........... PASS

[run] Completed in 8.4s  16 models, 5 assertions, 0 failures
```

</details>

### vulcan test

Executes unit-style model tests. These validate specific aggregation and RFM scoring behavior.

<details>

<summary>vulcan test: sample output</summary>

```shellscript
[test] test_fct_daily_sales
  ok  daily revenue = sum of quantity x unit_price
  ok  cancelled orders excluded from revenue
  ok  shipment_rate = shipped_orders / total_orders
  ok  shipment_rate is 0.0 when no shipments exist for the order

[test] test_rfm_customer_segmentation
  ok  recency_score = 5 when days_since_last_order = 25
  ok  recency_score = 2 when days_since_last_order = 150
  ok  rfm_segment = 'Champions'   for scores (5, 5, 5)
  ok  rfm_segment = 'At Risk'     for scores (2, 4, 4)
  ok  rfm_segment = 'Hibernating' for scores (1, 1, 1)
  ok  recommended_action populated for every segment

10 tests passed, 0 failed
```

</details>

### Final checklist

| Check                                  | Verified by                                        | Phase           |
| -------------------------------------- | -------------------------------------------------- | --------------- |
| Goal and consumers confirmed           | Section 1 of data-product-plan.md                  | Design          |
| Data sources accessible                | vulcan info, no connection errors                  | Validate        |
| Grain defined for all models           | Section 8 of spec, no UNKNOWN entries              | Plan            |
| Measures vs metrics distinction clear  | Semantic YAMLs and metric YAMLs match spec         | Build           |
| Joins documented and tested            | Section 4 of spec and vulcan run assertions        | Plan / Validate |
| Cancelled orders excluded from revenue | @revenue\_order\_filter in all Silver models       | Build           |
| RFM scoring uses confirmed thresholds  | Gold model matches Section 6 of spec               | Build           |
| All assertions pass                    | vulcan run, 0 assertion failures                   | Validate        |
| All unit tests pass                    | vulcan test, 0 failures                            | Validate        |
| DQ rules applied to analytics models   | One dq/\*.yml per Silver and Gold model            | Build           |
| AI context drafted and confirmed       | Section 15.5 of spec reflected in semantic YAMLs   | Build           |
| Spec complete across all 16 sections   | data-product-plan.md, no "To be defined" remaining | Plan            |

***

## Part 6: Deploy

With validation complete in dev, deploy the data product to production and register it with DataOS for scheduled runs.

### vulcan plan prod --auto-apply

Promotes the data product to the production environment.

```bash
vulcan plan prod --auto-apply
```

{% hint style="info" %}
**`FULL` model materialization**: `FULL` models do not materialize immediately during `vulcan plan prod --auto-apply`. The plan registers them and the actual table build happens on the first scheduled cron run. You will see `SKIP: No model batches to execute` for `FULL` models: this is expected, not an error.

To force an immediate full refresh without waiting for the cron schedule:

```bash
vulcan run prod --select bronze.orders silver.fct_daily_sales gold.rfm_customer_segmentation
```

For the initial historical backfill from the spec's start date (`2025-01-01`):

```bash
vulcan plan prod \
  --restate-model silver.fct_daily_sales \
  --start 2025-01-01 \
  --end 2025-12-31 \
  --auto-apply
```

Run a separate `--restate-model` command for each model that needs backfilling.
{% endhint %}

### Verify in production

After the prod plan succeeds, confirm the data is correct with a direct query:

```bash
vulcan fetchdf "SELECT * FROM silver.fct_daily_sales LIMIT 10"
```

Verify the semantic layer with the transpiler. This is the first point at which the semantic layer is fully exposed: run this check here, not during dev verification.

```bash
vulcan transpile --format sql "SELECT MEASURE(daily_sales.total_daily_revenue) FROM daily_sales"
```

A clean transpile output confirms the semantic model and measure names are wired correctly and the AI agent can query this data product.

### Register with DataOS

Generate the deployment manifest so DataOS can schedule and monitor the data product:

```bash
vulcan create_deploy_yaml
```

This creates `deploy.yaml` in the project root. Open the file and fill in the placeholders:

| Field                             | Value for orders-analytics                                           |
| --------------------------------- | -------------------------------------------------------------------- |
| `name`                            | `orders-analytics`                                                   |
| `spec.runAsUser`                  | Your DataOS username                                                 |
| `spec.engine`                     | `postgres`                                                           |
| `spec.repo.url`                   | Your git repository URL                                              |
| `spec.repo.syncFlags`             | `--ref=main`                                                         |
| `spec.repo.baseDir`               | Path to the project folder within the repo                           |
| `spec.depots`                     | `dataos://postgresDepot?purpose=rw`                                  |
| `spec.workflow.schedule.crons`    | `*/15 * * * *` (demo): change to `0 6 * * *` for daily in production |
| `spec.workflow.schedule.timezone` | Your timezone, e.g. `US/Pacific`                                     |

Apply the manifest to DataOS:

```bash
dataos-ctl resource apply -f deploy.yaml
```

{% hint style="success" %}
Once `dataos-ctl apply` succeeds, the data product is registered with DataOS. It refreshes on the cron schedule, publishes to the Data Product Hub, and is immediately queryable through the Data Product MCP.
{% endhint %}

***

## Appendix: full file inventory

<details>

<summary>View all 43 files</summary>

| File                                              | Layer     | Kind      | Grain / Spec Decision                                     |
| ------------------------------------------------- | --------- | --------- | --------------------------------------------------------- |
| external\_models.yaml                             | External  | EXTERNAL  | 9 raw tables, confirmed in Discover                       |
| seeds/order\_status\_lookup.csv                   | Seed      | SEED      | Maintained by this DP, confirmed as SEED in Plan          |
| models/seeds/order\_status\_lookup.sql            | Seed      | SEED      | Loads CSV into seed table                                 |
| models/bronze/orders.sql                          | Bronze    | FULL      | grain: order\_id                                          |
| models/bronze/customers.sql                       | Bronze    | FULL      | grain: customer\_id                                       |
| models/bronze/products.sql                        | Bronze    | FULL      | grain: product\_id                                        |
| models/bronze/order\_items.sql                    | Bronze    | FULL      | grain: order\_id, item\_id                                |
| models/bronze/shipments.sql                       | Bronze    | FULL      | grain: shipment\_id                                       |
| models/bronze/regions.sql                         | Bronze    | FULL      | grain: region\_id                                         |
| models/bronze/warehouses.sql                      | Bronze    | FULL      | grain: warehouse\_id                                      |
| models/bronze/suppliers.sql                       | Bronze    | FULL      | grain: supplier\_id                                       |
| models/bronze/dim\_dates.sql                      | Bronze    | FULL      | grain: dt                                                 |
| models/silver/fct\_daily\_sales.sql               | Silver    | FULL      | grain: order\_date x region x customer x product          |
| models/silver/fct\_weekly\_sales.sql              | Silver    | FULL      | grain: week x region x category                           |
| models/silver/dim\_customer\_profile.sql          | Silver    | FULL      | grain: customer\_id, lifetime snapshot                    |
| models/silver/dim\_product\_profile.sql           | Silver    | FULL      | grain: product\_id, catalog snapshot                      |
| models/gold/rfm\_customer\_segmentation.sql       | Gold      | FULL      | grain: customer\_id, static threshold RFM scoring         |
| models/gold/sales\_funnel\_analysis.sql           | Gold      | FULL      | grain: funnel\_date x region\_id                          |
| models/semantics/daily\_sales.yml                 | Semantic  | semantic  | Queryable daily sales with ai\_context                    |
| models/semantics/weekly\_sales.yml                | Semantic  | semantic  | Weekly revenue semantic model                             |
| models/semantics/customer\_profile.yml            | Semantic  | semantic  | CLV, segments, retention targeting                        |
| models/semantics/product\_profile.yml             | Semantic  | semantic  | Velocity, category mix, performance tiers                 |
| models/semantics/rfm\_customer\_segmentation.yml  | Semantic  | semantic  | RFM with customer\_profile join                           |
| models/semantics/sales\_funnel\_analysis.yml      | Semantic  | semantic  | Funnel with joins to daily\_sales                         |
| models/metrics/daily\_sales\_performance.yml      | Metric    | metric    | total\_daily\_revenue over order\_date                    |
| models/metrics/weekly\_revenue\_trends.yml        | Metric    | metric    | Weekly revenue over time                                  |
| models/metrics/customer\_lifetime\_value.yml      | Metric    | metric    | avg CLV by segment and region                             |
| models/metrics/rfm\_value\_by\_segment.yml        | Metric    | metric    | Monetary value by RFM segment                             |
| models/metrics/fulfillment\_conversion.yml        | Metric    | metric    | Order-to-shipment rate by region                          |
| dq/fct\_daily\_sales.yml                          | DQ        | dq        | Nulls, value ranges, freshness, non-blocking monitors     |
| dq/fct\_weekly\_sales.yml                         | DQ        | dq        | Weekly sales monitors                                     |
| dq/dim\_customer\_profile.yml                     | DQ        | dq        | Customer completeness checks                              |
| dq/dim\_product\_profile.yml                      | DQ        | dq        | Product profile checks                                    |
| dq/rfm\_customer\_segmentation.yml                | DQ        | dq        | Segment accepted values and counts                        |
| dq/referential\_integrity.yml                     | DQ        | dq        | Cross-table key consistency                               |
| assertions/daily\_sales\_metric\_consistency.sql  | Assertion | assertion | revenue = qty x price, blocking                           |
| assertions/rfm\_score\_consistency.sql            | Assertion | assertion | Scores in valid range, blocking                           |
| assertions/order\_status\_lookup\_consistency.sql | Assertion | assertion | Every status in lookup, blocking                          |
| macros/orders\_helpers.py                         | Macro     | macro     | safe\_ratio and revenue\_order\_filter macros             |
| tests/test\_fct\_daily\_sales.yaml                | Test      | test      | Aggregation, filter, and rate tests                       |
| tests/test\_rfm\_customer\_segmentation.yaml      | Test      | test      | RFM scoring boundary tests                                |
| config.yaml                                       | Config    | config    | Project metadata, linter rules, schedule, users, gateways |
| usage.yaml                                        | Config    | config    | Consumer guidance, caveats, references                    |
| linter/linters.py                                 | Config    | linter    | Custom lint rules: grain, owner, DQ coverage              |

</details>


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter, and the optional `goal` query parameter:

```
GET https://v2.dataos.info/build/get-started/build-journey-with-ai.md?ask=<question>&goal=<endgoal>
```

`ask` is the immediate question: it should be specific, self-contained, and written in natural language.
`goal` is optional and describes the broader end goal you are ultimately trying to accomplish on behalf of the user. GitBook uses it to tailor the answer towards what is most useful for that goal.

The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
