> For the complete documentation index, see [llms.txt](https://v2.dataos.info/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://v2.dataos.info/concepts/foundations/data-product.md).

# Data product

Enterprise AI systems and human analysts fail for the same reason: data lacks sufficient context to be used reliably.

An AI agent asked, "What was revenue growth last quarter?" cannot know which of three tables named something like "revenue" is the canonical one, whether the fiscal calendar matches its assumption, or whether the pipeline feeding the freshest-looking source failed four days ago. It reads, infers, and returns a confident answer that is wrong. A human analyst working the same warehouse faces the same problems, compensating with institutional knowledge and colleague conversations an agent cannot replicate.

The root cause is not a technology gap. It is a discipline gap. Data is stored because systems generate it, not designed because someone needs it. It is queryable without being understandable. Three teams define the same metric three different ways, and no one is accountable for the disagreement.

A **Data Product** is the structural answer.

"*A Data Product is a self-contained, managed unit of data treated like a product, with named owners, defined consumers (human and machine), service-level agreements, and a deliberately managed lifecycle. It is NOT a dashboard, a pipeline, or a raw table.*"

It is data plus the discipline that makes data reliably usable for analysts and agents alike.

A Data Product carries a contract describing what it guarantees, and it has a named owner accountable for keeping those guarantees. Everything a human analyst would need to trust and use the data, including definitions, quality signals, access rules, and lineage, is embedded in the product itself rather than scattered across wikis, tickets, and tribal knowledge.

The shift from treating data as infrastructure to treating data as a product is the central idea behind modern data management, and it is the foundation on which DataOS is built.

## Core properties

Seven properties define a Data Product. These are not optional features; they are the conditions that distinguish a governed data asset from an arbitrary table or pipeline output.

1. **Purpose-built:** A Data Product solves a specific analytical, operational, or AI need. It is not data in general. It exists to answer a bounded set of questions or support a defined set of decisions.
2. **Owned:** A named owner is accountable for its quality, evolution, and deprecation. Diffuse ownership, such as "the data team," is not ownership.
3. **Discoverable and addressable:** It is findable in a catalog with clear descriptions, tags, and business terms. It has a stable, unique identifier that allows it to be referenced and accessed consistently by applications, tools, and agents without requiring human assistance.
4. **Trustworthy:** Quality is not assumed; it is continuously monitored. Published, machine-readable guarantees on freshness, accuracy, completeness, and availability let downstream jobs and AI agents gate on them programmatically rather than discovering problems after the fact.
5. **Self-describing:** Schema, semantics, lineage, and usage examples travel with the data. Both people and language models can reason about the data correctly because context is carried by the asset itself, not scattered elsewhere.
6. **Natively governed:** Access control, compliance rules, PII handling, and data classification are built in from the start. Governance is not added after problems arise; it is a property of how the product is created and maintained.
7. **Interoperable and versioned:** A Data Product uses standard formats and shared identifiers so it composes cleanly with other products. Schema and semantic changes are managed like API changes: versioned, deprecated with notice, and never silently broken.

These properties are what separate a Data Product from data that is merely stored or moved.
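As a rough illustration of what "published, machine-readable guarantees" could look like, the sketch below models a contract as a small Python dataclass and lets a downstream job gate on freshness. The class, field, and function names (`DataContract`, `max_staleness`, `is_fresh`) and the example product are hypothetical; this is not a DataOS schema or API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class DataContract:
    """Hypothetical, minimal contract descriptor (illustration only)."""
    product_id: str           # stable, unique identifier
    owner: str                # a named owner, not "the data team"
    version: str              # contract version, managed like an API version
    max_staleness: timedelta  # published freshness guarantee


def is_fresh(contract, last_refreshed, now):
    """Return True if the last refresh falls within the freshness guarantee."""
    return now - last_refreshed <= contract.max_staleness


contract = DataContract(
    product_id="sales.revenue_quarterly",  # illustrative name
    owner="jane.doe@example.com",          # illustrative owner
    version="1.2.0",
    max_staleness=timedelta(hours=24),
)

now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
# Refreshed six hours ago: within the guarantee.
fresh = is_fresh(contract, datetime(2024, 1, 10, 6, 0, tzinfo=timezone.utc), now)
# Refreshed four days ago: a consumer should refuse to answer from this data.
stale = is_fresh(contract, datetime(2024, 1, 6, 6, 0, tzinfo=timezone.utc), now)
```

An agent or downstream job that checks `is_fresh` before querying would decline to use the four-day-old source instead of returning a confident wrong answer.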

## What a Data Product is not

* A raw table loaded into a warehouse
* A one-off extract or a dashboard
* A pipeline (the pipeline is plumbing; the product is the output plus its contract)
* A virtual view over sources the producing team does not control
* A vector index built on undefined source data

A dashboard is a view of a product, not the product itself. A pipeline is a mechanism; without a contract and ownership of the output, it is infrastructure built for nobody.

## Why Data Products are necessary

Most organizations that invest heavily in data infrastructure still struggle to get reliable value from their data. The problem is not a shortage of technology. It is a lack of product discipline applied to data. The failures follow a consistent pattern.

**Discovery failure:** Finding the right data requires institutional knowledge: which team built which pipeline, which table holds the current version, who to ask for access. Without discoverable, self-describing products, every new consumer starts the same search from scratch.

**Trust failure:** Business users receive reports and ask whether the numbers can be trusted. Stale data, undocumented transformations, and inconsistent definitions across teams erode confidence. When quality is measurable, monitored, and SLA-bound, trust is built on evidence rather than assumption.

**Reusability failure:** When a new use case appears, teams rebuild pipelines from scratch even when the data they need already exists. A properly built Data Product serves multiple consumers and use cases without rework because reusability is a design principle, not an accident.

**Accountability failure:** When data is wrong, responsibility is diffuse. The pipeline team blames the source system, the source system team blames the business requirements, and the business blames IT. A Data Product has a named owner who is accountable for its quality, availability, and ongoing value.

**Alignment failure:** Data teams build what they think the business needs rather than what it actually needs. A Data Product starts with a business outcome and works backward to determine what data, transformations, and quality rules are required. The business question drives the design.

The table below summarizes the contrast between the two approaches.

|                    | Data as Infrastructure        | Data as Product                                            |
| ------------------ | ----------------------------- | ---------------------------------------------------------- |
| Mindset            | Store it; someone will use it | Design it for a specific consumer and outcome              |
| Ownership          | IT or data engineering team   | Named product owner accountable for value                  |
| Quality            | Checked ad hoc or reactively  | Built in, continuously monitored, SLA-bound                |
| Documentation      | Sparse, often outdated        | Self-describing with rich metadata, schemas, and contracts |
| Governance         | Bolted on after the fact      | Native, enforced from creation                             |
| Lifecycle          | Build and forget              | Versioned, iterated, monitored, eventually retired         |
| Measure of success | Pipeline ran successfully     | Business decision was improved                             |

Closing that gap requires building differently from the start.

## Design principles

Seven principles guide how a Data Product is built. Each applies software engineering discipline to data: specifications before code, API-style versioning, tests that block bad releases, deliberate design, and measurable outcomes.

**1. Start with a product spec, not a pipeline:** Every Data Product begins with a written spec: who consumes it, what decisions it enables, the questions it must answer, its SLAs, and what is explicitly out of scope. No spec means no product. This discipline prevents the sprawl and rework that plague data teams.

**2. Model the domain deliberately:** Data modeling is the core craft, not an afterthought. Entities, grains, relationships, and conformed dimensions must be designed, not inherited from whatever shape the source system produced. A good model makes the product intuitive to query, cheap to evolve, and safe to compose. A bad model is a tax every consumer pays indefinitely.

**3. Contract-first, implementation-second:** Schema, semantics, and SLAs are declared and published before the pipeline is built. The contract is the product; the pipeline is swappable plumbing. Consumers bind to the logical interface, never to physical storage paths.

**4. Fit-for-purpose scope with one clear owner:** A Data Product answers a bounded set of questions well, not every possible question poorly. One team owns it end-to-end: definition, quality, evolution, and deprecation. Shared ownership means no ownership; unbounded scope means no SLA can hold.

**5. Quality is in-band, not after the fact:** Validation runs as part of the build and gates publication. Bad data is blocked, not alerted on. Trust signals (freshness, completeness, lineage) are emitted as first-class, machine-readable outputs so downstream consumers can gate on them too.

**6. Versioned and backward-compatible by default:** Schema and semantic changes follow an API-style lifecycle: versioned, deprecated, and sunset. Consumers, including production models, are never surprised by silent breakage.

**7. Measured like a product:** Adoption, usage, cost-to-serve, and business outcomes are tracked. Products that drive decisions get invested in; products nobody uses get retired. Without this feedback loop, the team is building artifacts, not products.
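Principle 5 can be sketched as a publish step that runs validation first and refuses to release a failing batch. The checks, field names (`order_id`, `amount`), and the `PublicationBlocked` exception below are assumptions chosen for illustration, not a prescribed implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    name: str
    passed: bool
    detail: str


class PublicationBlocked(Exception):
    """Raised when a quality gate fails; nothing is published."""


def check_completeness(rows, required):
    """Pass only if every required field is present and not None in every row."""
    missing = sum(1 for row in rows for field in required if row.get(field) is None)
    return CheckResult("completeness", missing == 0, f"{missing} missing values")


def check_unique(rows, key):
    """Pass only if the key field has no duplicate values."""
    keys = [row.get(key) for row in rows]
    duplicates = len(keys) - len(set(keys))
    return CheckResult("unique_key", duplicates == 0, f"{duplicates} duplicate keys")


def publish(rows, published):
    """Run the gates, then append rows to `published` only if all of them pass."""
    results = [
        check_completeness(rows, ["order_id", "amount"]),
        check_unique(rows, "order_id"),
    ]
    failures = [r for r in results if not r.passed]
    if failures:
        raise PublicationBlocked("; ".join(f"{r.name}: {r.detail}" for r in failures))
    published.extend(rows)
    return results  # trust signals that downstream consumers could gate on


published = []
publish([{"order_id": 1, "amount": 10.0}, {"order_id": 2, "amount": 5.5}], published)

try:
    publish([{"order_id": 3, "amount": None}, {"order_id": 3, "amount": 2.0}], published)
    blocked = False
except PublicationBlocked:
    blocked = True  # the bad batch never reaches `published`
```

The design choice is that failure raises rather than logs: the bad batch is blocked, not alerted on.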

## The Data Product lifecycle

A Data Product is not a one-time deliverable. Like a software product, it has a lifecycle with defined phases, each with clear activities, outputs, and responsible roles.

### Ideation and scoping

The lifecycle begins with a business problem. Before any technical work starts, the team identifies what business question or decision the Data Product will support; who the target consumers are; what the expected business outcome is and how it will be measured; and whether the necessary data exists and is accessible. This phase produces a business problem statement, a target consumer profile, and a preliminary feasibility assessment.

### Design and definition

Once the business need is clear, the team translates it into a specification. This includes defining the product boundary (what is in scope and out of scope), identifying source systems, defining the data model and transformation logic, establishing data contracts with input and output schemas, defining quality rules and SLA thresholds, specifying governance requirements, and designing the semantic model. The result is a formal specification that serves as the blueprint for the build phase.

### Build

The build phase implements the specification. Teams establish connections to source systems, build ingestion and transformation pipelines, implement quality checks, construct the semantic model, register the product in the catalog with metadata and documentation, and publish it.

### Deploy

Deployment releases the Data Product to its intended consumers. The team validates the product against the acceptance criteria, conducts consumer acceptance testing, configures the access interfaces (BI connectors, APIs, SQL query endpoints), and grants access through governance policies.

### Operate and govern

Once in production, the product requires ongoing operations: monitoring pipeline health, freshness, and quality SLAs; tracking usage patterns; enforcing governance policies; responding to quality failures or pipeline breaks; and managing compute and storage costs.

### Evolve and iterate

A Data Product is version 1 when it first ships. Usage is monitored, consumer feedback is collected, and the product is continuously improved. New fields are added, quality rules are tightened, and access patterns are optimized. Each iteration is a versioned release with release notes communicated to consumers.
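One way to make "each iteration is a versioned release" mechanical is to compare the old and new schemas and derive the version bump. The sketch below assumes a semantic-versioning convention (major for breaking changes, minor for additive ones), which the text does not mandate; the function names and example columns are hypothetical.

```python
def classify_change(old, new):
    """Classify a schema change given {column: type} mappings."""
    removed = old.keys() - new.keys()
    retyped = {col for col in old.keys() & new.keys() if old[col] != new[col]}
    added = new.keys() - old.keys()
    if removed or retyped:
        return "major"  # breaking: existing consumers may fail
    if added:
        return "minor"  # additive: existing consumers are unaffected
    return "patch"      # no schema change


def bump(version, change):
    """Apply a major/minor/patch bump to a 'X.Y.Z' version string."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":
        return f"{major + 1}.0.0"
    if change == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"


v1 = {"order_id": "string", "amount": "decimal"}
v2 = {"order_id": "string", "amount": "decimal", "currency": "string"}  # adds a field
v3 = {"order_id": "string", "currency": "string"}                       # drops a field

next_after_add = bump("1.0.0", classify_change(v1, v2))
next_after_drop = bump(next_after_add, classify_change(v2, v3))
```

A "major" result is the signal that a deprecation notice and consumer communication are needed before release.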

### Retire

When a Data Product is no longer needed, relevant, or cost-justified (declining usage, superseded by a better product, or no longer aligned with business needs), it is formally decommissioned. Consumers are notified, migration guidance is provided, and pipelines and catalog entries are archived or removed.

Running through every phase of this lifecycle is a single foundational question: where does the data physically live, and who owns it?

## Grounding a Data Product

The engine beneath a Data Product is an important decision, but not the first one. Choosing it first is what causes the "federation vs. materialization" debate, which usually misses the point. The right order is: **ownership first, consumption pattern second, and engine last.** When worked in that order, most of the argument disappears.

### Ownership

A Data Product makes promises to a consumer: a schema, an SLA, a freshness guarantee, and a versioning policy. Those promises can only be kept if the producing team owns a versioned artifact that the contract binds to. Without that artifact, there is nothing durable beneath the contract.

The axis that matters is not federation vs. materialization. It is **ownership of a versioned artifact.** A query engine running over a table your team owns, versions, and audits is a perfectly good foundation. A query engine translating across systems you do not control at runtime is not. The test is simple: is there a versioned, owned artifact the contract is bound to? If yes, there is a product. If no, there is a query alias dressed up in product language.

**Illustration:** Consider two architectures using the same query engine (Trino):

* **Iceberg owned by one team, Trino as the query engine.**\
  The artifact beneath is durable and under deliberate control. Schema changes go through a release cycle; lineage is intrinsic; time travel is supported. The contract sits on something real.
* **Trino federates across Iceberg, Postgres, and MongoDB at runtime.**\
  No single team owns the composite. Schema drift, connection instability, and partition changes from any of the three sources leak through to every consumer. Nothing durable sits beneath the contract.

Same query engine. Completely different foundations.

```
┌─────────────────────────────────────────────────────────────┐
│  MATERIALIZED DATA PRODUCT                                  │
│                                                             │
│     Consumer                                                │
│        │                                                    │
│        ▼                                                    │
│      Trino         ◄──  query engine                        │
│        │                                                    │
│        ▼                                                    │
│    ┌─────────────────────────────────┐                      │
│    │  Iceberg table                  │  ◄──  owned,         │
│    │  owned · versioned · audited    │       versioned      │
│    └─────────────────────────────────┘       artifact       │
│                                                             │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│  FEDERATED ALIAS, NO OWNED ARTIFACT                         │
│                                                             │
│     Consumer                                                │
│        │                                                    │
│        ▼                                                    │
│      Trino         ◄──  same engine, but no artifact        │
│        ╎                beneath the contract                │
│        ╎  runtime                                           │
│        ├╌╌╌╌╌╌╌╌╌╌╌►  Iceberg                               │
│        ├╌╌╌╌╌╌╌╌╌╌╌►  Postgres (OLTP)                       │
│        └╌╌╌╌╌╌╌╌╌╌╌►  MongoDB                               │
│                                                             │
│     ╎  dashed = runtime translation, no materialization     │
└─────────────────────────────────────────────────────────────┘

     The engine is not the issue; ownership of the artifact is.
```

A stable, versioned, owned artifact can take several shapes:

* A physical table in a warehouse or lakehouse (Parquet, Iceberg, Delta)
* An incrementally maintained materialized view with a defined refresh policy
* A versioned snapshot or time-travel table
* A zero-copy clone over immutable files

What they share: someone owns it, it has a version, it has known freshness, and its shape does not change beneath the consumer without a release cycle.

**The operational test:** "What did this dataset look like at 3pm yesterday, and who accessed it?" If the answer is not deterministic, there is no product.
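The operational test can be sketched against a snapshot history and an access log. The structures below are hypothetical stand-ins for whatever the table format and audit system actually record. The point is only that both questions resolve to a deterministic lookup.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Snapshot:
    committed_at: datetime
    snapshot_id: str


def snapshot_as_of(snapshots, ts):
    """Return the latest snapshot committed at or before ts, or None.

    Assumes `snapshots` is sorted by committed_at, ascending.
    """
    times = [s.committed_at for s in snapshots]
    i = bisect_right(times, ts)
    return snapshots[i - 1] if i else None


def readers_of(access_log, snapshot_id):
    """Return the sorted, de-duplicated principals that read a given snapshot."""
    return sorted({who for _, who, snap in access_log if snap == snapshot_id})


history = [
    Snapshot(datetime(2024, 3, 1, 2, 0), "snap-001"),
    Snapshot(datetime(2024, 3, 1, 14, 0), "snap-002"),
    Snapshot(datetime(2024, 3, 2, 2, 0), "snap-003"),
]
access_log = [  # (timestamp, principal, snapshot read)
    (datetime(2024, 3, 1, 14, 30), "analyst_a", "snap-002"),
    (datetime(2024, 3, 1, 16, 0), "agent_b", "snap-002"),
    (datetime(2024, 3, 2, 9, 0), "analyst_a", "snap-003"),
]

at_3pm = snapshot_as_of(history, datetime(2024, 3, 1, 15, 0))
readers = readers_of(access_log, at_3pm.snapshot_id)
```

A runtime federation over sources the team does not control has no equivalent of `history`, which is why it fails the test.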

### Consumption

Once ownership is settled, the question is how the product will be consumed. The consumption pattern determines freshness, latency, concurrency, and consistency requirements, which in turn determine what engine can credibly serve the contract.

Common consumption patterns:

* **Analytical / BI.** Wide scans, aggregations, a few seconds of latency, moderate concurrency. Humans behind dashboards and analysts behind notebooks.
* **Operational / API.** Point lookups, millisecond latency, high concurrency, and sometimes transactional consistency. Applications calling for single records or small result sets.
* **Real-time analytics.** Sub-second aggregation over fresh data, very high concurrency. Product features, live dashboards, and in-app metrics.
* **AI training.** Large batch reads, strict reproducibility, full lineage. Runs periodically, tolerates minutes of latency, cannot tolerate undefined inputs.
* **AI inference / feature serving.** Millisecond reads of precomputed features at very high concurrency, with feature definitions consistent with those used in training.

### Engine

With ownership established and the consumption pattern named, the engine decision becomes practical rather than philosophical.

**OLTP engines** (Postgres, MySQL, Spanner, CockroachDB). Row-oriented, tuned for point lookups, high write concurrency, and transactional consistency. The right home for operational Data Products. The wrong home for analytical scans.

**OLAP engines** (three sub-families):

* Cloud data warehouse and lakehouse (Snowflake, BigQuery, Redshift, Databricks). Batch and near-real-time analytics, wide scans, large volumes. The workhorse for analytical products, AI training datasets, and BI.
* Real-time and serving OLAP (ClickHouse, Druid, Pinot, StarRocks). Sub-second aggregation over fresh data at high concurrency. User-facing analytics, operational dashboards, ML feature serving.
* Embedded OLAP (DuckDB, Polars). In-process, single-node. CI tests, local development, small products.

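**Federation engines** (Trino, Dremio, Denodo, Spark SQL federation). Query-time translation, no owned storage. Useful for exploratory cross-source queries and bootstrapping before a canonical model exists. Not suitable as the foundation of a consumer-facing product when used to federate across sources the team does not control: federation owns no artifact, so nothing durable sits beneath the contract. The same engine querying a table the team owns and versions, as in the Trino-over-Iceberg illustration above, is a different case.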

The decision reduces to matching the consumption pattern to the engine family: milliseconds and point lookups go to OLTP or serving OLAP; seconds and wide scans go to a cloud data warehouse; sub-second aggregation at high concurrency goes to serving OLAP; reproducible batch reads for training go to the lakehouse.
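The ordering (ownership first, consumption pattern second, engine last) can be written down as a small lookup. The enum, the mapping, and the function name below are illustrative assumptions that restate the prose; real selection would also weigh cost, existing platforms, and team skills.

```python
from enum import Enum


class Pattern(Enum):
    ANALYTICAL_BI = "analytical_bi"
    OPERATIONAL_API = "operational_api"
    REALTIME_ANALYTICS = "realtime_analytics"
    AI_TRAINING = "ai_training"
    FEATURE_SERVING = "feature_serving"


# Candidate engine families per consumption pattern, following the prose above.
ENGINE_FAMILIES = {
    Pattern.ANALYTICAL_BI: ("cloud data warehouse",),
    Pattern.OPERATIONAL_API: ("OLTP", "serving OLAP"),
    Pattern.REALTIME_ANALYTICS: ("serving OLAP",),
    Pattern.AI_TRAINING: ("lakehouse",),
    Pattern.FEATURE_SERVING: ("OLTP", "serving OLAP"),
}


def candidate_engines(pattern, has_owned_artifact):
    """Return candidate engine families, refusing to answer before ownership is settled."""
    if not has_owned_artifact:
        raise ValueError("settle ownership of a versioned artifact before choosing an engine")
    return ENGINE_FAMILIES[pattern]


realtime = candidate_engines(Pattern.REALTIME_ANALYTICS, has_owned_artifact=True)

try:
    candidate_engines(Pattern.ANALYTICAL_BI, has_owned_artifact=False)
    refused = False
except ValueError:
    refused = True  # engine choice is blocked until an owned artifact exists
```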

### One Data Product, one engine

A tempting mistake is to serve a single Data Product from multiple engines under one name. It looks efficient. It is not.

A Data Product has one contract: one SLA, one freshness guarantee, one consistency model, one performance envelope. Those guarantees are engine-dependent. Two engines serving the same product mean two different sets of guarantees living under one name. When they diverge (and they will), no one knows which was the source of truth at which time, and the contract stops meaning anything.

The right pattern when multiple consumption shapes exist is to build multiple Data Products, chained. A gold analytical product lives in the warehouse. A serving product for the API is derived from it, materialized into a serving OLAP engine, with its own owner and its own contract. A feature product for inference is derived similarly, with its own SLA. Each has one engine, one contract, one owner. The lineage between them is explicit.

```
┌───────────────────────────────────────────────────────────────┐
│  ONE PRODUCT, THREE ENGINES (wrong)                           │
│                                                               │
│             ┌─────────────────────────────┐                   │
│             │      Customer 360           │                   │
│             │  (one name, three SLAs)     │                   │
│             └──────────────┬──────────────┘                   │
│                            │                                  │
│            ┌───────────────┼───────────────┐                  │
│            ▼               ▼               ▼                  │
│       ┌─────────┐   ┌─────────────┐   ┌─────────┐             │
│       │Warehouse│   │ Serving OLAP│   │   API   │             │
│       └─────────┘   └─────────────┘   └─────────┘             │
│                                                               │
│      One name covering three different sets of guarantees.    │
│      When they diverge, no one knows which is the truth.      │
└───────────────────────────────────────────────────────────────┘

┌───────────────────────────────────────────────────────────────┐
│  CHAINED PRODUCTS, ONE ENGINE EACH (right)                    │
│                                                               │
│      ┌───────────────────────────┐                            │
│      │  Customer 360 · Gold      │  ◄──  analytical product   │
│      │  Warehouse                │       own owner, SLA,      │
│      └─────────────┬─────────────┘       contract             │
│                    │                                          │
│           ┌────────┴────────┐                                 │
│           ▼                 ▼                                 │
│  ┌──────────────────┐  ┌──────────────────┐                   │
│  │ Customer 360 ·   │  │ Customer 360 ·   │                   │
│  │ Serving          │  │ Features         │                   │
│  │ Real-time OLAP   │  │ Feature Store    │                   │
│  └──────────────────┘  └──────────────────┘                   │
│       ▲                     ▲                                 │
│       └─ derived from Gold, with own contract and SLA         │
│                                                               │
│     Three Data Products. Three engines. Three contracts.      │
│     Shared storage is fine. Shared contracts are not.         │
└───────────────────────────────────────────────────────────────┘
```

One Data Product. One engine. One contract.
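The chained pattern can be sketched as a catalog in which each product records exactly one engine and, optionally, the product it derives from. The `DataProduct` fields, product names, and owners below are hypothetical; the sketch assumes the `derived_from` links contain no cycles.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataProduct:
    name: str
    owner: str
    engine: str                         # exactly one engine per product
    derived_from: Optional[str] = None  # upstream product name, if any


def lineage(name, catalog):
    """Walk derived_from links from a product back to its root (assumes no cycles)."""
    chain = []
    while name is not None:
        product = catalog[name]
        chain.append(product.name)
        name = product.derived_from
    return chain


products = [
    DataProduct("customer360.gold", "analytics-team", "warehouse"),
    DataProduct("customer360.serving", "api-team", "real-time OLAP", "customer360.gold"),
    DataProduct("customer360.features", "ml-team", "feature store", "customer360.gold"),
]
catalog = {p.name: p for p in products}

features_lineage = lineage("customer360.features", catalog)
```

Because `engine` is a single field rather than a list, the shape of the record itself rules out serving one product from several engines under one name.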

## Is this a Data Product?

A proposal or existing artifact is a Data Product if, and only if, it passes all eight of the following. Anything short of that is a dataset, a pipeline output, a view, or a prototype. Those are all legitimate things. They are not Data Products and should not be labeled or governed as such.

**Identity and intent**

1. **Named owner:** A single team or individual is accountable for quality, evolution, and deprecation. "The data team" is not an owner. A name is.
2. **Written spec:** The product has a document describing its consumers, the decisions or actions it enables, the questions it answers, and what is explicitly out of scope. If the spec does not exist in writing, the product does not exist.
3. **Identified consumers:** At least one named consumer (human, pipeline, model, or application) exists and has agreed the product meets their need. A hypothetical future consumer does not count.

**Foundation**

4. **Owned artifact:** A versioned, materialized artifact sits beneath the contract. Not a virtual view over sources the producing team does not control. Not a federated alias. An artifact.
5. **One engine:** The product is served through a single engine with a single SLA and a single consistency model. Multiple consumption shapes mean multiple chained products, each with its own contract.

**Contract**

6. **Published contract:** Schema, semantics, SLA, freshness, versioning policy, and support model are documented and discoverable, not buried in a document no pipeline can read.
7. **Quality gates in place:** Tests, audits, and assertions run as part of the build and block publication of bad data. After-the-fact alerting does not qualify.
8. **Versioning policy:** A stated policy for how breaking changes are handled: deprecation windows, consumer migration, version retirement.

A product that passes all eight is a Data Product. A product that passes five to seven is a candidate to bring to this bar, not something to ship under the label. A product that passes fewer is a dataset, a view, or a pipeline output, and that is what it should be called.
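The rubric lends itself to a simple checklist evaluation. In the sketch below the eight keys are shorthand labels invented for this example, and any score from five up to but not including eight is treated as a candidate, which is one reading of the text.

```python
RUBRIC = (
    # Identity and intent
    "named_owner", "written_spec", "identified_consumers",
    # Foundation
    "owned_artifact", "one_engine",
    # Contract
    "published_contract", "quality_gates", "versioning_policy",
)


def classify(checks):
    """Classify an artifact from a {criterion: bool} mapping; missing keys count as failed."""
    passed = sum(1 for item in RUBRIC if checks.get(item, False))
    if passed == len(RUBRIC):
        return "data product"
    if passed >= 5:
        return "candidate"
    return "dataset, view, or pipeline output"


full = {item: True for item in RUBRIC}
no_gates = dict(full, quality_gates=False, versioning_policy=False)  # six of eight
raw_table = {"named_owner": True, "owned_artifact": True}            # two of eight

labels = [classify(full), classify(no_gates), classify(raw_table)]
```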

## Data Product tiers

Not every Data Product warrants the same rigor. A product feeding regulatory reporting must clear a higher bar than a product serving one team's internal dashboard. A single standard applied to both either over-engineers the second or under-protects the first. Three tiers are defined below with differentiated obligations. Every Data Product belongs to exactly one.

### Tier 1: critical

Products whose failure has material business, regulatory, or customer consequences. Examples: financial reporting inputs, customer-facing data served in production applications, inputs to models making consequential decisions (credit, fraud, pricing), and data subject to regulatory audit.

Obligations:

* Passes the full decision rubric without exception
* SLA is contractual, measured continuously, and reported
* Quality gates include completeness, accuracy, and reconciliation checks, not just schema validation
* Breaking changes require a minimum deprecation window and explicit consumer sign-off
* Reviewed by Governance at inception and re-reviewed annually
* Documented lineage end-to-end, including upstream systems

### Tier 2: core

Products consumed by multiple teams or domains where correctness and stability are important but the product is not regulatory or customer-visible. Examples: shared entity tables (customer, product, order), metrics used across domains, feature datasets used by multiple ML projects.

Obligations:

* Passes the full decision rubric
* SLA is published and monitored
* Quality gates include schema and completeness checks and accuracy checks where feasible
* Breaking changes require a deprecation window but not formal consumer sign-off
* Documented lineage to named upstream sources

### Tier 3: local

Products consumed by a single team for a bounded purpose. Examples: a team's internal dashboard data, exploratory products still maturing, domain-specific aggregations used by one analytics group.

Obligations:

* Passes the decision rubric, with lighter contract and consumer set requirements
* No published SLA required; best-effort freshness documented
* Quality gates include schema validation at minimum
* Breaking changes communicated to known consumers; no formal deprecation window required
* No external review required, but registered in the catalog

### Moving between tiers

Tiers are not static. A Tier 3 product that accumulates cross-team consumers graduates to Tier 2. A Tier 2 product adopted by a Tier 1 system either pulls its obligations up or gets forked into a Tier 1 variant with tighter guarantees. Downgrading a Tier 1 product is rare and requires explicit governance review, because consumers have built on the stronger guarantees.

Tiers are not a quality ranking. A Tier 3 product is not lower-quality than a Tier 1 product within its scope; it has a smaller scope and correspondingly lighter obligations. A Tier 3 product that fails its local consumers is just as broken as a Tier 1 product that fails regulatory reporting. The blast radius differs, not the standard within each tier.
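The upward movements described above can be sketched as a recommendation function. The thresholds and parameter names are assumptions: "cross-team" is read as more than one consuming team, and a Tier 2 product feeding a Tier 1 system is recommended for Tier 1 even though forking a variant is the stated alternative. Downgrades are deliberately left out because they require governance review.

```python
from enum import IntEnum


class Tier(IntEnum):
    CRITICAL = 1
    CORE = 2
    LOCAL = 3


def recommended_tier(current, consuming_teams, feeds_tier1):
    """Suggest a tier based on who consumes the product; never suggests a downgrade."""
    tier = current
    if tier is Tier.LOCAL and consuming_teams > 1:
        tier = Tier.CORE      # cross-team consumers: graduate to Tier 2
    if tier is Tier.CORE and feeds_tier1:
        tier = Tier.CRITICAL  # adopted by a Tier 1 system: pull obligations up
    return tier


graduated = recommended_tier(Tier.LOCAL, consuming_teams=3, feeds_tier1=False)
pulled_up = recommended_tier(Tier.CORE, consuming_teams=4, feeds_tier1=True)
unchanged = recommended_tier(Tier.CRITICAL, consuming_teams=1, feeds_tier1=False)
```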

Maintaining that standard means recognizing, early, the shapes that cannot meet it regardless of how they are labeled.

## Anti-patterns to avoid

The following shapes are excluded from the "Data Product" label, not because each is always wrong in isolation, but because calling them Data Products dilutes the term until it means nothing.

* **The data swamp:** A warehouse full of tables no one owns, documents, or can confidently use. An uncurated warehouse is a cost center that erodes trust every quarter it goes untended.
* **The dashboard as product:** A BI dashboard is treated as the deliverable, with no underlying artifact other consumers can use. A dashboard is a view of a product, not the product itself. Without the artifact beneath, the logic lives in the dashboard and cannot be reused, audited, or evolved.
* **The federated alias:** A "Data Product" that is a runtime query over sources the producing team does not own: a view with a nicer name. Nothing durable sits beneath the contract, and the promises it makes are promises someone else has to keep without knowing they were made.
* **The shared contract across engines:** One Data Product served from a warehouse, a serving OLAP, and an API under a single name, as if the guarantees were the same. One name with multiple SLAs is several products pretending to be one.
* **The permanent beta:** A product that has been in beta for two years, used in production, with no owner willing to commit to an SLA. Undefined maturity is a governance gap: consumers assume production, producers assume experiment, and the gap is filled with incidents.
* **The orphaned product:** A product whose original team disbanded, still running, still consumed, with no one empowered to change or deprecate it. Ownership is a property of now, not of history. An unowned product must be re-owned or retired.
* **Governance by document:** Policies, definitions, and classifications that live in documents no pipeline can read. Governance that cannot be queried cannot be enforced, and in a world of AI consumers, unenforceable governance is no governance at all.
* **The pipeline mistaken for the product:** The transformation job is shipped, scheduled, and monitored, but there is no contract, no consumer in mind, and no ownership of the output. The pipeline is plumbing. The product is the output plus its contract.

Rejecting these patterns has real costs.

## What adopting this approach costs

Adopting the Data Product approach has real costs in time, storage, and discipline. Those costs are named here because a position whose costs are hidden is one that will be abandoned the first time it is inconvenient.

* **Slower time to first query:** A Data Product begins with a spec, a contract, and a deliberate model. For one-off questions, this is overhead. The cost is accepted because most data work is not one-off, and the cost of skipping the spec compounds across every future consumer.
* **Storage for materialized artifacts:** A materialized artifact takes more storage than a virtual view. In most modern data stacks, storage is typically the cheapest component. The reproducibility, consistency, and performance bounds a materialized artifact provides justify the trade-off.
* **Fewer products, more deliberately built:** A team applying these standards ships fewer Data Products per quarter than a team shipping raw tables. A smaller number of trusted products consumed by many is worth more than a large number of untrusted tables consumed cautiously.
* **Ongoing maintenance:** Products have lifecycles. They require owners, SLA monitoring, version management, and deprecation communication. A Data Product is never done the way a one-off extract is.
* **Political cost of saying no:** Applying these standards sometimes means refusing to label something a Data Product when a stakeholder wants the label. The label means something only if it is withheld when it does not fit.
* **Up-front modeling cost:** Deliberate data modeling takes longer than reflecting whatever shape the source system produced. The payoff is downstream in query intuitiveness, evolution cost, and composability, but the cost is borne upfront. A bad model is a tax every consumer pays indefinitely.

## Roles and responsibilities

A Data Product is only as owned as the roles around it. Vague accountability is the single most common failure mode.

### The Data Product team

One team builds, runs, supports, and evolves the product end-to-end. The team owns the code (transformations, semantic model, validation), ships releases (versioned materializations), runs the service (observability, on-call, incident response), supports consumers (contract questions, deprecation notices), and plans the product's evolution.

The team is not responsible for running the underlying infrastructure: the warehouse, the lakehouse, the serving engines, and the orchestrator. It consumes those the way any engineering team consumes infrastructure. It is responsible for everything above that line, from the contract to the consumer.

### The Data Product Lead

The Data Product Lead is a named individual inside the Data Product team who owns the contract and is the accountable face of the product externally. The Lead sets the SLA, approves or rejects consumer requests for changes, declares deprecation, and makes the final call when the contract is in dispute.

The Lead is not a separate role filled by a separate person. It is a hat worn by one team member. Everyone on the team builds, runs, and supports the product; the Lead additionally carries the external accountability. The Lead is not a hierarchical position; they do not manage the team in a line-management sense.

### Governance

Governance is the function accountable for policies that cross all Data Products: classification (PII, sensitive, restricted), access controls, retention, regulatory compliance, and the rubric and tier definitions. It reviews Tier 1 products at inception and annually, and it adjudicates disputes that cross product boundaries or require policy interpretation.

Data Product teams operate within the policies Governance sets; they do not reinvent those policies per product. Governance is not a co-owner of any product and not a blocker to be routed around. It is the function that makes cross-product guarantees possible.

### Platform and infrastructure

The Data Product team engages the infrastructure provider the way any engineering team does: requesting capacity, reporting issues, and escalating when the substrate is the problem. The platform owns the substrate beneath the product; the team owns the product. Shared ownership across that line is what produces the "no one is accountable when it breaks" failure mode.

### How disputes are resolved

* **Contract disputes** (what the product should do, what the SLA should be) are resolved by the Data Product Lead.
* **Execution disputes** (whether the product is meeting its contract) are resolved within the Data Product team. If the root cause is infrastructural, the team escalates but remains accountable to the consumer throughout.
* **Policy disputes** (classification, access, compliance, and tier assignment) are resolved by Governance.

This structure matters for human consumers. For AI agents, it is not merely a desirable feature; it is the minimum foundation for safe operation.

## Data Products and AI

A Data Product defends against each of the five failure points described in the opening of this document. Its semantic model resolves what "revenue" means. Its quality gates surface whether the source is stale. Its lineage identifies which table is canonical. Its contract makes the answer auditable. A raw warehouse leaves every question open; a Data Product closes them by construction.

The deeper question is why AI agents need this more urgently than human analysts do.

### Why AI agents specifically require Data Products

A human analyst can work around missing context by asking a colleague, flagging uncertainty, or drawing on institutional knowledge. An AI agent has none of these options. Four structural reasons make Data Products unavoidable for AI, not just desirable.

* **Agents operate at a breadth no analyst team does:** A single agent can field questions across every department simultaneously. The informal context of any one analyst is useless at that breadth. Context must be externalized, shared, versioned, and machine-readable to be useful at scale.
* **Agents cannot improvise safely:** When a human analyst encounters an ambiguous table, they pause and verify before proceeding. An agent asked the same question produces a confident answer against whatever appears most plausible. The cost of a confidently wrong answer is substantially higher than the cost of pausing to clarify, and the only way to avoid the former is to give the agent context that is already correct.
* **Agents operate under contracts:** A human analyst can distinguish between a table that probably works and one that definitely works based on familiarity. An agent has no such discrimination. It needs machine-readable signals (SLAs, quality gates, data contracts) to know whether to trust what it is reading. Data Products carry these signals natively; raw tables do not.
* **Agents benefit from reuse:** Every agent in an organization can consume the same Data Product. Every new agent can be pointed at the same catalog. The context investment made once amortizes across every agent that follows, across every use case and every year. Without Data Products, every agent must carry its own context infrastructure, built from scratch, per agent, without reuse.
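
The third point above, that an agent needs machine-readable signals before trusting what it reads, can be sketched as a small gate. This is a minimal illustration only: the `ProductSignals` fields and the `can_trust` function are hypothetical names, not a DataOS API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ProductSignals:
    """Hypothetical machine-readable signals published by a Data Product."""
    last_refreshed: datetime
    max_staleness: timedelta      # freshness commitment taken from the SLA
    quality_checks_passed: bool


def can_trust(signals: ProductSignals, now: datetime) -> tuple[bool, str]:
    """Decide whether an agent should answer from this product, with a reason."""
    if not signals.quality_checks_passed:
        return False, "quality checks failed"
    if now - signals.last_refreshed > signals.max_staleness:
        return False, "data is older than the SLA allows"
    return True, "within SLA"


now = datetime(2024, 1, 5, tzinfo=timezone.utc)
# A daily product whose pipeline last succeeded four days ago.
stale_product = ProductSignals(
    last_refreshed=datetime(2024, 1, 1, tzinfo=timezone.utc),
    max_staleness=timedelta(days=1),
    quality_checks_passed=True,
)
trusted, reason = can_trust(stale_product, now)
```

With signals like these, the agent can decline or qualify its answer instead of reading the freshest-looking table and guessing.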

### The semantic model and AI accuracy

Without business context and semantics, AI and ML model accuracy tends to plateau in the range of 70 to 80 percent. Raw tables with column names like `cust_id`, `rev_ytd`, and `dt_created` are technically readable but semantically opaque to a language model. The model can write SQL against them, but it cannot know what the values mean in a business context, how they should be interpreted, or what constraints apply.

The semantic model inside a Data Product (glossaries, business descriptions, metric definitions, dimension relationships, and contextual metadata) is what allows AI to move from statistical pattern matching to business reasoning. When an agent resolves a question through a semantic model that defines "revenue" precisely, specifies the fiscal calendar, and identifies the canonical source, it is operating within a contract the organization has explicitly agreed to.
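
As a rough sketch of what resolving through a semantic model could look like, the example below maps a business term to one agreed definition and applies the declared fiscal calendar. The `MetricDefinition` shape, the table name `finance.revenue_ledger`, the metric expression, and the February fiscal-year start are all illustrative assumptions, not structures defined by DataOS or Vulcan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricDefinition:
    """One business metric as a semantic model might declare it (hypothetical shape)."""
    name: str
    description: str
    canonical_source: str         # the single table consumers should bind to
    expression: str               # how the metric is computed
    fiscal_year_start_month: int  # 1 = January


# Hypothetical semantic model: one agreed definition per business term.
SEMANTIC_MODEL = {
    "revenue": MetricDefinition(
        name="revenue",
        description="Recognized revenue net of refunds.",
        canonical_source="finance.revenue_ledger",
        expression="SUM(amount_recognized) - SUM(amount_refunded)",
        fiscal_year_start_month=2,
    ),
}


def resolve_metric(term: str) -> MetricDefinition:
    """Return the agreed definition, or fail loudly instead of guessing."""
    try:
        return SEMANTIC_MODEL[term.lower()]
    except KeyError:
        raise LookupError(f"No agreed definition for {term!r}") from None


def fiscal_quarter(month: int, fiscal_year_start_month: int) -> int:
    """Map a calendar month (1-12) to a fiscal quarter (1-4)."""
    offset = (month - fiscal_year_start_month) % 12
    return offset // 3 + 1


revenue = resolve_metric("Revenue")
january_quarter = fiscal_quarter(1, revenue.fiscal_year_start_month)
```

The design point is the failure path: an undefined term raises an error rather than letting the caller pick whichever table looks plausible.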

### Governance and AI

AI introduces two distinct governance concerns: what goes in and what comes out. The output side (ensuring AI-generated results are fair, unbiased, and transparent) is addressed by the AI or ML framework. The input side (ensuring the data fed to AI systems is compliant, accurate, and appropriate) is addressed by the Data Product.

A Data Product with natively enforced governance ensures that an AI agent only receives data that has been cleared for its consumption context. PII is masked at the product level before data reaches the agent. Access policies prevent the agent from querying data it is not authorized to use. The compliance and classification decisions are made once, in the product, rather than being re-implemented in every application that consumes the data.
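
The masking behavior described above can be illustrated with a short sketch. The column classifications, consumer names, and clearance sets are hypothetical, and the fail-closed defaults are an assumption of this example rather than documented DataOS behavior.

```python
# Hypothetical column classifications declared once, at the product level.
COLUMN_CLASSIFICATION = {
    "customer_id": "public",
    "email": "pii",
    "lifetime_value": "sensitive",
}

# Hypothetical clearances: which classifications each consumer may see in the clear.
CLEARANCES = {
    "support_agent": {"public"},
    "finance_analyst": {"public", "sensitive"},
}


def mask_row(row: dict, consumer: str) -> dict:
    """Return a copy of the row with columns the consumer is not cleared for masked."""
    allowed = CLEARANCES.get(consumer, set())  # unknown consumers are cleared for nothing
    masked = {}
    for column, value in row.items():
        # Columns without a classification are treated as restricted (fail closed).
        label = COLUMN_CLASSIFICATION.get(column, "restricted")
        masked[column] = value if label in allowed else "***"
    return masked


row = {"customer_id": 7, "email": "a@example.com", "lifetime_value": 1200}
agent_view = mask_row(row, "support_agent")
```

Because the policy lives beside the data, every consuming application gets the same masked view without re-implementing the rule.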

### MCP and the AI consumption interface

The Model Context Protocol (MCP), originated by Anthropic and now broadly adopted, defines how AI agents connect to tools and data sources. For a Data Product, MCP is the natural agent interface.

Through MCP, an agent can read a product's metadata, inspect its schema, trace its lineage, check its quality, view its run history, and issue semantic queries against it. Every one of these interactions is grounded in the product's contract. The agent does not write raw SQL against arbitrary tables; it resolves through the semantic layer, constrained by the product's definitions, and observed through its quality gates.
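
To make the interaction concrete, the sketch below builds the kind of message an agent sends when it invokes a tool. MCP messages are JSON-RPC 2.0, and tools are invoked with the `tools/call` method. The tool names (`check_quality`, `semantic_query`) and their arguments are hypothetical placeholders; the tools a given deployment actually exposes are not specified here.

```python
import json
from itertools import count

_request_ids = count(1)


def build_tool_call(tool_name: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(request)


# Hypothetical tool names and arguments, for illustration only.
quality_check = build_tool_call("check_quality", {"product": "revenue_ledger"})
semantic_query = build_tool_call(
    "semantic_query",
    {"product": "revenue_ledger", "metrics": ["revenue"], "dimensions": ["fiscal_quarter"]},
)
```

Note that the agent names a metric and a dimension, not a table and a column: the translation to SQL happens behind the product's semantic definitions.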

An agent operating against a raw warehouse is a major context-infrastructure project. The organization must build schema metadata, semantic definitions, quality signals, and lineage tracking from scratch for every agent it deploys. An agent operating against a Data Product through MCP is a consumer of context infrastructure that has already been built. The agent's job shrinks from "assemble and reason about a data landscape" to "reason within a contracted, observable product."

## Data Products in DataOS

DataOS is built to make every aspect of the Data Product approach (design, build, governance, quality, semantics, discovery, and consumption) practically achievable within a single unified platform.

### Key terms

**SLA (Service Level Agreement):** The Data Product's published, measurable commitment to its consumers on freshness, availability, completeness, and quality. Machine-readable, so downstream pipelines and AI agents can gate on it programmatically.

**Freshness:** How recently the data in a Data Product reflects its source. Can be continuous (streaming), near-real-time (CDC), hourly (incremental batch), or daily (full refresh). Published as part of the SLA.

**Lineage:** The traceable path from a source system through every transformation to the Data Product and onward to every consumer. Column-level lineage traces individual fields, not just whole tables.

**Semantic model:** The bounded context of meaning that a Data Product owns: its column definitions, business rules, domain vocabulary, entities, relationships, and metrics as they exist within the product's own boundary. Every Data Product has one by construction. The semantic model is local, self-contained, and travels with the product.

**Semantic layer:** The architectural tier above the collection of Data Products, where individual semantic models are registered, cross-product relationships are made explicit, and the organizational ontology is held. The semantic layer holds the meaning of data and the relationships between products. It does not own data; Data Products do.

**CDC (Change Data Capture):** An ingestion pattern that replicates changes (inserts, updates, deletes) from a source system in near-real-time, rather than periodically re-reading the whole dataset.

**MCP (Model Context Protocol):** An open protocol defining how AI agents connect to tools and data sources. In DataOS, MCP is the wire through which AI agents consume Data Products: the protocol the agent uses to query semantics, inspect schema, check quality, and read product metadata.

### The three-stage flow

A Data Product in DataOS is supported across three stages: Discovery, Production, and Consumption. Discovery happens before any specific Data Product exists, when a team is answering, "What data do we have, and can we build what we want from it?" Production is where a decided-upon Data Product is built. Consumption is where the built Data Product is used by humans, applications, and agents.

The engine is the through-line across all three. In Discovery, a team inspects what lives in the engine and brings in what is missing. In Production, Data Products are built in the engine. In Consumption, consumers read from the engine. DataOS is the control plane; the engine is where the bytes live.

```
┌──────────────────────────────────────────────────────────────────────┐
│                        DataOS  (control plane)                       │
│                                                                      │
│   ┌───────────────┐    ┌───────────────┐    ┌───────────────┐        │
│   │   Discovery   │───►│  Production   │───►│  Consumption  │        │
│   │  before DP    │    │ Data Product  │    │  APIs · BI ·  │        │
│   │    exists     │    │   is built    │    │   AI agents   │        │
│   └───────┬───────┘    └───────┬───────┘    └───────┬───────┘        │
│           │                    │                    │                │
│           ▼                    ▼                    ▼                │
│   ┌──────────────────────────────────────────────────────────┐       │
│   │              ENGINE  (through-line)                      │       │
│   │   DataOS Lakehouse  (Iceberg + Spark/Trino)              │       │
│   │               OR                                         │       │
│   │   External engine  (Snowflake, BigQuery, Databricks,     │       │
│   │                     Postgres, ...)                       │       │
│   └──────────────────────────────────────────────────────────┘       │
│                                                                      │
│   ═══════════════════════════════════════════════════════════════    │
│   Governance · Lineage · Observability  (horizontal, across all)     │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
```

### The engine

A Data Product materializes somewhere. That somewhere is the engine. DataOS operates on top of engines; it does not replace them. Two patterns exist, and both are first-class.

**Pattern A: DataOS Lakehouse:** DataOS provides a first-party Lakehouse built on Iceberg as the storage format, with Spark and Trino as the execution engines. The underlying object store can be S3, ADLS, or GCS. Teams that choose this pattern get a Lakehouse that DataOS governs natively: storage, compute, and governance in one coherent stack.

**Pattern B: Bring-your-own-engine:** Many organizations already have data in a governed warehouse (Snowflake, BigQuery, Databricks, Postgres) populated by existing pipelines (Fivetran, Airbyte, internal ETL). DataOS builds Data Products directly in that existing engine without requiring migration.

### Discovery

Discovery is the work a team does before any specific Data Product exists. It answers: "what data do we have, and can we build what we want from it?" No artifacts are produced at this stage. No contracts are signed. What Discovery produces is a decision: build it, ingest what is missing and then build it, or reconsider the approach.

**Metis** is the metadata and catalog layer. Before running a single SQL query, a team can browse what DataOS already knows about the data landscape: schemas, tables, columns, types, descriptions, tags, business terms, lineage traced back to source systems, and profile information (cardinalities, null rates, distributions, sample values). Metis turns a directionless search into a directed inspection.
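
To show what the profile information mentioned above amounts to, here is a minimal sketch that computes a null rate, a cardinality, and sample values for one column. It is not how Metis computes profiles; the function name and output keys are assumptions for illustration.

```python
def profile_column(values: list) -> dict:
    """Compute simple profile statistics for one column's values."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    distinct = set(non_null)
    return {
        "row_count": total,
        "null_rate": (total - len(non_null)) / total if total else 0.0,
        "cardinality": len(distinct),          # number of distinct non-null values
        # Assumes the values are mutually comparable (all strings, all numbers).
        "sample_values": sorted(distinct)[:3],
    }


profile = profile_column(["EU", "US", None, "EU", "APAC", None, "US", "EU"])
```

A quarter of the values being null in a column a team planned to group by is the kind of finding that should surface during Discovery, not after the product ships.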

**Workbench** is the interactive SQL and exploration environment. Once candidate datasets are identified, the team writes queries, inspects samples, validates hypotheses, and checks edge cases against the real substrate with governed access and captured lineage.

**Nilus** handles ingestion when Discovery concludes that the team needs data not currently in the engine. It covers batch and CDC ingestion, a wide range of source systems (operational databases, warehouses, streaming platforms, SaaS APIs), schema absorption at the boundary, data masking for sensitive fields, and metadata scanning and profiling that feeds back into Metis.
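
The difference between CDC and a periodic full re-read is easiest to see in code. The sketch below applies a stream of change events to a replica keyed by primary key. The event shape (`op`, `key`, `row`) is a hypothetical simplification and not the format Nilus uses.

```python
def apply_change(replica: dict, event: dict) -> None:
    """Apply one change event to a replica keyed by primary key (mutates in place)."""
    op, key = event["op"], event["key"]
    if op == "insert":
        replica[key] = dict(event["row"])
    elif op == "update":
        # Merge only the changed columns; tolerate an update for a row not yet seen.
        replica.setdefault(key, {}).update(event["row"])
    elif op == "delete":
        replica.pop(key, None)
    else:
        raise ValueError(f"Unknown operation: {op!r}")


replica = {}
events = [
    {"op": "insert", "key": 1, "row": {"name": "Ada", "plan": "free"}},
    {"op": "insert", "key": 2, "row": {"name": "Lin", "plan": "pro"}},
    {"op": "update", "key": 1, "row": {"plan": "pro"}},
    {"op": "delete", "key": 2},
]
for event in events:
    apply_change(replica, event)
```

Only four small events cross the wire, yet the replica ends in the same state a full re-read of the source would produce.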

```
                    Question: what do we have,
                and can we build what we want from it?

         ┌─────────────────────┬─────────────────────┐
         │                     │                     │
         ▼                     ▼                     ▼
┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐
│     METIS        │  │    WORKBENCH     │  │      NILUS       │
│                  │  │                  │  │                  │
│  metadata        │  │  interactive SQL │  │  ingestion if    │
│  lineage         │  │  exploration     │  │  data is missing │
│  profile         │  │  EDA             │  │  batch + CDC     │
│                  │  │                  │  │  masking + scan  │
└────────┬─────────┘  └────────┬─────────┘  └────────┬─────────┘
         │                     │                     │
         └─────────────────────┼─────────────────────┘
                               ▼
                  ┌──────────────────────────┐
                  │       DECISION           │
                  │                          │
                  │  Yes:  build it          │
                  │  No:   rethink           │
                  │  Not yet:  ingest first  │
                  └──────────────────────────┘
```

The three tools are not a linear sequence. A team might start in Metis, dip into Workbench to verify, realize data is missing, trigger Nilus, and then re-scan and explore again before deciding. Discovery is iterative.

### Production

Production is where Data Products are built. The engine hosts the data and runs the compute. The build stack above the engine is **Vulcan**, which turns raw landed data into a versioned, validated, contracted Data Product with a thin API layer for consumers.

Vulcan provides:

* **Declarative transformations in SQL or Python:** SQL for the bulk of transformations; Python for logic that is difficult in SQL (API calls, ML scoring, complex business rules). Both coexist in one project.
* **In-band validation:** A validation gauntlet before publication: a linter catches syntax errors, tests confirm expected outputs, signals verify dependencies, and assertions and quality checks block bad data from being published.
* **Semantic model:** Business metrics and dimensions declared once, consumed everywhere. The semantic definition is what consumers bind to, not raw tables. Train/serve skew is eliminated because offline analysis, online features, and AI agents all resolve through the same semantic definitions.
* **Materialization into the engine:** The Data Product is materialized as an artifact in the chosen engine: an Iceberg table in the Lakehouse, or a table in Snowflake, BigQuery, Databricks, or whichever applies.
* **APIs:** Auto-generated REST and GraphQL APIs from the semantic model. Consumers read the API rather than running raw SQL.
* **CI/CD with plan-style previews:** Changes are planned, diffed against the current state, reviewed, approved, and can be rolled back. Breaking changes follow API-style versioning.
* **Observability as first-class output:** Freshness, quality signals, run history, and lineage are emitted as machine-readable signals, not dashboard-only artifacts.
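
The plan-style preview in the list above can be illustrated with a schema diff that classifies a change before it is applied. This is a sketch under stated assumptions, not Vulcan's planner: the rule that removals and type changes are breaking, and the `major.minor` version scheme, are choices made for this example.

```python
def plan_change(current: dict, proposed: dict) -> dict:
    """Diff two schemas (column name -> type) and classify the change."""
    removed = sorted(set(current) - set(proposed))
    added = sorted(set(proposed) - set(current))
    retyped = sorted(c for c in set(current) & set(proposed) if current[c] != proposed[c])
    # Assumption: removing or retyping a column breaks consumers; adding one does not.
    breaking = bool(removed or retyped)
    return {"added": added, "removed": removed, "retyped": retyped, "breaking": breaking}


def next_version(version: str, plan: dict) -> str:
    """Bump major for breaking plans, minor for additive ones, else leave unchanged."""
    major, minor = (int(part) for part in version.split("."))
    if plan["breaking"]:
        return f"{major + 1}.0"
    if plan["added"]:
        return f"{major}.{minor + 1}"
    return version


current = {"order_id": "bigint", "amount": "decimal"}
additive = plan_change(current, {**current, "currency": "varchar"})
breaking = plan_change(current, {"order_id": "bigint", "amount": "double"})
```

Producing the plan as data, rather than applying the change directly, is what lets a reviewer or an automated gate approve or reject it first.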

```
 ┌─────────────────────────────────────────────────────────────────────┐
 │                           VULCAN                                    │
 │            (declarative build stack above the engine)               │
 │                                                                     │
 │    SQL / Python models                                              │
 │          │                                                          │
 │          ▼                                                          │
 │    ┌─────────────────────────────────────────────────────────┐      │
 │    │  Linter ─► Tests ─► Signals ─► Assertions ─► Checks     │      │
 │    │              (in-band validation gauntlet)              │      │
 │    └─────────────────────────────────────────────────────────┘      │
 │          │                                                          │
 │          ▼                                                          │
 │    Semantic Model  (metrics, dimensions, relationships)             │
 │          │                                                          │
 │          ├──────►  Materialized artifact in the ENGINE              │
 │          └──────►  Auto-generated REST / GraphQL API                │
 │                                                                     │
 └─────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
                  ┌──────────────────────┐
                  │      ENGINE          │
                  │  compute  +  storage │
                  └──────────────────────┘
```
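
The validation gauntlet shown in the diagram is, at its core, an ordered series of gates in which the first failure blocks publication. A minimal sketch follows; the gate names and the row shape are hypothetical, and real linters, tests, and signals would be far richer than these lambdas.

```python
from typing import Callable, Optional

Rows = list[dict]
Gate = tuple[str, Callable[[Rows], bool]]

# Hypothetical gates, run in order; each returns True when the batch passes.
GATES: list[Gate] = [
    ("not_empty", lambda rows: len(rows) > 0),
    ("order_id_present", lambda rows: all(r.get("order_id") is not None for r in rows)),
    ("amount_non_negative", lambda rows: all(r.get("amount", 0) >= 0 for r in rows)),
]


def run_gauntlet(rows: Rows, gates: list[Gate] = GATES) -> tuple[bool, Optional[str]]:
    """Run gates in order and stop at the first failure, naming the gate that failed."""
    for name, check in gates:
        if not check(rows):
            return False, name
    return True, None


good_batch = [{"order_id": 1, "amount": 10.0}, {"order_id": 2, "amount": 0.0}]
bad_batch = [{"order_id": 1, "amount": 10.0}, {"order_id": 2, "amount": -5.0}]
publishable, failed_gate = run_gauntlet(bad_batch)
```

Returning the name of the failing gate, not just a boolean, is what turns a blocked publish into an actionable signal.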

**AI-assisted construction:** Through an MCP interface, Vulcan exposes build-time tools that an AI agent can invoke: concept explanation and syntax templates, code review for SQL and YAML, retrieval of real working examples from curated projects, design advisement (turning a use case into a structured specification), scaffold generation (a complete file manifest covering seed, staging, final, semantics, checks, and tests), metadata enrichment, and quality rule suggestion.

**Production is a lifecycle, not a one-shot build:** A Data Product's first release is not its final state. Consumers use it, contracts tighten, edge cases surface, business definitions shift, and new consumers arrive. Production covers the whole lifecycle: initial build, observation, iteration, versioned evolution, and eventual deprecation. The same Vulcan capabilities that build the product also carry it through continuous improvement.

### Consumption

Once a Data Product is built, DataOS exposes it through several parallel surfaces, each matched to a type of consumer. The contract behind each surface is the same (one product, one engine, one set of guarantees); the protocol adapts to the consumer.

* **REST and GraphQL APIs** auto-generated from the semantic model. Applications, services, and modern frontends call these directly.
* **Database wire protocols (Postgres, MySQL)** so BI tools and SQL clients connect as if the Data Product were a regular database. Tableau, Power BI, Superset, and Excel all reach Data Products this way.
* **SDKs and notebook access** for analysts and ML practitioners. Python SDKs expose Data Products to Jupyter, training pipelines, and custom applications.
* **MCP runtime tools for AI agents.** A set of structured tools through which an agent can query semantics, inspect schema, check quality and freshness, view run history, and read product metadata. This is the agent equivalent of what BI does for humans, except machine-readable and semantically grounded.
* **Data Product Hub.** A catalog surface where humans find Data Products, understand their contracts, and activate them against the BI or ML tool of their choice.

```
 ┌─────────────────────────────────────────────────────────────────────┐
 │                        CONSUMERS                                    │
 │  Apps · BI tools · Notebooks · ML pipelines · AI agents · Hub       │
 └─────────────────────────────────────────────────────────────────────┘
                                 │
    ┌───────────────┬────────────┼────────────┬─────────────────┐
    ▼               ▼            ▼            ▼                 ▼
 ┌──────┐      ┌─────────┐  ┌─────────┐  ┌─────────┐     ┌──────────────┐
 │ REST │      │ GraphQL │  │ PG/MySQL│  │ Python  │     │ MCP tools    │
 │ API  │      │  API    │  │  wire   │  │  SDK    │     │ (schema,     │
 │      │      │         │  │         │  │         │     │  lineage,    │
 │      │      │         │  │         │  │         │     │  quality,    │
 │      │      │         │  │         │  │         │     │  semantic    │
 │      │      │         │  │         │  │         │     │  query)      │
 └──┬───┘      └────┬────┘  └────┬────┘  └────┬────┘     └──────┬───────┘
    │               │            │            │                 │
    └───────────────┴────────────┼────────────┴─────────────────┘
                                 │
                                 ▼
                   ┌──────────────────────────┐
                   │  DATA PRODUCT            │
                   │  one contract,           │
                   │  one engine,             │
                   │  many surfaces           │
                   └──────────────────────────┘
```

### Governance, lineage, and observability

Three horizontal capabilities apply across all stages, regardless of which stage is active.

**Governance:** Classification (PII, sensitive, restricted) is applied to data as it is discovered and ingested, propagates through Production, and is enforced at Consumption. Access control, row-level and column-level policies, and retention are properties of the product, declared once and enforced across every surface.

**Lineage:** Column-level lineage traces from source systems, through ingestion, through Vulcan transformations, into the materialized artifact, and onward to every consumption surface. This is lineage by construction, not lineage reconstructed from logs after the fact. Auditors, agents, and compliance reviewers all consume the same lineage graph.
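
Column-level lineage is a graph, and the question an auditor or agent typically asks of it is "which source-system fields does this column ultimately come from?" The sketch below answers that with a depth-first walk. The column names and the dictionary representation are hypothetical; they do not reflect how DataOS stores lineage.

```python
# Hypothetical column-level lineage: each column maps to the columns it is derived from.
LINEAGE = {
    "revenue_product.net_revenue": ["staging.orders.amount", "staging.refunds.amount"],
    "staging.orders.amount": ["crm.orders.amount_cents"],
    "staging.refunds.amount": ["billing.refunds.amount_cents"],
}


def trace_sources(column: str, lineage: dict[str, list[str]]) -> set[str]:
    """Return the source-system columns (those with no upstream) feeding a column."""
    sources, stack, seen = set(), [column], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against revisiting shared or cyclic upstreams
        seen.add(current)
        upstream = lineage.get(current, [])
        if upstream:
            stack.extend(upstream)
        else:
            sources.add(current)
    return sources


origins = trace_sources("revenue_product.net_revenue", LINEAGE)
```

Because the graph is recorded as transformations are built, this lookup is a traversal rather than a forensic reconstruction from logs.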

**Observability:** Freshness, completeness, quality, run history, and cost are emitted as first-class signals from every stage. A Data Product that is stale, failed, or degraded announces itself rather than being discovered by a consumer.

```
   ┌───────────────────────────────────────────────────────────────┐
   │                    HORIZONTAL CAPABILITIES                    │
   │                                                               │
   │  Governance    │  Classification, access policy, retention    │
   │  Lineage       │  Column-level, source to consumer            │
   │  Observability │  Freshness, quality, runs, cost              │
   │                                                               │
   └───────────────────────────────────────────────────────────────┘
              ▲                   ▲                   ▲
              │ applied at        │ applied at        │ applied at
   ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
   │    Discovery    │  │   Production    │  │  Consumption    │
   │ scan, classify  │  │ transform under │  │ enforce policy  │
   │ raw sources     │  │ classification  │  │ at every access │
   └─────────────────┘  └─────────────────┘  └─────────────────┘
```

### The semantic layer above Data Products

Individual Data Products do not know about each other. A marketing campaigns product has no built-in knowledge that campaigns influence revenue. The revenue ledger has no built-in knowledge that bonus disbursement depends on it. Semantic information is siloed at the product level by design, because that siloing is what makes each product independently deployable, owned, and versioned.

But organizational reasoning operates across products. When a question spans marketing data, revenue data, and the organizational understanding that campaigns influence revenue, that chain of reasoning has to live somewhere above the individual products.

The semantic layer is the architectural tier above the Data Products. It is the layer where each Data Product's semantic model is registered, where cross-product relationships are made explicit (the organizational ontology records typed connections between concepts across products), and where audience-aware framing becomes possible.

The semantic layer does not own data. Data Products do. The layer holds the meaning of data and the relationships between Data Products. This boundary is the single most important thing to get right: conflating the two breaks the architecture.

```
┌─────────────────────────────────────────────────────────────────────┐
│                         AI AGENTS                                   │
│              reason through the layer below                         │
│              do NOT scrape Data Products directly                   │
└─────────────────────────────────┬───────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────┐
│                        SEMANTIC LAYER                               │
│                                                                     │
│   Registers each Data Product's semantic model                      │
│   Holds the cross-product organizational ontology                   │
│                                                                     │
│   campaign_spend    ──influences──►  pipeline_revenue               │
│   customer_feedback ──predicts────►  pipeline_revenue               │
│                                                                     │
│   Holds meaning and relationships. Does NOT own data.               │
└─────────────────────────────────┬───────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────┐
│                          DATA PRODUCTS                              │
│                                                                     │
│   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐            │
│   │  Marketing   │   │   Revenue    │   │   Customer   │            │
│   │    Spend     │   │    Ledger    │   │   Feedback   │            │
│   │  semantic    │   │  semantic    │   │  semantic    │            │
│   │  model       │   │  model       │   │  model       │            │
│   └──────────────┘   └──────────────┘   └──────────────┘            │
│                                                                     │
│   Each product is sovereign. Each owns its own data.                │
└─────────────────────────────────────────────────────────────────────┘
```

A Data Product can answer questions that live inside its bounded context. The semantic layer is what lets AI agents answer questions that span Data Products without hallucinating connections the data did not actually contain. Reasoning across products is only safe when the cross-product relationships are explicit and typed.
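
One way to picture "explicit and typed" is a list of declared edges that an agent may traverse and nothing else. The sketch below uses the two relationships from the diagram above and searches for a declared path between concepts. The tuple representation and the function name are assumptions for illustration, not the semantic layer's actual interface.

```python
from collections import deque

# Typed cross-product relationships, mirroring the diagram above.
ONTOLOGY = [
    ("campaign_spend", "influences", "pipeline_revenue"),
    ("customer_feedback", "predicts", "pipeline_revenue"),
]


def find_relationship_path(start: str, goal: str, edges=ONTOLOGY):
    """Breadth-first search over declared edges; None means no declared connection."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        concept, path = queue.popleft()
        if concept == goal:
            return path
        for source, relation, target in edges:
            if source == concept and target not in visited:
                visited.add(target)
                queue.append((target, path + [(source, relation, target)]))
    return None


declared = find_relationship_path("campaign_spend", "pipeline_revenue")
undeclared = find_relationship_path("campaign_spend", "customer_feedback")
```

When no path is declared, the function returns `None`, which gives the agent a principled reason to say "no known relationship" rather than inventing one.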

### Full-stack and overlay patterns

Two adoption shapes exist, both first-class.

**Full-stack:** The organization adopts DataOS end-to-end. The DataOS Lakehouse is the engine. Discovery uses Metis, Workbench, and Nilus together. Vulcan builds Data Products. Consumption goes through the DataOS surfaces. Governance, lineage, and observability are DataOS-native across the board.

This fits when the organization is building its data platform fresh or undertaking a deliberate replatform, has no pre-existing warehouse investment to preserve, or values a coherent single-vendor stack with minimal integration surface.

**Overlay:** The organization already has a governed engine (Snowflake, BigQuery, Databricks) populated by existing pipelines (Fivetran, Airbyte, internal ETL). DataOS is adopted as an overlay: the existing engine stays, Discovery focuses on Metis and Workbench against what is already there, Vulcan builds Data Products inside the existing engine, and consumers access through the DataOS surfaces.

This fits when the organization has a significant investment in an existing warehouse and wants Data Product discipline without migration. The pain point is not ingestion or storage, but the discipline and consumability above the engine.

```
     FULL-STACK                              OVERLAY
     (new / replatform)                      (existing engine)

  ┌──────────────────────────┐            ┌──────────────────────────┐
  │  DataOS control plane    │            │  DataOS control plane    │
  ├──────────────────────────┤            ├──────────────────────────┤
  │  Discovery               │            │  Discovery               │
  │  Metis + Workbench       │            │  Metis + Workbench       │
  │  + Nilus (ingest)        │            │  (no ingestion)          │
  │          │               │            │          │               │
  │          ▼               │            │          ▼               │
  │  Vulcan build            │            │  Vulcan build            │
  │          │               │            │          │               │
  │          ▼               │            │          ▼               │
  │  DataOS Lakehouse        │            │  Existing engine         │
  │  Iceberg + Spark/Trino   │            │  Snowflake / BigQuery /  │
  │          │               │            │  Databricks / ...        │
  │          ▼               │            │          ▲               │
  │  Consumption surfaces    │            │  Existing pipelines      │
  │                          │            │  (Fivetran / Airbyte)    │
  │                          │            │          │               │
  │                          │            │          ▼               │
  │                          │            │  Consumption surfaces    │
  └──────────────────────────┘            └──────────────────────────┘
```
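The difference between the two patterns can also be written down as data. The following is a minimal Python sketch, not a DataOS API: `AdoptionPattern`, its fields, and `differences` are hypothetical names introduced for illustration, and the component names are taken from the prose above.

```python
from dataclasses import dataclass, fields
from typing import Dict, Tuple


@dataclass(frozen=True)
class AdoptionPattern:
    """Hypothetical summary of one adoption pattern (illustrative only)."""
    name: str
    engine: str
    discovery: Tuple[str, ...]
    ingestion: str
    build: str = "Vulcan"                      # same in both patterns
    consumption: str = "DataOS surfaces"       # same in both patterns


FULL_STACK = AdoptionPattern(
    name="full-stack",
    engine="DataOS Lakehouse",
    discovery=("Metis", "Workbench", "Nilus"),
    ingestion="Nilus",
)

OVERLAY = AdoptionPattern(
    name="overlay",
    engine="existing engine",
    discovery=("Metis", "Workbench"),
    ingestion="existing pipelines",
)


def differences(a: AdoptionPattern, b: AdoptionPattern) -> Dict[str, tuple]:
    """Return only the fields on which the two patterns disagree."""
    return {
        f.name: (getattr(a, f.name), getattr(b, f.name))
        for f in fields(a)
        if getattr(a, f.name) != getattr(b, f.name)
    }


changed = differences(FULL_STACK, OVERLAY)
```

In this sketch, `changed` contains the engine, discovery, and ingestion entries (plus the pattern name), while the build and consumption layers do not appear because they are identical in both patterns.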

The two patterns are not a hierarchy. An organization can use overlay for some products and full-stack for others, or migrate over time, or stay in overlay permanently. What matters is that the Data Product principles are upheld on whichever pattern is chosen: one owned artifact per product, one engine per product, one contract per product, a named owner, published quality, and versioning that respects consumers.
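The principles listed above lend themselves to a mechanical check. The sketch below assumes a hypothetical `ProductDescriptor` record; the field names and the `principle_violations` function are invented for illustration and do not reflect an actual DataOS schema. It only checks that a version is declared, since whether a version change respects consumers cannot be decided from a single descriptor.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProductDescriptor:
    """Hypothetical descriptor: one owned artifact describing one Data Product."""
    name: str
    owner: str
    engines: List[str]
    contracts: List[str]
    quality_published: bool
    version: str


def principle_violations(product: ProductDescriptor) -> List[str]:
    """List the Data Product principles the descriptor fails to uphold."""
    problems = []
    if not product.owner.strip():
        problems.append("no named owner")
    if len(product.engines) != 1:
        problems.append("expected exactly one engine")
    if len(product.contracts) != 1:
        problems.append("expected exactly one contract")
    if not product.quality_published:
        problems.append("quality is not published")
    if not product.version.strip():
        problems.append("no version declared")
    return problems


# The same checks apply whether the engine is the Lakehouse or an existing one.
overlay_product = ProductDescriptor(
    name="customer_revenue",
    owner="finance-data-team",
    engines=["snowflake"],
    contracts=["customer_revenue.contract"],
    quality_published=True,
    version="1.2.0",
)

draft_product = ProductDescriptor(
    name="orders_draft",
    owner="",
    engines=["snowflake", "lakehouse"],
    contracts=[],
    quality_published=False,
    version="0.1.0",
)
```

Calling `principle_violations(overlay_product)` returns an empty list, while `draft_product` fails on ownership, engine count, contract count, and published quality.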

Each Data Product requirement maps to one or more DataOS components:

| Data Product Requirement                    | DataOS Component                          |
| ------------------------------------------- | ----------------------------------------- |
| Connect to source systems                   | Depots                                    |
| Build ingestion pipelines                   | Nilus                                     |
| Apply business semantics and define metrics | Vulcan                                    |
| Catalog and make discoverable               | Data Product Hub                          |
| Govern and control access                   | Bifrost and the Policy Engine             |
| Monitor quality and enforce SLAs            | Built-in quality checks and observability |
| Enable self-service consumption             | Data Product Hub and Workbench            |
| Serve AI agents and applications            | APIs and MCP interface                    |

DataOS is designed so that Data Products sit above the compute engine rather than inside it. A Data Product built in DataOS can draw from Snowflake, BigQuery, Databricks, Postgres, or the DataOS native Lakehouse. The same contract, the same semantic layer, and the same consumption surfaces apply regardless of which engine holds the data. Consumers bind to the product's contract rather than to any particular storage system.
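To make "binding to the contract" concrete, here is a minimal Python sketch under stated assumptions: `Contract`, `Engine`, `InMemoryEngine`, and `DataProduct` are hypothetical stand-ins, not DataOS classes, and the in-memory engines simply simulate two storage systems with different physical layouts.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol, Tuple


@dataclass(frozen=True)
class Contract:
    """Hypothetical contract: the columns a consumer is promised."""
    product: str
    columns: Tuple[str, ...]


class Engine(Protocol):
    """Anything that can return rows restricted to the requested columns."""
    def fetch(self, columns: Tuple[str, ...]) -> List[Dict[str, object]]: ...


class InMemoryEngine:
    """Stand-in for a real engine; rows may carry extra physical columns."""
    def __init__(self, label: str, rows: List[Dict[str, object]]) -> None:
        self.label = label
        self._rows = rows

    def fetch(self, columns: Tuple[str, ...]) -> List[Dict[str, object]]:
        # Project each stored row down to the contracted columns only.
        return [{c: row[c] for c in columns} for row in self._rows]


@dataclass
class DataProduct:
    contract: Contract
    engine: Engine

    def read(self) -> List[Dict[str, object]]:
        # Consumers see only what the contract exposes, never the engine.
        return self.engine.fetch(self.contract.columns)


def total_revenue(product: DataProduct) -> float:
    """Consumer code: depends on the contract's columns, not on storage."""
    return sum(row["revenue"] for row in product.read())


contract = Contract(product="customer_revenue", columns=("customer_id", "revenue"))

warehouse = InMemoryEngine("existing warehouse", [
    {"customer_id": 1, "revenue": 100.0, "_load_ts": "2024-01-01"},
    {"customer_id": 2, "revenue": 250.0, "_load_ts": "2024-01-01"},
])
lakehouse = InMemoryEngine("lakehouse", [
    {"customer_id": 1, "revenue": 100.0, "partition": "a"},
    {"customer_id": 2, "revenue": 250.0, "partition": "b"},
])

on_warehouse = DataProduct(contract, warehouse)
on_lakehouse = DataProduct(contract, lakehouse)
```

`total_revenue` gives the same result for `on_warehouse` and `on_lakehouse`, and the engine-specific columns (`_load_ts`, `partition`) never reach the consumer. Swapping the engine changes nothing in the consumer code, which is the property the paragraph above describes.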

