# DataOS

You are probably here because someone decided your team would adopt **Data Products**, or adopt DataOS, and the reasoning did not arrive with the decision. This page is the reasoning. It is written for the engineer who has to build the thing, not the executive who approved it, and it assumes you are skeptical, because you should be.

Data engineering patterns do not change for fashion. They change when an old constraint lifts or a new need outgrows them. Both have now happened. Here is the chain, with nothing skipped.

***

## Your numbers are right. The table still is not usable.

You load raw data and transform it in place. This is ELT, and it is correct. Storage is cheap, compute is elastic, and the raw layer survives, so a wrong join is an edit, not a re-ingestion. None of that is the problem.

The problem is what ELT was never built to do. It solves *producing* a transformed table. It does nothing for *consuming* one. The numbers are right, and the table is still half a product.

{% hint style="info" %}
A table carries its rows and its schema. It does not carry how to find it, whether today's data is fresh enough to trust, who may see which rows, what a column actually means, or where to report it when it breaks.

That half lives in your head, in a wiki, and in the one analyst who knows `revenue` means net of refunds.
{% endhint %}

***

## You stack tools to fix that. They drift.

The usual answer is to surround the table with tools: a catalog for discovery, a quality platform for freshness, a governance tool for access, a semantic layer for definitions. Each is a real product solving a real gap.

Each is also a separate system with its own copy of your table and its own clock. That creates one failure that integration cannot remove: **the tools drift away from the table they describe.**

The table is the thing that changes. You rename a column, split a measure, change a grain. The other four systems find out later, each on its own schedule.

{% code title="one rename, four systems, no shared clock" %}

```text
Monday     transform    column renamed in the model
Monday →   quality      check still passes on the OLD shape — green, but wrong
Monday →   governance   policy still guards the OLD column name — protects nothing
Thursday   catalog      re-crawls and finally catches up
```

{% endcode %}

For three days the tools describe a table that no longer exists, and none of them say so. Drift does not fail loudly. It waits until someone trusts a stale definition and ships a wrong decision.
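The timeline above can be reproduced in a few lines. The sketch below is a toy model, not any real tool's API: `ToolSnapshot`, `find_drift`, and the column names are hypothetical, and each surrounding system is assumed to hold nothing more than its own copy of the table's column names.

```python
from dataclasses import dataclass


@dataclass
class ToolSnapshot:
    """A surrounding tool's private copy of the table's column names (hypothetical)."""
    name: str
    columns: set[str]


def find_drift(table_columns: set[str], tools: list[ToolSnapshot]) -> dict[str, set[str]]:
    """Return, per tool, the columns it still references that the table no longer has."""
    drift = {}
    for tool in tools:
        stale = tool.columns - table_columns
        if stale:
            drift[tool.name] = stale
    return drift


# Hypothetical starting state: every tool agrees with the table.
table = {"order_id", "revenue"}
tools = [
    ToolSnapshot("quality", set(table)),
    ToolSnapshot("governance", set(table)),
    ToolSnapshot("catalog", set(table)),
]

# Monday: the transform renames a column. Only the table changes.
table = {"order_id", "net_revenue"}

# Every tool still holds the old name, and none of them raised an error.
drift = find_drift(table, tools)
print(drift)
```

Nothing in this model fails on Monday. The mismatch is visible only to something that compares the copies, and that comparison is exactly the reconciliation work no single tool owns.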

You can fight drift, but only with people. Reconciliation stops being a task and becomes a standing program: tickets, sprints, and a recurring meeting whose only job is keeping four tools in agreement with the table and with each other. Every change you make to the table spawns work in the four systems around it, and it never quite converges.

***

## An agent reads the drift and does not flinch.

A human absorbs drift without noticing. An analyst carries context that never reached a table: whose source is authoritative, that `status=3` was deprecated two years ago, which dashboard is the real one. When the catalog is three days stale, the analyst quietly compensates.

An AI agent carries none of that. It queries the same tables you would, at machine speed, with no one in the loop to wince.

| <p><code>analyst → "what was Q3 revenue?"</code></p><h3>$4.21M</h3><p><mark style="color:green;background-color:green;">net of refunds · authoritative source</mark></p> | <p><code>agent → "what was Q3 revenue?"</code></p><h3>$4.88M</h3><p><mark style="color:red;background-color:red;">gross · deprecated table · no flag</mark></p> |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------- |

The model is not the problem. The data underneath it is. An agent pointed at a drifted stack returns the wrong answer with the same confidence as the right one. The half-product problem was merely expensive while only humans consumed data. It turns dangerous the moment a machine does.

***

## The fix is not another layer. It is one object.

Drift is not caused by bad tools. It is caused by *separation*. Semantics, quality, and access each live in their own system on their own clock, and nothing binds them to the table. A fifth tool adds a fifth clock.

The only way to remove drift is to remove the separation. Author the semantics, the contract, the quality rules, and the access policy *with* the data, as one object. Then they cannot fall out of sync, because there is nothing to sync. One model cannot disagree with itself.

That one object is a **Data Product**.

{% hint style="info" %}
A Data Product is data you can hand to someone, or something, that never met you, and trust them to use it correctly.

Formally: a self-contained, managed unit of data with a named owner, a published contract, named consumers both human and machine, and a versioned lifecycle.
{% endhint %}

The following eight properties define one. Each earns its place by what it removes from your week.

<figure><img src="/files/7l10RpFTZr80nTO6ftr0" alt="The eight properties that define a Data Product"><figcaption></figcaption></figure>

The property that does the real work is the one you cannot see: a change to the data is a change to its definition, because they are the same edit. Drift never gets a window to open.
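A minimal sketch of "the same edit", under stated assumptions: `ColumnSpec`, `DataProduct`, `rename_column`, the owner address, and the column names are all hypothetical and do not represent DataOS's actual format. The point is only structural. When meaning, quality rule, and access policy hang off the column inside one object, a rename moves all three at once.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ColumnSpec:
    """Everything known about one column, authored in one place (hypothetical shape)."""
    meaning: str                      # semantics
    check: Callable[[float], bool]    # quality rule
    allowed_roles: frozenset[str]     # access policy


@dataclass
class DataProduct:
    """A toy stand-in for one authored object with an owner and a version."""
    name: str
    owner: str
    version: str
    columns: dict[str, ColumnSpec] = field(default_factory=dict)

    def rename_column(self, old: str, new: str) -> None:
        """One edit: meaning, check, and policy move with the column."""
        self.columns[new] = self.columns.pop(old)


orders = DataProduct(
    name="orders",
    owner="data-eng@example.com",  # placeholder owner
    version="1.0.0",
    columns={
        "revenue": ColumnSpec(
            meaning="Order revenue, net of refunds",
            check=lambda value: value >= 0,
            allowed_roles=frozenset({"finance"}),
        )
    },
)

# The rename is a single operation; there is no second system to update.
orders.rename_column("revenue", "net_revenue")
spec = orders.columns["net_revenue"]
print(spec.meaning, sorted(spec.allowed_roles))
```

There is no stale copy of `revenue` left anywhere to guard or to check, because the definition was never stored apart from the column.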

To be precise about the boundary, a data product is **not** a raw table dumped in the warehouse, not a dashboard, not a pipeline, and not a query alias that federates sources no one owns. The pipeline is plumbing. The product is the owned, contracted output built on top of it.

***

## This is the next pattern, not a new tool.

The data product is not a feature. It is the third pattern in a sequence you already work inside.

| Pattern | What it did                                                                  | Where it stopped                                                                                                         |
| ------- | ---------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------ |
| **ETL** | Transformed before load, to spare a scarce warehouse.                        | Discarded the raw. Every transform was bet-the-pipeline; a wrong join meant re-ingesting from a source that may be gone. |
| **ELT** | Loaded raw, then transformed in place once storage got cheap.                | Produced a correct table that only its author can safely use. Everything else gets bolted on, and drifts.                |
| **ELP** | Loads raw, then *productizes*: folds the transform into one authored object. | The current pattern. Semantics, contract, policy, and serving are part of the object, not layers around it.              |

The shift in one line: **Productize absorbs Transform.** The transform stops being a phase you surround with tools and becomes a property of the product.

```mermaid

%%{init: {
  "themeVariables": {
    "background": "#FFFFFF",
    "primaryTextColor": "#242422",
    "lineColor": "#242422",
    "fontFamily": "Neue Montreal, sans-serif"
  }
}}%%

flowchart LR
    E[Extract] --> L[Load raw]
    L --> P

    subgraph P["Productize: one authored object "]
        direction TB
        SEM[Semantic model<br/>measures, contracts]
        PHY[Physical assets<br/>tables, views]
        GOV[Governance, SLAs, serving<br/>API + AI]

        SEM <--> PHY
        PHY --- GOV
    end

    classDef stage fill:#FFFFFF,stroke:#242422,color:#242422,stroke-width:1.5px;
    classDef product fill:#EDE9E5,stroke:#54DED1,color:#242422,stroke-width:1.5px;
    classDef semantic fill:#EDE9E5,stroke:#009293,color:#242422,stroke-width:1.5px;
    classDef physical fill:#D6CDC6,stroke:#35505B,color:#242422,stroke-width:1.5px;
    classDef governance fill:#EDE9E5,stroke:#733635,color:#242422,stroke-width:1.5px;

    class E,L stage;
    class SEM semantic;
    class PHY physical;
    class GOV governance;

    style P fill:#FFFFFF,stroke:#54DED1,stroke-width:2px,color:#242422;

    linkStyle 0 stroke:#242422,stroke-width:2px;
    linkStyle 1 stroke:#242422,stroke-width:2px;
    linkStyle 2 stroke:#009293,stroke-width:1.8px;
    linkStyle 3 stroke:#35505B,stroke-width:1.6px;
```

ELP is the pattern. DataOS is one implementation of it.

***

## What it asks of you, and what it costs.

The discipline is not new. It is software engineering, applied to data, and you already use it on code.

| You already do this in software                | A data product asks the same of data                                                                           |
| ---------------------------------------------- | -------------------------------------------------------------------------------------------------------------- |
| Write a spec before the code.                  | Write a product spec first: consumers, the questions it answers, what is out of scope.                         |
| Publish an interface, hide the implementation. | Declare schema, semantics, and SLA first. Consumers bind to the contract, not to storage paths.                |
| Tests gate the merge.                          | Quality runs in the build and blocks publication. Bad data is caught, not alerted on afterward.                |
| Version the API; deprecate, do not break.      | Schema and meaning change through versions, so consumers, including models in production, are never surprised. |
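The last two rows of the table can be sketched in plain Python. This is an illustration, not DataOS's build system: `publish`, `PublicationBlocked`, `is_breaking`, and `next_version` are hypothetical names, and the major/minor versioning rule is an assumption borrowed from semantic versioning.

```python
class PublicationBlocked(Exception):
    """Raised when a quality rule fails, so the product is not published."""


def publish(rows, rules):
    """Run every rule over every row; publish only if all rules pass."""
    failures = [
        name for name, rule in rules.items()
        if not all(rule(row) for row in rows)
    ]
    if failures:
        raise PublicationBlocked(f"failed rules: {failures}")
    return {"status": "published", "row_count": len(rows)}


def is_breaking(old_schema, new_schema):
    """A change is breaking if an existing column is removed or retyped."""
    return any(new_schema.get(col) != dtype for col, dtype in old_schema.items())


def next_version(version, old_schema, new_schema):
    """Assumed rule: breaking change bumps major, additive change bumps minor."""
    major, minor, _patch = (int(part) for part in version.split("."))
    if is_breaking(old_schema, new_schema):
        return f"{major + 1}.0.0"
    if new_schema != old_schema:
        return f"{major}.{minor + 1}.0"
    return version


# Hypothetical rule and rows.
rules = {"revenue_non_negative": lambda row: row["revenue"] >= 0}
good_rows = [{"order_id": 1, "revenue": 42.0}]
bad_rows = [{"order_id": 2, "revenue": -5.0}]

print(publish(good_rows, rules))
try:
    publish(bad_rows, rules)
except PublicationBlocked as err:
    print("blocked:", err)

# Renaming a column removes the old name, so it counts as breaking.
print(next_version("1.2.0", {"revenue": "float"}, {"net_revenue": "float"}))
```

The gate raises instead of logging, which is the difference between catching bad data and alerting on it afterward. The version bump makes a rename visible to consumers before they bind to it.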

You deserve to be told the cost, not just the benefit.

* **It is more work up front.** A spec and a contract come before the pipeline. There is no product without them. The payoff is later and larger: the reconciliation program disappears, because there are no longer four systems to reconcile.
* **It does not rescue a bad model.** Modeling is still the craft. A product wrapped around a confused schema is a confused product with a contract.
* **It will reject things you call data products today.** A federated alias over sources you do not own has nothing durable for a contract to bind to. That is a query, not a product.
* **A product built for a dashboard is not automatically fit for an agent.** Data summarized for human reading often lacks the detail a model needs. Fitness is per consumer, and it is designed, not inherited.

***

## How DataOS builds them

DataOS builds and runs data products on your data where it already lives. It runs over the warehouse, lakehouse, or catalog you already own, so you can start with one product on one source rather than a replatform.

Building a data product in DataOS follows three stages.

{% columns %}
{% column %}
`01` **Discover**

Find what data exists, inspect its quality and lineage, and decide whether to build on existing sources or connect new ones.
{% endcolumn %}

{% column %}
`02` **Productize**

Build it: transformations, validation, semantic models, versioned outputs, and CI/CD, with data and outputs kept in the engine.
{% endcolumn %}

{% column %}
`03` **Consume**

Expose one contract across SQL, REST, GraphQL, BI protocols, Python SDKs, and MCP tools for agents.
{% endcolumn %}
{% endcolumns %}

***

You were not asked to adopt a tool. You were asked to ship the contract along with the table, so the next person, or the next agent, can use your work without first finding you.

## Next steps

* [Data Product Journey](/data-product-journey-v2.md): Follow one business question through all three stages as a worked example.
* [For Builders](https://v2.dataos.info/build/): Build the Orders Analytics product end to end.
* [For Consumers](https://v2.dataos.info/consume/): Consume a product across SQL, APIs, BI, and agents.
* [For Operators](https://v2.dataos.info/operate/): Operate products with ongoing trust, quality, and governance.

