> For the complete documentation index, see [llms.txt](https://v2.dataos.info/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://v2.dataos.info/build/stage-2-productize/define-models-and-logic/data-models/python-models.md).

# Python models

{% hint style="warning" %}
**Supported model kinds:** `FULL`, `INCREMENTAL_BY_TIME_RANGE`, `INCREMENTAL_BY_UNIQUE_KEY`, `INCREMENTAL_BY_PARTITION`

Python models do **not** support `VIEW`, `SEED`, `EMBEDDED`, or `MANAGED` kinds. Use SQL models for those.
{% endhint %}

Python models let you write transformation logic in Python instead of SQL. They are useful when your transformation is too complex for SQL alone, requires Python libraries (pandas, numpy, scikit-learn), or benefits from Python control flow.

Python models use the same `@model` decorator as a DDL block and an `execute` function as the DML.

***

## Minimal example

{% code overflow="wrap" %}

```python
import pandas as pd
from vulcan import ExecutionContext, model

@model(
    "analytics.daily_sales_py",
    columns={
        "order_date": "date",
        "total_orders": "int",
        "total_revenue": "decimal(18,2)",
    },
    kind="FULL",
    grains=["order_date"],
    depends_on=["raw.raw_orders"],
)
def execute(context: ExecutionContext, **kwargs) -> pd.DataFrame:
    query = """
    SELECT
      CAST(order_date AS DATE)          AS order_date,
      COUNT(order_id)::INTEGER          AS total_orders,
      SUM(total_amount)::DECIMAL(18,2)  AS total_revenue
    FROM raw.raw_orders
    GROUP BY order_date
    ORDER BY order_date
    """
    return context.fetchdf(query)
```

{% endcode %}

***

## Model decorator properties

The `@model` decorator takes the same metadata as the SQL `MODEL` block. Pass it as keyword arguments.

| Parameter                     | Description                                                                                                          |
| ----------------------------- | -------------------------------------------------------------------------------------------------------------------- |
| `name` (first positional arg) | Fully-qualified model name (`schema.table`).                                                                         |
| `columns`                     | Dict of `column_name → SQL type`. **Required** for Python models since Vulcan cannot infer types from the DataFrame. |
| `kind`                        | Materialization kind as a string or `dict` for parameterized kinds.                                                  |
| `grains`                      | List of grain columns.                                                                                               |
| `depends_on`                  | List of upstream model references. Required when the dependency is not obvious from the SQL inside the function.     |
| `cron`                        | Schedule string (e.g. `'@daily'`).                                                                                   |
| `start`                       | Backfill start date string.                                                                                          |
| `description`                 | Model description.                                                                                                   |
| `tags`                        | List of tag strings.                                                                                                 |
| `terms`                       | List of glossary term strings.                                                                                       |
| `owner`                       | Owner string.                                                                                                        |
| `column_descriptions`         | Dict of `column_name → description`.                                                                                 |
| `column_tags`                 | Dict of `column_name → list of tags`.                                                                                |

***

## INCREMENTAL\_BY\_TIME\_RANGE

For incremental models, pass the `kind` as a dict and use the `start` and `end` parameters in `execute`:

{% code overflow="wrap" %}

```python
from vulcan import ExecutionContext, model, ModelKindName

@model(
    "analytics.incremental_sales_py",
    columns={
        "order_date": "date",
        "product_id": "int",
        "total_sales": "decimal(10,2)",
    },
    kind=dict(
        name=ModelKindName.INCREMENTAL_BY_TIME_RANGE,
        time_column="order_date",
    ),
    grains=["order_date", "product_id"],
    depends_on=["raw.raw_orders", "raw.raw_products"],
    start="2024-01-01",
    cron="@daily",
)
def execute(context: ExecutionContext, start, end, **kwargs):
    query = f"""
    SELECT
      CAST(o.order_date AS DATE)   AS order_date,
      p.product_id,
      SUM(oi.quantity * oi.unit_price) AS total_sales
    FROM raw.raw_orders AS o
    JOIN raw.raw_order_items AS oi ON o.order_id = oi.order_id
    JOIN raw.raw_products AS p ON oi.product_id = p.product_id
    WHERE o.order_date BETWEEN '{start}' AND '{end}'
    GROUP BY o.order_date, p.product_id
    """
    return context.fetchdf(query)
```

{% endcode %}

`start` and `end` are automatically injected by Vulcan for the interval being processed.

***

## INCREMENTAL\_BY\_UNIQUE\_KEY

```python
@model(
    "analytics.customers_py",
    columns={
        "customer_id": "int",
        "total_spent": "decimal(10,2)",
        "last_order_date": "date",
    },
    kind=dict(
        name=ModelKindName.INCREMENTAL_BY_UNIQUE_KEY,
        unique_key=["customer_id"],
    ),
    grains=["customer_id"],
    depends_on=["raw.raw_orders", "raw.raw_customers"],
)
def execute(context: ExecutionContext, **kwargs):
    query = """
    SELECT
      c.customer_id,
      COALESCE(SUM(oi.quantity * oi.unit_price), 0) AS total_spent,
      MAX(o.order_date)                              AS last_order_date
    FROM raw.raw_customers AS c
    LEFT JOIN raw.raw_orders AS o ON c.customer_id = o.customer_id
    LEFT JOIN raw.raw_order_items AS oi ON o.order_id = oi.order_id
    GROUP BY c.customer_id
    """
    return context.fetchdf(query)
```

***

## FULL model

```python
@model(
    "analytics.product_summary_py",
    columns={
        "product_id": "int",
        "product_name": "string",
        "category": "string",
        "total_sales": "decimal(10,2)",
    },
    kind=dict(name=ModelKindName.FULL),
    grains=["product_id"],
    depends_on=["raw.raw_products", "raw.raw_order_items", "raw.raw_orders"],
)
def execute(context: ExecutionContext, **kwargs):
    query = """
    SELECT
      p.product_id,
      p.name       AS product_name,
      p.category,
      COALESCE(SUM(oi.quantity * oi.unit_price), 0) AS total_sales
    FROM raw.raw_products AS p
    LEFT JOIN raw.raw_order_items AS oi ON p.product_id = oi.product_id
    LEFT JOIN raw.raw_orders AS o ON oi.order_id = o.order_id
    GROUP BY p.product_id, p.name, p.category
    ORDER BY total_sales DESC
    """
    return context.fetchdf(query)
```

***

## ExecutionContext

`context` is the interface between your Python function and the warehouse. Use it to fetch data and run SQL:

| Method                   | Description                                                      |
| ------------------------ | ---------------------------------------------------------------- |
| `context.fetchdf(query)` | Execute a SQL query and return the result as a pandas DataFrame. |
| `context.execute(query)` | Execute a SQL statement without returning data (DDL, DML).       |
| `context.var("name")`    | Access a project variable defined in `config.yaml`.              |

```python
def execute(context: ExecutionContext, **kwargs):
    schema = context.var("warehouse_schema")
    query = f"SELECT * FROM {schema}.orders LIMIT 100"
    return context.fetchdf(query)
```

***

## Returning a DataFrame

The `execute` function must return a pandas DataFrame. Vulcan writes the DataFrame to the warehouse automatically.

* Column names in the DataFrame must match the names declared in `columns`.
* Column types are coerced to the types declared in `columns`.
* Return an empty DataFrame to produce an empty table; don't return `None`.

***

## When to use Python over SQL

Use Python models when you need to:

* Apply machine learning models or statistical functions not available in SQL.
* Use Python libraries (e.g. scikit-learn, scipy, geopy).
* Handle complex control flow that SQL cannot express cleanly.
* Read from external APIs or files and load results into the warehouse.

For everything else, prefer SQL models. They are simpler, easier to debug, and run entirely in the warehouse without Docker overhead.

**Note on orders-analytics:** All transformation logic in `orders-analytics` uses SQL models. The RFM segmentation scoring, customer profiling, and funnel analysis are all expressible in standard SQL with CASE expressions and CTEs. Python macros (`safe_ratio`, `revenue_order_filter`) are used for reusable SQL generation, but the models themselves are `.sql` files. Reach for Python models only when your logic genuinely requires Python libraries or dynamic control flow.

***

## Limitations

* Python models do not support `EMBEDDED` or `SEED` kinds.
* Python models do not support `VIEW` kind (use SQL for views).
* Python models do not support `MANAGED` kind.
* All dependencies must be listed in `depends_on` explicitly; Vulcan cannot parse Python code for implicit dependencies.


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter, and the optional `goal` query parameter:

```
GET https://v2.dataos.info/build/stage-2-productize/define-models-and-logic/data-models/python-models.md?ask=<question>&goal=<endgoal>
```

`ask` is the immediate question: it should be specific, self-contained, and written in natural language.
`goal` is optional and describes the broader end goal you are ultimately trying to accomplish on behalf of the user. GitBook uses it to tailor the answer towards what is most useful for that goal.

The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
