> For the complete documentation index, see [llms.txt](https://v2.dataos.info/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://v2.dataos.info/concepts/resources/nilus/pipeline-optimization/optimize-sink-datasets/optimize-shape-knobs.md).

# Shape Knobs

*Part of* [*Optimize Sink Datasets*](/concepts/resources/nilus/pipeline-optimization/optimize-sink-datasets.md)*.*

These knobs decide *how* the destination dataset is laid out: partitions, clustering, nesting. They affect downstream query performance, not pipeline runtime.

## `partition_by` (sink)

| Goes under     | Default | Type                       |
| -------------- | ------- | -------------------------- |
| `sink.options` | unset   | array of partition objects |

Physical partitioning for query pruning. Most useful on lakehouse-style destinations (Iceberg, DataOS Lakehouse) where consumers filter by date/region/customer.

Each partition object accepts:

| Field          | Required    | Description                                                                    |
| -------------- | ----------- | ------------------------------------------------------------------------------ |
| `column`       | Yes         | Source column to derive the partition from.                                    |
| `type`         | Yes         | One of `year`, `month`, `day`, `hour`, `bucket`, `identity`.                   |
| `name`         | Yes         | Display name for the partition (used in the destination's metadata catalogue). |
| `index`        | Yes         | Integer position (1-based) for ordering when multiple partitions are defined.  |
| `bucket_count` | Conditional | Required when `type: bucket`; ignored otherwise.                               |

### Example, single date rule

```yaml
spec:
  sink:
    address: dataos://lakehouse
    options:
      dest_table: analytics.transactions_raw
      incremental_strategy: append
      partition_by:
        - column: transaction_date
          type: day
          name: transaction_day
          index: 1
```

### Example, date + bucket rule

```yaml
spec:
  sink:
    address: dataos://lakehouse
    options:
      dest_table: analytics.orders_raw
      incremental_strategy: append
      partition_by:
        - column: order_date
          type: year
          name: order_year
          index: 1
        - column: country
          type: bucket
          name: country_bucket
          bucket_count: 16
          index: 2
```

### Anti-patterns

* **Partition explosion.** `type: identity` on a high-cardinality column (e.g. `user_id` on a 10M-user table) creates 10M partitions and breaks the destination's metadata catalogue. Use `type: bucket` with a sensible `bucket_count` instead.
* **Mismatched cardinality between partition and `loader_file_size`.** A `type: hour` partition on a low-volume table produces thousands of tiny files (one per hour, per loader-file split). Pick a coarser partition (`day` or `month`) or accept that small files are the norm for low-volume sources.
* **Adding partitions before confirming the access pattern.** Add a partition only when consumers repeatedly filter by that column. Otherwise the partition adds write-side cost with no read-side benefit.

## `cluster_by` (sink)

| Goes under     | Default | Type             |
| -------------- | ------- | ---------------- |
| `sink.options` | unset   | array of strings |

Secondary physical organization. Different destinations interpret this differently:

* **Snowflake**: maps to `CLUSTER BY (...)` on the destination table. Snowflake auto-clusters in the background. No DDL action required. The auto-clustering service uses warehouse credits.
* **BigQuery**: maps to clustered-table DDL when the table is created. BigQuery does **not** re-cluster existing tables in place. Re-clustering after-the-fact requires a `CREATE TABLE AS SELECT` rebuild.
* **Databricks**: maps to `CLUSTER BY (...)` when supported by the table format (Delta v3+ liquid clustering).
* **Other warehouses (Redshift / Synapse / MS SQL Server)**: typically no-op or destination-specific behavior. Check the destination connector page.

Start with one frequently-filtered or join column. Use a second column only if the destination and workload clearly justify it.

```yaml
spec:
  sink:
    address: snowflake://user:pass@account/db?warehouse=COMPUTE_WH&role=ANALYST
    options:
      dest_table: raw.page_views
      incremental_strategy: append
      cluster_by:
        - account_id
```

## `max_table_nesting` (source)

| Goes under       | Default                                                   | Type                             |
| ---------------- | --------------------------------------------------------- | -------------------------------- |
| `source.options` | `"0"` for SQL sources; varies for nested-document sources | **string** (`"0"`, `"1"`, `"2"`) |

Controls how aggressively Nilus flattens nested data structures (JSON columns, MongoDB documents, Binary JSON (BSON), JSON-typed responses from REST sources).

| Value | Meaning                                                                                                                                        |
| ----- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
| `"0"` | Fully flatten. Nested objects become parent rows with dotted column names (`customer__address__city`). Recommended for SQL/BI-facing datasets. |
| `"1"` | Preserve one level of nesting. Top-level objects stay as JSON columns.                                                                         |
| `"2"` | Preserve two levels. Useful when downstream consumers expect partial nesting.                                                                  |

{% hint style="warning" %}
**Type matters**

The stack schema declares `max_table_nesting` as a string. Always quote the value (`"0"`, not `0`). An integer literal will be rejected by the JSON-schema validator.
{% endhint %}

## Example, flatten a MongoDB collection for SQL access

```yaml
spec:
  source:
    address: mongodb://user:pass@host:27017
    options:
      source_table: sample.users
      max_table_nesting: "0"
  sink:
    address: postgresql://user:pass@host:5432/warehouse
    options:
      dest_table: raw.users
      incremental_strategy: append
```


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter, and the optional `goal` query parameter:

```
GET https://v2.dataos.info/concepts/resources/nilus/pipeline-optimization/optimize-sink-datasets/optimize-shape-knobs.md?ask=<question>&goal=<endgoal>
```

`ask` is the immediate question: it should be specific, self-contained, and written in natural language.
`goal` is optional and describes the broader end goal you are ultimately trying to accomplish on behalf of the user. GitBook uses it to tailor the answer towards what is most useful for that goal.

The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
