> For the complete documentation index, see [llms.txt](https://v2.dataos.info/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://v2.dataos.info/concepts/resources/nilus/pipeline-optimization/optimize-sink-datasets.md).

# Optimize Sink Datasets

This guide is the customer-facing reference for tuning Nilus pipelines so the **destination dataset** lands fast, lands cleanly, and stays cheap to query later. It covers every knob the Nilus stack schema accepts on `source.options` and `sink.options`, plus the runtime environment variables that are critical for high-volume loads.

It does **not** cover destination-native maintenance, Iceberg compaction, Snowflake auto-clustering jobs, BigQuery materialised view refreshes, Redshift `VACUUM`, etc. Those are scheduled by the destination owner, not by the Nilus pipeline.

## When NOT to optimize

For most pipelines, the defaults are correct. **Tune only when you see a concrete symptom**, not pre-emptively:

* Run is slow → look at `loader_file_size`, `loader_file_format`, `extract_parallelism`, and `LOAD__WORKERS`.
* Run runs out of memory → look at `page_size` and `extract_parallelism`.
* Destination has too many small files → raise `loader_file_size`.
* Destination is hard to query / scans too much → look at `partition_by`, `cluster_by`, `max_table_nesting`.
* Repeated runs duplicate rows or fail to refresh → look at `incremental_strategy`, `incremental_key`, `primary_key`.

If the run is fast enough, the destination is queryable, and your bills are reasonable, leave the configuration alone.

## The knob taxonomy

Nilus exposes 20-plus options across `source.options`, `sink.options`, and pipeline runtime environment variables. They group into four jobs:

| Category        | Job                                                                   | Knobs                                                                                                                                                               |
| --------------- | --------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Correctness** | Get the right rows into the right table of the destination            | `incremental_strategy`, `incremental_key`, `primary_key`, `type_hints`, `interval_start`, `interval_end`, `mask`, `full_refresh`                                    |
| **Shape**       | Decide how the destination dataset is physically laid out             | `partition_by`, `cluster_by`, `max_table_nesting`                                                                                                                   |
| **Throughput**  | Decide how fast and how cheaply the run completes                     | `loader_file_format`, `loader_file_size`, `page_size`, `extract_parallelism`, `staging_bucket`, `LOAD__WORKERS`, `NORMALIZE__WORKERS`, `LOAD__PARALLELISM_STRATEGY` |
| **Sampling**    | Limit extraction volume for testing, validation, or partial backfills | `sql_limit`, `yield_limit`, `sql_exclude_columns`, `sql_reflection_level`                                                                                           |

Every knob below is documented with its placement (`source.options` / `sink.options` / env var), default, valid values, and the symptom it solves.

{% hint style="warning" %}
**Stack-schema strictness**

`sink.options` is **`additionalProperties: false`** in the Nilus stack schema. Any key that is not in the documented sink-options list is rejected at YAML validation. Do not pass arbitrary low-level loader options through `sink.options`, if you need a new knob, request it.

`source.options` is `additionalProperties: true`. Source-side extras are accepted by the validator and passed through to the underlying source builder, but the **Nilus source wrappers only forward the documented kwargs**: anything else is silently ignored. Stick to the list below.
{% endhint %}

## Detailed knob reference

Each knob is documented with its placement (`source.options` / `sink.options` / env var), default, valid values, and the symptom it solves.

* [Correctness Knobs](/concepts/resources/nilus/pipeline-optimization/optimize-sink-datasets/optimize-correctness-knobs.md): get the right rows into the destination: `incremental_strategy`, `incremental_key`, `primary_key`, `type_hints`, `interval_start` / `interval_end`, `mask`, `full_refresh`.
* [Shape Knobs](/concepts/resources/nilus/pipeline-optimization/optimize-sink-datasets/optimize-shape-knobs.md): lay out the destination dataset: `partition_by`, `cluster_by`, `max_table_nesting`.
* [Throughput Knobs](/concepts/resources/nilus/pipeline-optimization/optimize-sink-datasets/optimize-throughput-knobs.md): make runs fast and cheap: `loader_file_format`, `loader_file_size`, `page_size`, `extract_parallelism`, `staging_bucket`, `LOAD__WORKERS`, `NORMALIZE__WORKERS`, `LOAD__PARALLELISM_STRATEGY`.
* [Sampling Knobs](/concepts/resources/nilus/pipeline-optimization/optimize-sink-datasets/optimize-sampling-knobs.md): cap extraction for testing and backfills: `sql_limit`, `yield_limit`, `sql_exclude_columns`, `sql_reflection_level`.
* [Destination Gotchas & Troubleshooting](/concepts/resources/nilus/pipeline-optimization/optimize-sink-datasets/optimize-destination-gotchas-and-troubleshooting.md): per-destination caveats, the troubleshooting matrix, and the stack-schema reference.

## Recommended profiles

Three starting points. Tune within each profile's range to match your runtime CPU/RAM and source limits.

### Faster, throughput-oriented

For large lakehouse / warehouse loads where wall-clock matters more than resource cost.

```yaml
spec:
  source:
    address: dataos://postgres-depot
    options:
      source_table: public.events
      page_size: 100000
      extract_parallelism: 8
  sink:
    address: dataos://lakehouse
    options:
      dest_table: analytics.events_raw
      incremental_strategy: append
      loader_file_format: parquet
      loader_file_size: 300000
  use:
    projection:
      projections:
        envVars:
          - key: LOAD__WORKERS
            template: "8"
          - key: NORMALIZE__WORKERS
            template: "2"
          - key: LOAD__PARALLELISM_STRATEGY
            template: "parallel"
```

### Lower-resources, memory/CPU-conservative

For shared runtimes, tight pod budgets, or pipelines that previously OOMKilled.

```yaml
spec:
  source:
    address: dataos://postgres-depot
    options:
      source_table: public.events
      page_size: 25000
      extract_parallelism: 1
  sink:
    address: dataos://lakehouse
    options:
      dest_table: analytics.events_raw
      incremental_strategy: append
      loader_file_format: parquet
      loader_file_size: 500000
  use:
    projection:
      projections:
        envVars:
          - key: LOAD__WORKERS
            template: "1"
          - key: NORMALIZE__WORKERS
            template: "1"
          - key: LOAD__PARALLELISM_STRATEGY
            template: "sequential"
```

### Balanced, default starting point

The shape of most well-behaved pipelines.

```yaml
spec:
  source:
    address: dataos://postgres-depot
    options:
      source_table: public.events
      page_size: 50000           # default
      extract_parallelism: 5     # default
  sink:
    address: dataos://lakehouse
    options:
      dest_table: analytics.events_raw
      incremental_strategy: append
      loader_file_format: parquet
      loader_file_size: 150000
  use:
    projection:
      projections:
        envVars:
          - key: LOAD__WORKERS
            template: "5"
          - key: NORMALIZE__WORKERS
            template: "1"
          - key: LOAD__PARALLELISM_STRATEGY
            template: "parallel"
```

## Sizing playbook

Step-by-step process for a new pipeline running on unfamiliar data.

1. **Baseline.** Run once with defaults, conservative compute. Record runtime, peak CPU/memory, network throughput, destination file count. If the run gets OOMKilled, your priority is memory mitigations (lower `extract_parallelism`, lower `LOAD__WORKERS`, lower `page_size` to 25k, keep `loader_file_size` large).
2. **Size files first.** Compute `row_bytes ≈ raw_table_size_bytes / total_rows`. Pick a target Parquet file size of 128–512 MB. `loader_file_size ≈ target_file_bytes / row_bytes`. Sanity-check `expected_files = ceil(total_rows / loader_file_size)`, aim for 20–200 files per very large table.
3. **Bound memory via `page_size`.** Estimate page memory as `page_size × row_bytes`. Choose `page_size` so peak RSS stays well under pod memory. Start at 50k–100k; drop to 25k–50k for OOM-prone or wide-row pipelines.
4. **Set parallelism from available CPU.** `extract_parallelism` and `LOAD__WORKERS` near 60–80% of allocatable vCPU is a good starting point. Keep `NORMALIZE__WORKERS` low (1–2). Use `LOAD__PARALLELISM_STRATEGY: parallel` for lakehouse and most warehouses; `sequential` for fragile or strictly-locked destinations.
5. **Validate on a slice.** Use `sql_limit` or a tight `interval_start`/`interval_end` window to confirm sizing, memory headroom, and file count on a partial run before scaling to the full dataset.
6. **Iterate by bottleneck:**
   * Long commit tail → raise `loader_file_size`.
   * Low CPU, long runtime → raise `extract_parallelism` and/or `LOAD__WORKERS`.
   * Source throttling → lower `extract_parallelism` or `page_size`.
   * Memory spikes → lower parallelism first; only then lower `page_size`.
7. **Add partitioning last.** Choose `partition_by` aligned with consumer query patterns. Confirm each partition still hits a healthy file size, high-cardinality partitions that produce tiny files cost more than they save.

## Anti-patterns

* **Using `merge` without a reliable `primary_key`**: duplicates rows or fails outright.
* **Putting optimisation options under the wrong section**: e.g., `incremental_strategy` under `source.options`. The stack schema rejects unknown sink-options strictly (`additionalProperties: false`); source-side errors are more subtle (silently ignored extras).
* **Adding multiple `partition_by` rules pre-emptively**: partitions cost write-side performance for no read-side benefit until consumers actually filter on them.
* **Preserving deep nesting (`max_table_nesting: "2"`+) on SQL/BI-facing datasets**: most BI tools struggle with nested columns.
* **Increasing `extract_parallelism` aggressively before validating source stability**: the source can throttle silently and degrade extraction quality without obvious errors.
* **Reducing `loader_file_size` to "spread the load"**: typically produces hundreds of tiny files, which slows commits and destination metadata.
* **Setting `full_refresh: true` on a scheduled pipeline**: it forces a full re-extract every run; intended for one-off resets only.
* **Treating `incremental_strategy: replace` and `full_refresh: true` as equivalent**: they are not. See the `full_refresh` knob in [Correctness Knobs](/concepts/resources/nilus/pipeline-optimization/optimize-sink-datasets/optimize-correctness-knobs.md).
* **Tuning configuration to compensate for under-sized compute**: config can reduce overhead, but cannot overcome insufficient CPU/RAM/IO. Right-size the runtime first.

## Related docs

* [Tuning Large Lakehouse (Iceberg) Loads](/concepts/resources/nilus/pipeline-optimization/optimize-sink-datasets/optimize-lakehouse-iceberg-loads.md): benchmark-backed configurations for large Iceberg loads, file-size decisions, volume setup, and table maintenance.
* [Understanding Batch Pipeline Config](/concepts/resources/nilus/batch/pipeline-config.md): full batch pipeline reference.
* [Understanding CDC Pipeline Config](/concepts/resources/nilus/cdc/service-config.md): full CDC pipeline reference.
* [Understanding Stream Pipeline Config](/concepts/resources/nilus/stream.md): full stream pipeline reference.
* [Batch Sample Configs](/concepts/resources/nilus/batch/sample-configs.md), [CDC Sample Configs](/concepts/resources/nilus/cdc/sample-configs.md), and [Stream Sample Configs](/concepts/resources/nilus/stream/sample-configs.md): copy-pasteable starting points.
* [Working with PostgreSQL Partitioned Tables for CDC](/concepts/resources/nilus/troubleshooting/postgresql-cdc-partitioned-tables.md): connector-specific runbook for PG partitioned sources.
* [Snowflake Key-Pair Authentication](/concepts/resources/nilus/troubleshooting/snowflake-key-pair-authentication.md): connector-specific runbook for Snowflake auth.
* [Sources](/concepts/resources/nilus/sources.md) and [Destinations](/concepts/resources/nilus/destinations.md): per-connector pages with their own troubleshooting and incremental-strategy support tables.


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter, and the optional `goal` query parameter:

```
GET https://v2.dataos.info/concepts/resources/nilus/pipeline-optimization/optimize-sink-datasets.md?ask=<question>&goal=<endgoal>
```

`ask` is the immediate question: it should be specific, self-contained, and written in natural language.
`goal` is optional and describes the broader end goal you are ultimately trying to accomplish on behalf of the user. GitBook uses it to tailor the answer towards what is most useful for that goal.

The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
