> For the complete documentation index, see [llms.txt](https://v2.dataos.info/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://v2.dataos.info/concepts/resources/nilus/pipeline-optimization/optimize-sink-datasets/optimize-throughput-knobs.md).

# Throughput Knobs

*Part of* [*Optimize Sink Datasets*](/concepts/resources/nilus/pipeline-optimization/optimize-sink-datasets.md)*.*

These knobs decide *how quickly and how cheaply* a run completes. Tune these only after you have a concrete bottleneck. Pre-emptive tuning is rarely worth it.

## `loader_file_format` (sink)

| Goes under     | Default                              | Valid values                               |
| -------------- | ------------------------------------ | ------------------------------------------ |
| `sink.options` | conditional auto-default (see below) | `parquet`, `jsonl`, `csv`, `insert_values` |

Controls the staging file format the loader writes before committing to the destination.

**Auto-default behavior:** if `loader_file_format` is unset, Nilus picks `parquet` automatically when the destination scheme is one of `duckdb`, `snowflake`, `databricks`, `synapse`, `s3`. For every other destination (BigQuery, DataOS Lakehouse, ClickHouse, Redshift, MS SQL Server, MongoDB, Elasticsearch, AWS S3 file-system, PostgreSQL, MySQL), the loader uses the destination's native default, frequently JSONL or insert-values, which is materially slower for analytic workloads.

{% hint style="info" %}
**Set `loader_file_format: parquet` explicitly when writing to BigQuery or DataOS Lakehouse.** Both are happy with parquet; both default to slower formats otherwise.
{% endhint %}

| Format          | Use it for                                                                              | Notes                                                                                                                                                     |
| --------------- | --------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `parquet`       | Lakehouse (DataOS Lakehouse, S3, GCS), Snowflake, BigQuery, Databricks, Synapse, DuckDB | Columnar, compressed, far smaller and faster than JSONL for analytic targets.                                                                             |
| `jsonl`         | Document-oriented destinations (MongoDB, Elasticsearch); JSON-returning REST sources    | Preserves nested structures losslessly.                                                                                                                   |
| `csv`           | DuckDB, S3 (CSV consumer downstream)                                                    | Verbose; only when downstream tools require CSV.                                                                                                          |
| `insert_values` | Relational destinations (PostgreSQL, MySQL, MS SQL Server, Redshift)                    | Inline `INSERT … VALUES` statements. Fine for small tables; use `parquet` + warehouse external-stage for large tables on Snowflake / BigQuery / Redshift. |

{% hint style="info" %}
**BigQuery + JSON-returning source caveat.** When the destination is BigQuery and the source is a JSON-returning connector (currently: `notion`), Nilus disables parquet auto-default to avoid a BigQuery/parquet schema-evolution edge case. Set `loader_file_format` explicitly only if you want to override that fallback.
{% endhint %}

## `loader_file_size` (sink)

| Goes under     | Default                  | Type    |
| -------------- | ------------------------ | ------- |
| `sink.options` | `100000` (rows per file) | integer |

How many rows the loader writes per output file before rotating to the next one.

Controls the loader's per-file row count (the rotation threshold). The job-count metric you see in logs is "X pages of `page_size` rows". The file count is `ceil(total_rows / loader_file_size)`.

| Symptom                                                           | Direction                    | Why                                             |
| ----------------------------------------------------------------- | ---------------------------- | ----------------------------------------------- |
| Long commit tail (jobs at 100% but the run continues for minutes) | **Raise** `loader_file_size` | Fewer files → fewer Iceberg/manifest commits.   |
| Single-file retries are slow on transient network failures        | **Lower** `loader_file_size` | Each retry re-uploads less data.                |
| Destination has thousands of tiny files                           | **Raise** `loader_file_size` | One target Parquet file should be \~128–512 MB. |

### Sizing heuristic

1. Estimate per-row size in bytes: `row_bytes ≈ raw_table_size_bytes / total_rows`.
2. Pick a target Parquet file size: 128–512 MB is healthy on object storage.
3. `loader_file_size ≈ target_file_bytes / row_bytes`.
4. Sanity-check: `expected_files ≈ ceil(total_rows / loader_file_size)`. Aim for 20–200 files per very large table to keep commit time low.

For wide rows (1 KB+), 200k–350k rows per file is a good starting point.

## `page_size` (source)

| Goes under       | Default                 | Type    |
| ---------------- | ----------------------- | ------- |
| `source.options` | `50000` (rows per page) | integer |

How many rows are buffered in memory per extraction page before being flushed to the loader.

Controls the in-memory writer buffer (rows held before flush). Despite the name, it does **not** control source-side pagination for SQL sources (the SQL backend uses its own chunking). It controls the **writer buffer**, which is the dominant memory consumer during extraction.

| Symptom                                        | Direction                                                                                        | Why                                                   |
| ---------------------------------------------- | ------------------------------------------------------------------------------------------------ | ----------------------------------------------------- |
| Pipeline is `OOMKilled`                        | **Lower** `page_size` (try `25000` or `10000`)                                                   | Smaller buffer per page.                              |
| Source is hammered with too many small queries | **Raise** `page_size` (try `100000`+)                                                            | Fewer, larger reads.                                  |
| Wide rows (1 KB+) blow the runtime memory      | **Lower** `page_size` to keep `page_size × row_bytes` under a comfortable fraction of pod memory | Page memory bound is roughly `page_size × row_bytes`. |

Tuning range: `10000` to `150000` is normal. Going higher requires a chunky source (slow connection setup) and ample runtime memory.

## `extract_parallelism` (source)

| Goes under       | Default | Type    |
| ---------------- | ------- | ------- |
| `source.options` | `5`     | integer |

Number of concurrent extraction workers per pipeline.

Controls the extractor's worker count and per-worker parallelism.

| Symptom                                          | Direction                                             | Why                                    |
| ------------------------------------------------ | ----------------------------------------------------- | -------------------------------------- |
| CPU is low, run is long                          | **Raise** `extract_parallelism` toward available vCPU | More concurrent reads.                 |
| Source database throttles or rejects connections | **Lower** `extract_parallelism`                       | Fewer concurrent reads.                |
| Memory spikes during extraction                  | **Lower** `extract_parallelism`                       | Each worker holds its own page buffer. |

Tuning range: `2` to `8` for most pipelines. Higher only if the source is concurrency-safe and the runtime has the CPU/memory headroom.

## `staging_bucket` (sink)

| Goes under     | Default | Type                                        |
| -------------- | ------- | ------------------------------------------- |
| `sink.options` | unset   | string, must be `gs://` or `s3://` prefixed |

External staging bucket for warehouses that benefit from external-stage loads (BigQuery, Snowflake, Redshift). Without it, the loader streams rows directly into the warehouse, which is bandwidth-bound and slow for very large extracts.

```yaml
spec:
  sink:
    address: bigquery://my-project?credentials_path=/path/key.json
    options:
      dest_table: dataset.big_table
      incremental_strategy: append
      loader_file_format: parquet
      loader_file_size: 250000
      staging_bucket: gs://my-staging-bucket
```

The bucket must be writable by the destination's service principal (BigQuery service account, Snowflake `STAGE`, etc.). For BigQuery specifically, the bucket must live in the **same region** as the destination dataset.

## `DATA_WRITER__FILE_MAX_BYTES` (env var)

| Goes under                                | Default              | Type            |
| ----------------------------------------- | -------------------- | --------------- |
| `spec.use.projection.projections.envVars` | `134217728` (128 MB) | integer (bytes) |

Per-file byte cap for Iceberg (Lakehouse) writes. Controls the byte threshold at which Nilus closes the current Parquet file and starts a new one. The primary file-size lever for DataOS Lakehouse destinations.

When you want this byte cap to govern, set `loader_file_size` high (e.g. `2000000000`) so the row-count threshold is never reached first.

| Symptom                                                       | Direction                   | Why                                            |
| ------------------------------------------------------------- | --------------------------- | ---------------------------------------------- |
| Long commit tail (jobs at 100% but run continues for minutes) | Raise toward 512 MB         | Fewer files, fewer Iceberg manifest commits.   |
| Peak memory too high on a Lakehouse load                      | Lower toward 128 MB         | Smaller open file buffers.                     |
| Destination has thousands of tiny files                       | Raise toward 512 MB or 2 GB | One healthy Parquet file should be 128-512 MB. |

Common values: `134217728` (128 MB, default), `536870912` (512 MB, queryable), `2147483648` (2 GB, non-queryable fastest). For benchmark results by table shape, see [Tuning Large Lakehouse (Iceberg) Loads](/concepts/resources/nilus/pipeline-optimization/optimize-sink-datasets/optimize-lakehouse-iceberg-loads.md).

## `LOAD__WORKERS` (env var)

| Goes under                                | Default                      | Type    |
| ----------------------------------------- | ---------------------------- | ------- |
| `spec.use.projection.projections.envVars` | engine default (typically 5) | integer |

Loader concurrency: how many parallel workers stage, write, and commit output files. Different from `extract_parallelism`, which governs the read side.

This is the single biggest throughput knob for write-heavy pipelines once `loader_file_size` and `loader_file_format` are sensible. Raise until CPU or object-store bandwidth saturates, then back off one notch.

## `EXTRACT__WORKERS` (env var)

| Goes under                                | Default        | Type    |
| ----------------------------------------- | -------------- | ------- |
| `spec.use.projection.projections.envVars` | engine default | integer |

Extraction-side worker count. Raise alongside `LOAD__WORKERS` for balanced throughput on large Iceberg loads.

## `EXTRACT__MAX_PARALLEL_ITEMS` (env var)

| Goes under                                | Default        | Type    |
| ----------------------------------------- | -------------- | ------- |
| `spec.use.projection.projections.envVars` | engine default | integer |

Maximum number of source items processed in parallel during extraction. Useful for connectors that support parallel object reads.

## `NORMALIZE__WORKERS` (env var)

| Goes under                                | Default                      | Type    |
| ----------------------------------------- | ---------------------------- | ------- |
| `spec.use.projection.projections.envVars` | engine default (typically 1) | integer |

Concurrency for normalization / type coercion between extract and load. Most pipelines do not need to raise this beyond 1 to 2. Raise only if profiling shows normalization is the bottleneck (rare).

## `LOAD__PARALLELISM_STRATEGY` (env var)

| Goes under                                | Default                | Valid values             |
| ----------------------------------------- | ---------------------- | ------------------------ |
| `spec.use.projection.projections.envVars` | engine default (unset) | `parallel`, `sequential` |

How the loader schedules load units (tables/resources) across workers.

| Strategy     | Use it for                                                                                                                                                |
| ------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `parallel`   | Lakehouse and warehouse destinations that handle concurrency well (DataOS Lakehouse, BigQuery, Snowflake, Databricks).                                    |
| `sequential` | Fragile destinations or destinations with strict locking semantics (some MS SQL Server / Synapse setups, Elasticsearch with conservative cluster sizing). |

Combined env-var example:

```yaml
spec:
  use:
    projection:
      projections:
        envVars:
          - key: DATA_WRITER__FILE_MAX_BYTES
            template: "536870912"
          - key: LOAD__WORKERS
            template: "8"
          - key: EXTRACT__WORKERS
            template: "8"
          - key: EXTRACT__MAX_PARALLEL_ITEMS
            template: "16"
          - key: NORMALIZE__WORKERS
            template: "2"
          - key: LOAD__PARALLELISM_STRATEGY
            template: "parallel"
```


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter, and the optional `goal` query parameter:

```
GET https://v2.dataos.info/concepts/resources/nilus/pipeline-optimization/optimize-sink-datasets/optimize-throughput-knobs.md?ask=<question>&goal=<endgoal>
```

`ask` is the immediate question: it should be specific, self-contained, and written in natural language.
`goal` is optional and describes the broader end goal you are ultimately trying to accomplish on behalf of the user. GitBook uses it to tailor the answer towards what is most useful for that goal.

The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
