# Optimizing for Time

Use this guide when a pipeline is correct but does not finish inside the expected window.

## Find the bottleneck

| Signal                                      | Likely bottleneck                                     | Next check                                                                                           |
| ------------------------------------------- | ----------------------------------------------------- | ---------------------------------------------------------------------------------------------------- |
| High duration and low records/sec           | Source extraction or destination loading.             | Compare stage timing and destination-side load logs.                                                 |
| Extract stage dominates                     | Source pagination, source throttling, or query shape. | Try a bounded table, narrower object, or connector-specific filters.                                 |
| Load stage dominates                        | Destination write throughput or file layout.          | Review `DATA_WRITER__FILE_MAX_BYTES`, `loader_file_size`, destination permissions, and staging path. |
| CPU is high with steady throughput          | Compute-bound processing.                             | Increase compute only if the workload scales with it.                                                |
| CPU and memory are low but duration is high | Waiting on source, network, or destination.           | Check source API/warehouse limits and destination load queues.                                       |
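The table above can be read as a small decision procedure. The following is a minimal sketch of that procedure, not part of Nilus: the function name, the input shape, and all three thresholds (`dominance`, `low_util`, `high_util`) are illustrative assumptions that you would tune against your own run metrics.

```python
def likely_bottleneck(extract_s, load_s, cpu_util, mem_util,
                      dominance=0.6, low_util=0.3, high_util=0.8):
    """Map coarse run signals to a row of the bottleneck table.

    extract_s / load_s: stage durations in seconds.
    cpu_util / mem_util: average utilization as fractions in [0, 1].
    The thresholds are assumptions for illustration, not Nilus defaults.
    """
    total = extract_s + load_s
    if total <= 0:
        raise ValueError("stage durations must sum to a positive value")

    # A single stage taking most of the run points at that stage first.
    if extract_s / total >= dominance:
        return "extract: check source pagination, throttling, or query shape"
    if load_s / total >= dominance:
        return "load: check destination write throughput or file layout"

    # No dominant stage: fall back to resource utilization.
    if cpu_util >= high_util:
        return "compute-bound: add compute only if the workload scales with it"
    if cpu_util <= low_util and mem_util <= low_util:
        return "waiting: check source, network, or destination limits"
    return "unclear: compare stage timing and destination-side load logs"


print(likely_bottleneck(extract_s=900, load_s=100, cpu_util=0.4, mem_util=0.4))
```

Stage dominance is checked before utilization because a dominant stage narrows the search regardless of how busy the CPU is.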

## Tuning order

1. Confirm the row volume and source object are expected.
2. Use `incremental_key` and bounded intervals where supported.
3. Increase `page_size` only if memory remains healthy.
4. Increase `extract_parallelism` only for connectors where parallel extraction is supported.
5. Tune `DATA_WRITER__FILE_MAX_BYTES` (Iceberg destinations) or `loader_file_size` to avoid excessive tiny files or very large loader batches.
6. Increase load or normalization workers only when destination and compute capacity can use the extra concurrency.
7. Recheck destination query behavior after changing partitions or clustering.

## Lakehouse / Iceberg loads: size files by table shape

For Lakehouse (Iceberg) destinations, file size is the dominant time factor on large loads. It is governed by `DATA_WRITER__FILE_MAX_BYTES`, a per-file byte cap set via `spec.use.projection.projections.envVars`.

| Table shape             | Will it be queried?      | File cap                 | Why                                                         |
| ----------------------- | ------------------------ | ------------------------ | ----------------------------------------------------------- |
| Wide (100+ columns)     | Yes                      | Default 128 MB or 512 MB | Keeps Iceberg per-file stats for pruning.                   |
| Wide (100+ columns)     | No (archive / full-scan) | 2 GB                     | Fewer commits, fastest write; costs more memory.            |
| Narrow (high row count) | Either                   | 512 MB                   | Larger files give no time benefit and can be slower on AWS. |

At 512 MB, Iceberg's per-file column statistics are still selective enough for effective pruning. Past 1 GB, per-file value ranges widen and pruning weakens. Reserve the 2 GB cap for tables that are archived or only ever read by full scan, where pruning does not matter. For the full benchmark and volume YAML, see [Tuning Large Lakehouse (Iceberg) Loads](/concepts/resources/nilus/pipeline-optimization/optimize-sink-datasets/optimize-lakehouse-iceberg-loads.md).
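The sizing table can be expressed as a small lookup. This is a sketch under stated assumptions: the function name is hypothetical, "wide" is taken to mean 100 or more columns as in the table, and for wide queried tables it returns 512 MB, the larger of the two caps the table allows (the 128 MB default is equally valid there).

```python
MIB = 1024 * 1024  # the byte values in this guide are powers of two


def recommended_file_cap(column_count, queried, wide_threshold=100):
    """Return a DATA_WRITER__FILE_MAX_BYTES value following the sizing table.

    column_count: number of columns in the table.
    queried: True if the table will be queried selectively,
             False for archive or full-scan-only tables.
    wide_threshold: assumed cutoff for a "wide" table (100+ columns).
    """
    is_wide = column_count >= wide_threshold
    if is_wide and not queried:
        return 2048 * MIB  # 2 GB: fewest commits, costs more memory
    # Wide queried tables and all narrow tables stay at 512 MB.
    return 512 * MIB


print(recommended_file_cap(column_count=250, queried=False))
print(recommended_file_cap(column_count=12, queried=True))
```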

## Knob reference

The following settings control time-optimized pipeline runs.

### `DATA_WRITER__FILE_MAX_BYTES` (env var)

| Goes under                                | Default              | Type            |
| ----------------------------------------- | -------------------- | --------------- |
| `spec.use.projection.projections.envVars` | `134217728` (128 MB) | integer (bytes) |

The primary Iceberg file-size lever. It sets the byte threshold at which Nilus closes the current Parquet file and starts a new one. Set `loader_file_size` high (e.g. `2000000000`) so that the byte cap, not the row count, is what closes each file.

Common values: `134217728` (128 MB, the default), `536870912` (512 MB, a balanced choice for queryable tables), `2147483648` (2 GB, the fastest write for tables that will not be queried).
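To see why the cap matters, it helps to estimate how many files a load will produce. The sketch below is a rough approximation, not Nilus behavior: it assumes every file fills to the cap before closing (so `loader_file_size` is set high enough not to interfere) and ignores compression and the final partial file's size. The function name is hypothetical.

```python
MIB = 1024 * 1024


def estimated_file_count(total_bytes, file_max_bytes=128 * MIB):
    """Approximate number of Parquet files for a load of total_bytes.

    Assumes each file is closed only when it reaches file_max_bytes.
    """
    if total_bytes < 0 or file_max_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and file_max_bytes > 0")
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return (total_bytes + file_max_bytes - 1) // file_max_bytes


load_size = 100 * 1024 * MIB  # a 100 GB load
for cap_mib in (128, 512, 2048):
    print(cap_mib, "MB cap ->", estimated_file_count(load_size, cap_mib * MIB), "files")
```

Under these assumptions, a 100 GB load drops from 800 files at the default cap to 200 at 512 MB and 50 at 2 GB.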

### `staging_bucket` (sink)

| Goes under     | Default | Type                                        |
| -------------- | ------- | ------------------------------------------- |
| `sink.options` | unset   | string, must be `gs://` or `s3://` prefixed |

External staging bucket for warehouses that benefit from external-stage loads (BigQuery, Snowflake, Redshift). Without it, the loader streams rows directly into the warehouse, which is bandwidth-bound and slow for very large extracts. The bucket must be writable by the destination's service principal. For BigQuery, the bucket must be in the same region as the destination dataset.
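A quick pre-flight check can catch a malformed `staging_bucket` before a run starts. This is a minimal sketch with a hypothetical function name. It only checks the prefix rule stated in the table above, not whether the bucket exists, is writable, or is in the right region.

```python
def is_valid_staging_bucket(value):
    """Return True if value is a non-empty gs:// or s3:// location."""
    if not isinstance(value, str):
        return False
    for prefix in ("gs://", "s3://"):
        # Require something after the scheme, e.g. a bucket name.
        if value.startswith(prefix) and len(value) > len(prefix):
            return True
    return False


print(is_valid_staging_bucket("s3://example-bucket/staging"))
print(is_valid_staging_bucket("example-bucket/staging"))
```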

### `LOAD__WORKERS` (env var)

| Goes under                                | Default                      | Type    |
| ----------------------------------------- | ---------------------------- | ------- |
| `spec.use.projection.projections.envVars` | engine default (typically 5) | integer |

Number of load workers running concurrently. This is the single biggest throughput knob for write-heavy pipelines once `loader_file_size` and `loader_file_format` are sensible. Raise it until CPU or object-store bandwidth saturates, then back off one notch.
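The "raise until saturated, then back off one notch" procedure can be sketched as a simple search. Everything here is an assumption for illustration: `measure_throughput` is a hypothetical callback that runs the pipeline with a given `LOAD__WORKERS` value and returns records per second, and `min_gain` is an arbitrary 5% cutoff for treating a step as saturated.

```python
def pick_worker_count(measure_throughput, candidates, min_gain=0.05):
    """Try increasing worker counts and return the last one that still helped.

    measure_throughput: callable(worker_count) -> records/sec (hypothetical).
    candidates: worker counts in ascending order.
    min_gain: minimum relative improvement that counts as real progress.
    """
    best_count = None
    best_rate = 0.0
    for count in candidates:
        rate = measure_throughput(count)
        # Saturation: the new setting did not beat the previous best by
        # min_gain, so stop and keep the previous count (one notch back).
        if best_count is not None and rate < best_rate * (1 + min_gain):
            break
        best_count, best_rate = count, rate
    return best_count


# Example with made-up measurements standing in for real runs.
observed = {5: 100.0, 8: 150.0, 12: 190.0, 16: 192.0, 24: 180.0}
print(pick_worker_count(observed.get, [5, 8, 12, 16, 24]))
```

Changing one knob per measurement keeps the comparison meaningful, which is also why raising every concurrency setting at once is discouraged later in this guide.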

### `NORMALIZE__WORKERS` (env var)

| Goes under                                | Default                      | Type    |
| ----------------------------------------- | ---------------------------- | ------- |
| `spec.use.projection.projections.envVars` | engine default (typically 1) | integer |

Concurrency for normalization and type coercion between extract and load. Most pipelines do not need more than 1 or 2 workers.

### `EXTRACT__WORKERS` (env var)

| Goes under                                | Default        | Type    |
| ----------------------------------------- | -------------- | ------- |
| `spec.use.projection.projections.envVars` | engine default | integer |

Extraction-side worker count. Raise alongside `LOAD__WORKERS` for balanced throughput on large Iceberg loads.

### `EXTRACT__MAX_PARALLEL_ITEMS` (env var)

| Goes under                                | Default        | Type    |
| ----------------------------------------- | -------------- | ------- |
| `spec.use.projection.projections.envVars` | engine default | integer |

Maximum number of source items processed in parallel during extraction. Useful for connectors that support parallel object reads.

### `LOAD__PARALLELISM_STRATEGY` (env var)

| Goes under                                | Default                | Valid values             |
| ----------------------------------------- | ---------------------- | ------------------------ |
| `spec.use.projection.projections.envVars` | engine default (unset) | `parallel`, `sequential` |

How the loader schedules load units across workers. Use `parallel` for Lakehouse and warehouse destinations that handle concurrency well. Use `sequential` for fragile destinations or those with strict locking semantics.

### Passing env vars via projection

All runtime env vars above are set through `spec.use.projection.projections.envVars`:

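```yaml
spec:
  use:
    projection:
      projections:
        envVars:
          - key: DATA_WRITER__FILE_MAX_BYTES
            template: "536870912"   # 512 MB per-file byte cap
          - key: LOAD__WORKERS
            template: "16"          # load-side workers
          - key: EXTRACT__WORKERS
            template: "8"           # extraction-side workers
          - key: EXTRACT__MAX_PARALLEL_ITEMS
            template: "16"          # source items processed in parallel
          - key: LOAD__PARALLELISM_STRATEGY
            template: "parallel"    # or "sequential" for fragile destinations
```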
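If you generate manifests programmatically, the `key` / `template` entries can be built from a plain mapping. The sketch below uses only the standard library and hypothetical helper names. It mirrors the layout of the example above, including quoting every value as a string, which is assumed here to be required because the example does it.

```python
def env_var_entries(settings):
    """Turn {NAME: value} into the key/template entries used under envVars."""
    return [{"key": name, "template": str(value)} for name, value in settings.items()]


def to_yaml_lines(entries, indent=10):
    """Render entries as YAML list items at the given indentation."""
    pad = " " * indent
    lines = []
    for entry in entries:
        lines.append(f'{pad}- key: {entry["key"]}')
        # Values are always emitted as quoted strings, matching the example.
        lines.append(f'{pad}  template: "{entry["template"]}"')
    return "\n".join(lines)


entries = env_var_entries({
    "DATA_WRITER__FILE_MAX_BYTES": 512 * 1024 * 1024,
    "LOAD__WORKERS": 16,
})
print(to_yaml_lines(entries))
```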

## Sampling for faster validation

Use these during testing, validation, or partial backfills. Do not set them on steady-state schedules.

### `sql_limit`

Caps the total rows extracted per run. Applied as a `LIMIT` clause on SQL sources. Common range: `1000` to `100000`.

### `yield_limit`

Caps the number of pages yielded by the source. Page-level analogue of `sql_limit`. Works for non-SQL sources too. For example, with `page_size: 50000` and `yield_limit: 4`, the source yields at most 200,000 rows.
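The row ceiling is just the product of the two settings, and the same arithmetic run in reverse gives the `yield_limit` needed for a target sample size. A minimal sketch with hypothetical helper names:

```python
def max_rows(page_size, yield_limit):
    """Upper bound on rows yielded: pages allowed times rows per page."""
    return page_size * yield_limit


def yield_limit_for(target_rows, page_size):
    """Smallest yield_limit whose row ceiling covers target_rows."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    # Integer ceiling division: a partial page still costs one full page.
    return (target_rows + page_size - 1) // page_size


print(max_rows(page_size=50000, yield_limit=4))
print(yield_limit_for(target_rows=120000, page_size=50000))
```

The ceiling is an upper bound: a source with fewer rows than the cap, or a short final page, yields less.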

## Avoid these shortcuts

* Do not use `sql_limit` as a production speed fix. It is a sampling and testing knob.
* Do not raise every concurrency knob at once.
* Do not optimize a failed run before reading the error.
* Do not switch from `merge` to `append` just to reduce runtime unless downstream consumers explicitly want history instead of current state.
* Do not run a full-table `merge` on a very large Lakehouse table. PyIceberg row-level upserts make it extremely slow. Use an incremental `merge` that loads only changed rows per run.

## Related docs

* [Pipeline Optimization](/concepts/resources/nilus/pipeline-optimization.md)
* [Optimizing for Resource](/concepts/resources/nilus/pipeline-optimization/optimizing-for-resource.md)
* [Optimize Sink Datasets](/concepts/resources/nilus/pipeline-optimization/optimize-sink-datasets.md)
* [Tuning Large Lakehouse (Iceberg) Loads](/concepts/resources/nilus/pipeline-optimization/optimize-sink-datasets/optimize-lakehouse-iceberg-loads.md)
* [Grafana Dashboards](/concepts/resources/nilus/observability/grafana-dashboard.md)

