
# Optimizing for Resource

Use this guide when a pipeline runs correctly but consumes too much CPU, memory, or destination capacity.

## Resource symptoms

| Symptom                            | Likely cause                                                                                       | First response                                                                                              |
| ---------------------------------- | -------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------- |
| Memory spikes during extraction    | Pages or batches are too large for the data shape.                                                 | Reduce `page_size`; inspect nested or wide columns.                                                         |
| Memory spikes during load          | Loader batches are too large or `DATA_WRITER__FILE_MAX_BYTES` is set too high for the table shape. | Reduce `loader_file_size` or lower `DATA_WRITER__FILE_MAX_BYTES`; review nested fields.                     |
| CPU is high and throughput is high | The pipeline may simply be busy.                                                                   | Check whether runtime is acceptable before tuning.                                                          |
| CPU is high and throughput is low  | Backpressure, expensive normalization, or retries.                                                 | Check logs and stage-level metrics.                                                                         |
| Destination cost rises             | File count, partitions, clustering, or write strategy may be inefficient.                          | Review [Optimize Sink Datasets](/concepts/resources/nilus/pipeline-optimization/optimize-sink-datasets.md). |
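| OOMKilled on a large Iceberg load  | Ephemeral pod disk is too small; staging files fill it and the pod is evicted.                     | Attach a DataOS Volume for scratch space. See [Attaching a persistent volume for large Iceberg loads](#attaching-a-persistent-volume-for-large-iceberg-loads). |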

## Tuning order

1. Lower `page_size` before increasing memory limits.
2. Lower `extract_parallelism` if parallel reads overwhelm memory or source limits.
3. Lower `loader_file_size` or `DATA_WRITER__FILE_MAX_BYTES` if individual load batches are too large.
4. Reduce nested expansion with `max_table_nesting` when the source emits wide nested records.
5. Confirm `incremental_strategy` is appropriate. Rewriting a large table on every run is expensive.
6. Scale compute only after the effect of the config-level changes is understood.

## Lakehouse / Iceberg loads: when to reduce file size

For Lakehouse (Iceberg) destinations with wide or high-volume tables, a large `DATA_WRITER__FILE_MAX_BYTES` cap reduces the number of Iceberg commits but raises peak memory sharply. For a wide table of 10M rows x 400 columns (19.1 GB), moving from the default 128 MB cap to 2 GB raises peak memory from \~4.2 GB to \~12.9 GB, roughly a threefold increase.

Use the resource-optimized profile (no explicit file-size or worker configuration) when time is not critical. For narrow tables (high row count, few columns), the resource-optimized profile is rarely worth it: it saves only \~0.2-0.5 GB of memory while adding between 34 min and 1 h 49 min of runtime. See [Tuning Large Lakehouse (Iceberg) Loads](/concepts/resources/nilus/pipeline-optimization/optimize-sink-datasets/optimize-lakehouse-iceberg-loads.md) for the full comparison.
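
Where a run does set the cap explicitly, it might look like the sketch below. This assumes `DATA_WRITER__FILE_MAX_BYTES` is supplied through the same `envVars` projection this page uses for `DATAOS_PERSISTENT_DIR`, and that the value is a plain byte count. Neither assumption is confirmed on this page, so verify the mechanism against the Lakehouse tuning guide before relying on it.

```yaml
spec:
  use:
    projection:
      projections:
        envVars:
          # 128 MB (the default cap) written as bytes: 128 x 1024 x 1024.
          # Assumption: the value is read as a byte count, as the
          # variable name suggests.
          - key: DATA_WRITER__FILE_MAX_BYTES
            template: "134217728"
```

Omitting the entry altogether should be equivalent to this sketch, since 128 MB is the stated default.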

## Knob reference

The following settings reduce peak memory and CPU usage.

### `page_size` (source)

| Goes under       | Default                 | Type    |
| ---------------- | ----------------------- | ------- |
| `source.options` | `50000` (rows per page) | integer |

Controls the in-memory writer buffer: the number of rows held before a flush. This buffer is the dominant memory consumer during extraction. Despite the name, `page_size` does not control source-side pagination for SQL sources.

| Symptom                                        | Direction                                  | Why                                                   |
| ---------------------------------------------- | ------------------------------------------ | ----------------------------------------------------- |
| Pipeline is OOMKilled                          | Lower `page_size` (try `25000` or `10000`) | Smaller buffer per page.                              |
| Source is hammered with too many small queries | Raise `page_size` (try `100000`+)          | Fewer, larger reads.                                  |
| Wide rows (1 KB+) exhaust runtime memory       | Lower `page_size`                          | Page memory bound is roughly `page_size x row_bytes`. |

Tuning range: `10000` to `150000` is normal.
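
As a minimal sketch of the first OOM response in the table above, the fragment below halves `page_size` from its default. Only the `source.options` placement and the values come from this page. The rest of the pipeline manifest is omitted, and the \~1 KB row width in the comment is an assumed figure for illustration.

```yaml
source:
  options:
    # Default is 50000 rows per page. Using the rough bound
    # page_size x row_bytes, 25000 rows x ~1 KB per row (assumed)
    # comes to about 25 MB per page buffer, half of the default.
    page_size: 25000
```

If the pipeline is still OOMKilled at this setting, `10000` is the next value suggested above.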

### `extract_parallelism` (source)

| Goes under       | Default | Type    |
| ---------------- | ------- | ------- |
| `source.options` | `5`     | integer |

Number of concurrent extraction workers per pipeline. Each worker holds its own page buffer, so extraction memory grows with the worker count. Lower this setting when memory still spikes during extraction after `page_size` has been reduced.

Tuning range: `2` to `8` for most pipelines.
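
A sketch of lowering extraction concurrency, assuming `extract_parallelism` sits in the same `source.options` block as `page_size`. The value `2` is the low end of the tuning range above. The multiplication in the comment is an estimate derived from "each worker holds its own page buffer", not a measured figure.

```yaml
source:
  options:
    # Default is 5 workers. With one page buffer per worker, buffered
    # rows scale roughly with extract_parallelism x page_size:
    # 2 x 50000 = 100000 rows, down from 5 x 50000 = 250000.
    extract_parallelism: 2
    page_size: 50000   # left at the default in this sketch
```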

### Sampling to cut width and startup cost

For testing and validation, limiting the columns extracted can significantly reduce memory and startup time.

**`sql_exclude_columns`** drops specific columns during extraction. Use it to remove large blob or debug columns that are not needed downstream:

```yaml
source:
  options:
    sql_exclude_columns: "debug_payload,raw_html"
```

Prefer `sql_exclude_columns` over `type_hints: { col: null }` for excluding wide or blob-heavy columns. `sql_exclude_columns` removes them from the SELECT before Nilus reads any data; `type_hints` still fetches them and applies type coercion.

**`sql_reflection_level`** controls how thoroughly Nilus reflects the source schema before extraction.

| Goes under       | Default | Valid values      |
| ---------------- | ------- | ----------------- |
| `source.options` | `full`  | `full`, `limited` |

`full` (the default) inspects every column, type, and constraint. `limited` reduces reflection depth on very wide source tables (1000+ columns), which speeds up extraction startup at the cost of less accurate type inference. This is typically fine when `type_hints` is set explicitly for the columns that matter.
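
For a very wide source table, the two options can be combined. This sketch assumes both keys sit side by side under `source.options`, as the table and example above indicate. The column names are reused from the earlier example and are placeholders.

```yaml
source:
  options:
    # Drop wide columns from the SELECT before any data is read.
    sql_exclude_columns: "debug_payload,raw_html"
    # Shallower schema reflection: faster startup, less accurate
    # type inference. Set type_hints explicitly where types matter.
    sql_reflection_level: limited
```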

## Attaching a persistent volume for large Iceberg loads

For large Lakehouse loads, staging files can exhaust the pod's ephemeral disk and cause eviction. A DataOS Volume gives the pipeline durable scratch space. Attach one by declaring it under `spec.use.volumes` and pointing `DATAOS_PERSISTENT_DIR` at the same mount path:

```yaml
spec:
  use:
    volumes:
      - id: <volume-id>
        directory: /var/dataos/public/nilus_scratch
        readOnly: false
    projection:
      projections:
        envVars:
          - key: DATAOS_PERSISTENT_DIR
            template: "/var/dataos/public/nilus_scratch"
```

Size the volume to at least a few multiples of your `DATA_WRITER__FILE_MAX_BYTES` cap x `LOAD__WORKERS`. See [Persistent volumes and restricted runtimes](/concepts/resources/nilus/batch/pipeline-config.md#4-persistent-volumes-and-restricted-runtimes) and [Tuning Large Lakehouse (Iceberg) Loads](/concepts/resources/nilus/pipeline-optimization/optimize-sink-datasets/optimize-lakehouse-iceberg-loads.md) for the full setup.
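
The sizing rule can be worked through with example numbers. In the sketch below, the worker count of `4` and the multiple of `3` are assumptions chosen for illustration, not Nilus defaults or recommendations. The 128 MB and 2 GB caps are the two values compared earlier on this page.

```yaml
# Hypothetical sizing arithmetic (multiple x cap x workers):
#   default cap: 3 x 128 MB x 4 workers = 1536 MB, about 1.5 GB
#   raised cap:  3 x 2 GB   x 4 workers = 24 GB
spec:
  use:
    volumes:
      - id: <volume-id>   # provision at least the size computed above
        directory: /var/dataos/public/nilus_scratch
        readOnly: false
```

Raising the cap from 128 MB to 2 GB multiplies the scratch requirement by 16 under this rule, which is why the volume should be sized before the cap is changed.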

## Resource sizing checklist

* Set the memory limit higher than the observed peak with a practical buffer.
* Do not let CPU limits throttle a run that is otherwise healthy.
* Do not let scheduled pipelines overlap unless concurrency is intended.
* Match destination write capacity to the configured load concurrency.
* For large Iceberg loads, attach a persistent volume before raising `DATA_WRITER__FILE_MAX_BYTES` past the default.

## Related docs

* [Pipeline Optimization](/concepts/resources/nilus/pipeline-optimization.md)
* [Optimizing for Time](/concepts/resources/nilus/pipeline-optimization/optimizing-for-time.md)
* [Tuning Large Lakehouse (Iceberg) Loads](/concepts/resources/nilus/pipeline-optimization/optimize-sink-datasets/optimize-lakehouse-iceberg-loads.md)
* [Observability](/concepts/resources/nilus/observability.md)
* [Checking Logs](/concepts/resources/nilus/troubleshooting/checking-logs.md)

