> For the complete documentation index, see [llms.txt](https://v2.dataos.info/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://v2.dataos.info/concepts/resources/nilus/pipeline-optimization/optimize-sink-datasets/optimize-lakehouse-iceberg-loads.md).

# Tuning Large Lakehouse (Iceberg) Loads

Lakehouse destinations commit data as Iceberg snapshots built from Parquet files. For large or wide tables this load path is the most memory-sensitive part of a Nilus pipeline: the runtime holds file metadata and write buffers in memory while it computes and commits each snapshot. This guide gives validated, benchmark-backed configurations for getting large Iceberg loads to finish fast and stay within a sane memory budget.

The numbers on this page were measured on Nilus `2.0.21` against AWS and Azure DataOS Lakehouse destinations, using two reference datasets:

* **Wide table**: 10M rows x 400 columns (\~19.1 GB).
* **Narrow table**: 150M rows x 8 columns (\~20.99 GB).

{% hint style="info" %}
This page is specific to Lakehouse / Iceberg destinations. For the full cross-destination knob reference, see [Optimize Sink Datasets](/concepts/resources/nilus/pipeline-optimization/optimize-sink-datasets.md). For generic time and resource tuning, see [Optimizing for Time](/concepts/resources/nilus/pipeline-optimization/optimizing-for-time.md) and [Optimizing for Resource](/concepts/resources/nilus/pipeline-optimization/optimizing-for-resource.md).
{% endhint %}

## Answer three questions first

Pick a configuration by answering these in order:

1. **What matters most: wall-clock time, or pod CPU/memory footprint?**
2. **If time matters: will downstream engines (Trino, Spark, Presto) query this table, or is it write-only / archive-only?**
3. **Is the table wide or narrow?**
   * **Wide**: many columns (100+), fewer rows relative to column count. File size grows quickly per row.
   * **Narrow**: few columns (under \~20) but very high row counts (100M+). File size grows slowly per row.

These three answers map directly to one of the configurations below.

## File size and queryability

Iceberg records per-column statistics (row counts, null counts, and min/max value bounds) for every data file in its manifests. Query engines use those bounds to skip files that cannot match a predicate. File size does not turn statistics on or off. It changes how useful they are:

* **Smaller files** carry tighter (narrower) per-file value ranges and give engines more granular file pruning and more read parallelism. The Nilus default per-file cap is 128 MB.
* **Larger files** mean fewer Iceberg catalog commits and faster writes, but each file spans a wider value range, so fewer files can be pruned, and they cost significantly more peak memory and reduce read parallelism.

The industry sweet spot for analytic tables is 128-512 MB per file (the Iceberg `write.target-file-size-bytes` default is 512 MB). Files at or below \~512 MB keep pruning effective. Very large full-scan-only tables sometimes go to 1 GB+. The 2 GB cap in the non-queryable profile below is a deliberate step past this range; it trades query pruning for the fewest commits.

{% hint style="info" %}
**File size is the secondary lever. Data layout is the primary one.** Min/max pruning only helps when rows with similar filter-column values are physically co-located: otherwise every file's range overlaps every filter and nothing is skipped. Partition on the columns your consumers filter on, and cluster within files, using `partition_by` and `cluster_by` (see [Shape Knobs](/concepts/resources/nilus/pipeline-optimization/optimize-sink-datasets/optimize-shape-knobs.md)). Get layout right first, then size files.
{% endhint %}

The per-file cap is controlled by `DATA_WRITER__FILE_MAX_BYTES` (a byte cap). `loader_file_size` is a row-count rotation threshold that also bounds file size. When you want the byte cap to govern, set `loader_file_size` high enough (e.g. `2000000000`) that the byte cap is reached first.

{% hint style="info" %}
**Very wide tables (100+ columns):** Iceberg collects column statistics for only the first \~100 columns by default (`write.metadata.metrics.max-inferred-column-defaults`). Columns past that limit have no min/max bounds, so filtering on them cannot prune files regardless of file size. If consumers filter on far-right columns of a very wide table, order the schema so those columns fall within the metrics range, or have the table owner widen the metrics default.
{% endhint %}

## 1. Optimize for time

Two profiles exist depending on whether the destination table needs to stay queryable.

### 1a. Fast: table must stay queryable

Use when downstream consumers query the Iceberg table and need good pruning. Files stay within the 512 MB queryable range.

```yaml
spec:
  use:
    projection:
      projections:
        envVars:
          - key: LOAD__WORKERS
            template: "16"
          - key: EXTRACT__WORKERS
            template: "8"
          - key: EXTRACT__MAX_PARALLEL_ITEMS
            template: "16"
```

No `DATA_WRITER__FILE_MAX_BYTES` is needed; the 128 MB default is already queryable. For slightly fewer files while staying queryable, set it to `536870912` (512 MB).

| Dataset                     | Destination | Pipeline time | Peak memory |
| --------------------------- | ----------- | ------------- | ----------- |
| Wide (10M x 400, 19.1 GB)   | AWS LH      | \~54 min      | \~6 GB      |
| Wide (10M x 400, 19.1 GB)   | Azure LH    | \~1h 6min     | \~5.4 GB    |
| Narrow (150M x 8, 20.99 GB) | AWS LH      | \~41 min      | \~2.1 GB    |
| Narrow (150M x 8, 20.99 GB) | Azure LH    | \~1h 25min    | \~1.6 GB    |

For narrow tables, set the cap to 512 MB explicitly. It produces the fewest files without hurting query pruning, and it is already the fastest option (see 1b).

### 1b. Fastest: table does NOT need to be queryable

Use when the output is write-only: raw archiving or downstream bulk processing that full-scans the table. Larger files mean fewer commits and faster writes, at the cost of much higher peak memory and weaker pruning if the table is later queried.

**Wide tables** benefit from a large (2 GB) cap:

```yaml
spec:
  use:
    projection:
      projections:
        envVars:
          - key: DATA_WRITER__FILE_MAX_BYTES
            template: "2147483648"
          - key: LOAD__WORKERS
            template: "16"
          - key: EXTRACT__WORKERS
            template: "8"
          - key: EXTRACT__MAX_PARALLEL_ITEMS
            template: "16"
  sink:
    options:
      loader_file_size: 2000000000
```

**Narrow tables** gain nothing past 512 MB. Keep the 1a config (512 MB cap). Going larger does not help and can be slower on AWS.

| Dataset | Destination | Pipeline time | Peak memory | Time saved vs 1a | Extra memory vs 1a |
| ------- | ----------- | ------------- | ----------- | ---------------- | ------------------ |
| Wide    | AWS LH      | \~38 min      | \~12.9 GB   | -16 min          | +6.9 GB            |
| Wide    | Azure LH    | \~1h 2min     | \~11.4 GB   | -4 min           | +6 GB              |
| Narrow  | AWS LH      | \~41 min      | \~2.1 GB    | no change        | no change          |
| Narrow  | Azure LH    | \~1h 23min    | \~1.4 GB    | -2 min           | no change          |

{% hint style="info" %}
For wide tables on AWS, the non-queryable config saves \~16 minutes, which is worth it when queryability is not needed. On Azure it saves only \~4 minutes for \~6 GB of extra memory, which is rarely worth it. For narrow tables there is no benefit to larger files at all.
{% endhint %}

## 2. Optimize for resource

Use when the pipeline is not time-critical (overnight / off-peak) or the cluster is shared and minimizing pod CPU/memory matters most. This runs on Nilus defaults, with no explicit worker or file-cap configuration.

```yaml
spec:
  source:
    options:
      page_size: 100000
```

| Dataset                     | Destination | Pipeline time | Peak memory |
| --------------------------- | ----------- | ------------- | ----------- |
| Wide (10M x 400, 19.1 GB)   | AWS LH      | \~57 min      | \~4.2 GB    |
| Wide (10M x 400, 19.1 GB)   | Azure LH    | \~1h 17min    | \~4.0 GB    |
| Narrow (150M x 8, 20.99 GB) | AWS LH      | \~2h 30min    | \~1.6 GB    |
| Narrow (150M x 8, 20.99 GB) | Azure LH    | \~1h 59min    | \~1.4 GB    |

**Trade-off vs the time-optimized 1a config:**

| Dataset | Cloud | Time delta       | Memory saved |
| ------- | ----- | ---------------- | ------------ |
| Wide    | AWS   | +3 min slower    | 1.8 GB less  |
| Wide    | Azure | +11 min slower   | 1.4 GB less  |
| Narrow  | AWS   | +1h 49min slower | 0.5 GB less  |
| Narrow  | Azure | +34 min slower   | 0.2 GB less  |

{% hint style="warning" %}
For **wide** tables the resource-optimized config is almost as fast as the time-optimized one (3-11 min penalty) for a meaningful memory saving. For **narrow** tables the penalty is severe (34 min to nearly 2 hours). Do not resource-optimize narrow tables unless time genuinely does not matter.
{% endhint %}

## Summary decision table

| Goal                    | Dataset | Config    | File cap         | Workers             | Expected time (AWS / Azure) |
| ----------------------- | ------- | --------- | ---------------- | ------------------- | --------------------------- |
| Fast + queryable        | Wide    | 1a        | Default (128 MB) | 16 load / 8 extract | 54 min / 1h 6min            |
| Fast + queryable        | Narrow  | 1a        | 512 MB           | 16 load / 8 extract | 41 min / 1h 25min           |
| Fastest, no query needs | Wide    | 1b        | 2 GB             | 16 load / 8 extract | 38 min / 1h 2min            |
| Fastest, no query needs | Narrow  | 1a (same) | 512 MB           | 16 load / 8 extract | 41 min / 1h 25min           |
| Minimize resources      | Wide    | 2         | Default          | Default             | 57 min / 1h 17min           |
| Minimize resources      | Narrow  | 2         | Default          | Default             | 2h 30min / 1h 59min         |

## Spilling to a persistent volume

When staging large Iceberg files, the pod's ephemeral filesystem may not be large enough; mounting a volume provides stable scratch space.

### Why a volume matters for Iceberg loads

Nilus does not stream rows straight into Iceberg. It stages intermediate Parquet files on the local filesystem, then commits them as an Iceberg snapshot. By default these staging files land on the pod's ephemeral filesystem, which is small, shared with the OS, and discarded when the pod restarts.

For large Lakehouse loads this is the failure point:

* **Disk exhaustion / eviction.** A wide-table load that buffers multi-hundred-MB-to-GB Parquet files can fill the ephemeral disk and get the pod evicted mid-run.
* **Memory pressure.** When a load is memory-bound, having durable, generously-sized scratch space on disk lets the runtime stage files there instead of being OOMKilled.

A DataOS Volume gives the pipeline durable, independently-sized scratch space decoupled from the pod's ephemeral disk. Pointing Nilus at it is the recommended pattern for any large Iceberg load.

### How to attach one

**1. Create or identify a DataOS Volume resource.** Have a platform operator provision a Volume resource (or reuse an existing one) sized for your largest expected staging footprint. As a rule of thumb, size it to at least a few multiples of your `DATA_WRITER__FILE_MAX_BYTES` cap x `LOAD__WORKERS`, so concurrent file writers all have room.

**2. Mount the volume and set `DATAOS_PERSISTENT_DIR` to the same path.** Reference the volume under `spec.use.volumes`, then set the `DATAOS_PERSISTENT_DIR` environment variable to the mount `directory` so Nilus stages files on the volume:

```yaml
spec:
  use:
    volumes:
      - id: <volume-id>
        directory: /var/dataos/public/nilus_scratch
        readOnly: false
    projection:
      projections:
        envVars:
          - key: DATAOS_PERSISTENT_DIR
            template: "/var/dataos/public/nilus_scratch"
```

| Field                   | Required        | Description                                                                                   |
| ----------------------- | --------------- | --------------------------------------------------------------------------------------------- |
| `id`                    | Yes             | Id of an existing DataOS Volume resource.                                                     |
| `directory`             | Yes             | In-container mount path for the volume.                                                       |
| `readOnly`              | No              | Defaults to `false`. Use `true` only for read-only reference mounts, never for scratch space. |
| `DATAOS_PERSISTENT_DIR` | Yes (for spill) | Must match `directory` so staging files are written to the volume.                            |

{% hint style="info" %}
The larger the per-file cap, the more scratch space a load needs. If you raise `DATA_WRITER__FILE_MAX_BYTES` (config 1b) and run many `LOAD__WORKERS`, size the volume accordingly, since several large Parquet files can be open at once.
{% endhint %}

### Full example: volume + 512 MB balanced profile

This is the validated configuration for the wide benchmark dataset (MSSQL to Azure Lakehouse, \~4.7 GB peak memory, best time-to-memory balance), with the volume wired up:

```yaml
name: mssql-az-lh-512mb
version: v1alpha
type: nilus
description: MSSQL to Azure Lakehouse, 512 MB file cap, volume-backed
spec:
  type: batch
  compute: <compute-name>
  logLevel: ERROR
  use:
    volumes:
      - id: <volume-id>
        directory: /var/dataos/public/mssql_az_512mb
        readOnly: false
    projection:
      projections:
        envVars:
          - key: DATAOS_PERSISTENT_DIR
            template: "/var/dataos/public/mssql_az_512mb"
          - key: DATA_WRITER__FILE_MAX_BYTES
            template: "536870912"
          - key: LOAD__WORKERS
            template: "16"
          - key: EXTRACT__WORKERS
            template: "8"
          - key: EXTRACT__MAX_PARALLEL_ITEMS
            template: "16"
  source:
    address: dataos://<mssql-depot>?purpose=rw
    options:
      source_table: "dbo.synthetic_wide_10m_400"
      page_size: 100000
  sink:
    address: dataos://<lakehouse-depot>?purpose=rw
    options:
      dest_table: <schema>.<table>
      incremental_strategy: append
```

See [Persistent volumes and restricted runtimes](/concepts/resources/nilus/batch/pipeline-config.md#4-persistent-volumes-and-restricted-runtimes) for the `spec.use.volumes` field contract.

## Ongoing Iceberg table maintenance

Write-time tuning controls the files Nilus produces. It does not maintain the table over its lifetime. An Iceberg table that is loaded repeatedly, especially with `merge` or CDC, accumulates small files, delete files, stale snapshots, and fragmented manifests that slowly degrade read performance no matter how well each load is sized. This maintenance is a destination-side responsibility and runs from the engine that owns the table (Spark, Trino, Dremio, etc.), not from the Nilus pipeline:

* **Compaction** (`rewrite_data_files`) bin-packs small files into target-sized ones. It is the canonical fix for the small-file problem, because the per-file cap is only a per-task upper bound and never merges files that are already written. Compaction is also where you apply a sort order to physically cluster data and tighten min/max bounds for better pruning.
* **Snapshot expiration** (`expire_snapshots`) drops old snapshots so metadata stays lean and obsolete data files can be reclaimed. Retain a window (commonly 7 days) for time travel.
* **Manifest rewrite** (`rewrite_manifests`) consolidates many small manifests created by frequent commits, keeping query planning fast.
* **Orphan file cleanup** (`remove_orphan_files`) deletes files no longer referenced by any snapshot. Always keep a safety window (e.g. 3 days) so in-flight writes are not removed.

Tables written by high-frequency or `merge`/CDC loads need this maintenance most. Schedule it during off-peak hours so it does not contend with queries. See the [Apache Iceberg maintenance docs](https://iceberg.apache.org/docs/latest/maintenance/) for procedures.

## Notes

* Azure is consistently 15-45 minutes slower than AWS across every configuration and dataset shape. This is a Lakehouse catalog characteristic, not a Nilus tuning issue.
* Merge on large datasets should always be incremental. A full-table `merge` on the wide 10M-row table took \~7h 53min. A `merge` rewrites every data file that contains an affected row (copy-on-write), so the cost scales with the size of the files touched, not the number of changed rows. Full-table merges amplify writes enormously. Load only changed/new rows per run so the merge window stays small, and rely on destination-side compaction to keep the table tidy between loads.
* Normalization time is negligible (under \~2 seconds in every run). It is not a tuning target.

## Related docs

* [Optimize Sink Datasets](/concepts/resources/nilus/pipeline-optimization/optimize-sink-datasets.md)
* [Shape Knobs](/concepts/resources/nilus/pipeline-optimization/optimize-sink-datasets/optimize-shape-knobs.md)
* [Optimizing for Time](/concepts/resources/nilus/pipeline-optimization/optimizing-for-time.md)
* [Optimizing for Resource](/concepts/resources/nilus/pipeline-optimization/optimizing-for-resource.md)
* [Pipeline Optimization](/concepts/resources/nilus/pipeline-optimization.md)
* [Apache Iceberg: Performance](https://iceberg.apache.org/docs/latest/performance/)
* [Apache Iceberg: Maintenance](https://iceberg.apache.org/docs/latest/maintenance/)
* [Apache Iceberg: Configuration](https://iceberg.apache.org/docs/latest/configuration/)


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter, and the optional `goal` query parameter:

```
GET https://v2.dataos.info/concepts/resources/nilus/pipeline-optimization/optimize-sink-datasets/optimize-lakehouse-iceberg-loads.md?ask=<question>&goal=<endgoal>
```

`ask` is the immediate question: it should be specific, self-contained, and written in natural language.
`goal` is optional and describes the broader end goal you are ultimately trying to accomplish on behalf of the user. GitBook uses it to tailor the answer towards what is most useful for that goal.

The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.