> For the complete documentation index, see [llms.txt](https://v2.dataos.info/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://v2.dataos.info/concepts/resources/nilus/pipeline-optimization/optimize-sink-datasets/optimize-destination-gotchas-and-troubleshooting.md).

# Destination Gotchas & Troubleshooting

*Part of* [*Optimize Sink Datasets*](/concepts/resources/nilus/pipeline-optimization/optimize-sink-datasets.md)*.*

## Destination-specific gotchas

Known constraints and quirks, grouped by destination.

### Snowflake

* Auto-clusters in the background using warehouse credits. `cluster_by` does not do real-time re-clustering.
* A warehouse must be active for every read/write. A small dedicated ingestion warehouse with `AUTO_SUSPEND = 60` is the standard cost-control pattern.
* `loader_file_format: parquet` is auto-set. Safe to leave unset.

### BigQuery

* Bills per byte processed. Align `incremental_key` with the destination table's partition column so query jobs prune partitions.
* **Set `loader_file_format: parquet` explicitly**: parquet is not auto-set for BigQuery (the relevant `PARQUET_SUPPORTED_DESTINATIONS` entry is a stale typo at the time of writing).
* Clustered tables must be created with explicit DDL. BigQuery does not re-cluster existing tables in place. Re-clustering after-the-fact requires a `CREATE TABLE AS SELECT` rebuild.
* `staging_bucket` must live in the same region as the destination dataset.

### DataOS Lakehouse

* Iceberg is highly sensitive to file count. Keep `loader_file_size` large (200k to 500k rows) and partition counts low.
* `loader_file_format: parquet` is **not** auto-set. Set it explicitly.
* High-cardinality `partition_by` with `type: identity` will explode the manifest. Use `type: bucket` instead.

### Redshift

* `incremental_strategy: replace` performs `DELETE` + reload, which is materially slower than Snowflake's `TRUNCATE` + reload.
* 1,600 columns per table; 127-byte identifier limit. Wide source tables may need `sql_exclude_columns` to fit.
* Wire-protocol-compatible with PostgreSQL. Most PG-batch troubleshooting applies.

### PostgreSQL

* Partitioned tables on the source side need the dedicated runbook. See [Working with PostgreSQL Partitioned Tables for CDC](/concepts/resources/nilus/troubleshooting/postgresql-cdc-partitioned-tables.md).
* Identifier case folding (lower-case unless quoted) catches most "table not found" reports.

### MongoDB

* Destination `incremental_strategy: merge` requires `primary_key` (raises `Merge operation requires primary keys for table '<collection>'`).
* Destination `incremental_strategy: append` is **not** supported (raises `Unsupported write disposition 'append' for MongoDB destination`).
* Source-side `filter_`, `projection`, `chunk_size`, `parallel`, `data_item_format` are **not** forwarded by the Nilus wrapper. Use the `database.collection:[<aggregation-pipeline>]` form for source-side filtering.

### Elasticsearch

* Destination supports `replace` only: drop-and-recreate-index semantics with batch\_size=1000 and request\_timeout=60.
* High-volume runs benefit from a pre-defined index template on the cluster side, since the destination does not configure mappings.

### MS SQL Server

* ODBC Driver 18 enforces TLS by default. Use `TrustServerCertificate=yes` only for development.
* Two auth modes: SQL Authentication, and Azure AD `Authentication=ActiveDirectoryAccessToken` via `MSSQL_COPT_SS_ACCESS_TOKEN`.

### AWS S3 file-system

* `incremental_strategy: replace` requires `s3:DeleteObject` IAM permission.
* `dest_table` must be in `<bucket-name>/<path>` format, not a database-style schema.

## Troubleshooting

| Symptom                                                           | Likely cause                                                                                                         | Resolution                                                                                                                                                                                       |
| ----------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| Pipeline is `OOMKilled`                                           | `page_size × row_bytes × extract_parallelism` exceeds pod memory                                                     | Lower `extract_parallelism` first, then `page_size`. Keep `loader_file_size` large. Raise pod memory if needed.                                                                                  |
| Run hits 100% jobs but stalls for many minutes                    | Commit-tail, too many small output files                                                                             | Raise `loader_file_size`. Consider coarser `partition_by`. Ensure object-store bandwidth is adequate.                                                                                            |
| Source database throttles / rejects connections                   | `extract_parallelism` too high, or `page_size` too large for source's per-query budget                               | Lower both. Confirm the source's connection-pool limits.                                                                                                                                         |
| Destination has thousands of tiny files                           | `loader_file_size` too low, or `partition_by` over-partitioned                                                       | Raise `loader_file_size`, simplify partitioning.                                                                                                                                                 |
| `additionalProperties` validation error on sink                   | Sink option is not in the schema's allow-list                                                                        | Remove the option. Allowed sink options: `aws_region`, `aws_endpoint`, `cluster_by`, `dest_table`, `full_refresh`, `incremental_strategy`, `loader_file_size`, `partition_by`, `staging_bucket`. |
| `MAX_TABLE_NESTING must be an integer, got 'foo'`                 | `max_table_nesting` was set to a non-numeric string                                                                  | Use `"0"`, `"1"`, or `"2"` (string-typed integers).                                                                                                                                              |
| `Merge operation requires primary keys for table`                 | `incremental_strategy: merge` without a `primary_key` set on the source                                              | Set `primary_key` on `source.options`, or switch to `replace` / `append`.                                                                                                                        |
| Run extracts the full table every time despite `incremental_key`  | The source column is not actually monotonic, or `full_refresh: true` was left set, or pipeline state was reset       | Confirm the column is truly monotonic; remove `full_refresh: true` from steady-state runs.                                                                                                       |
| BigQuery jobs bill far more bytes than expected                   | `incremental_key` is not on a partition column, or the table is not partitioned, so the engine scans the whole table | Move `incremental_key` to the partition column (typically `_PARTITIONTIME` or an explicit DATE column).                                                                                          |
| Destination commit succeeds but data appears duplicated           | `incremental_strategy: append` with overlapping `interval_start`/`interval_end` runs                                 | Use `merge` for mutable rows; do not overlap windows in `append` runs.                                                                                                                           |
| Snowflake / Redshift / BigQuery loads are slow despite small data | `loader_file_format` defaulted to a non-parquet format, or `staging_bucket` is unset                                 | Set `loader_file_format: parquet`; for very large loads, set `staging_bucket: gs://...` (BigQuery) or the equivalent for Snowflake/Redshift.                                                     |

## Stack-schema reference

Validation rules for `source.options` and `sink.options`:

| Section          | Strictness                                                              | Allowed keys                                                                                                                                                                                                                                                                 |
| ---------------- | ----------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `source.options` | `additionalProperties: true` (extras silently ignored if not forwarded) | `source_table`, `primary_key`, `incremental_key`, `interval_start`, `interval_end`, `type_hints`, `mask`, `max_table_nesting`, `page_size`, `extract_parallelism`, `sql_limit`, `sql_exclude_columns`, `sql_reflection_level`, `yield_limit`, plus connector-specific extras |
| `sink.options`   | `additionalProperties: false` (extras rejected at YAML validation)      | `dest_table`, `incremental_strategy`, `loader_file_size`, `partition_by`, `cluster_by`, `staging_bucket`, `full_refresh`, `aws_region`, `aws_endpoint`                                                                                                                       |

`loader_file_format` is supplied at the **runtime / CLI level** (env var `LOADER_FILE_FORMAT` or pipeline runtime spec), not inside `sink.options`. The same is true for `LOAD__WORKERS`, `NORMALIZE__WORKERS`, and `LOAD__PARALLELISM_STRATEGY`.


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter, and the optional `goal` query parameter:

```
GET https://v2.dataos.info/concepts/resources/nilus/pipeline-optimization/optimize-sink-datasets/optimize-destination-gotchas-and-troubleshooting.md?ask=<question>&goal=<endgoal>
```

`ask` is the immediate question: it should be specific, self-contained, and written in natural language.
`goal` is optional and describes the broader end goal you are ultimately trying to accomplish on behalf of the user. GitBook uses it to tailor the answer towards what is most useful for that goal.

The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.