# Pipeline Config

> Nilus batch pipelines are authored as `type: nilus` resources with `spec.type: batch`.

This page describes the current authoring shape for batch pipelines: author a `nilus` resource and keep the operational intent under `spec`.

<details>

<summary>Example Config YAML</summary>

```yaml
name: nilus-batch-example
version: v1alpha
type: nilus
tags:
  - workflow
  - nilus-batch
description: Batch sync from PostgreSQL into a DataOS depot
spec:
  type: batch
  compute: universe-compute
  schedule:
    crons:
      - "30 11 * * *"
    concurrencyPolicy: Forbid
  resources:
    requests:
      cpu: "200m"
      memory: "256Mi"
    limits:
      cpu: "1000m"
      memory: "2Gi"
  use:
    projection:
      secrets:
        - id: engineering:niluspgsecret
          contextAlias: pgsecret
      projections:
        envVars:
          - key: PG_USERNAME
            template: "{{ secrets['pgsecret'].username | base64_decode }}"
          - key: PG_PASSWORD
            template: "{{ secrets['pgsecret'].password | base64_decode }}"
  source:
    address: postgresql://{PG_USERNAME}:{PG_PASSWORD}@postgres.example.com:5432/postgres
    options:
      source_table: public.customer
  sink:
    address: dataos://niluspgdepot?purpose=rw
    options:
      dest_table: testing_nilus.customer
      incremental_strategy: replace
```

</details>

## Configuration elements

Fields are grouped below by function.

### 1. Metadata

| Field         | Description                                           |
| ------------- | ----------------------------------------------------- |
| `name`        | Unique pipeline name.                                 |
| `version`     | Use `v1alpha` for the current Nilus resource shape.   |
| `type`        | Must be `nilus`.                                      |
| `tags`        | Optional labels for search, grouping, and operations. |
| `description` | Optional human-readable summary.                      |

### 2. Nilus spec

The `spec` block defines the batch pipeline contract that Nilus validates and then renders into an executable workflow.

| Field       | Required              | Description                                                           |
| ----------- | --------------------- | --------------------------------------------------------------------- |
| `compute`   | Yes                   | Compute profile used to run the batch workload.                       |
| `resources` | No                    | Optional CPU and memory requests or limits.                           |
| `logLevel`  | No                    | Optional log level such as `DEBUG`, `INFO`, `WARNING`, or `ERROR`.    |
| `repo`      | No                    | Optional repository block for repo-backed connector extensions.       |
| `sink`      | Yes for data movement | Destination definition for batch pipelines that write extracted data. |

### 3. Schedule and runtime controls

Use `schedule` only when the batch pipeline should run on a recurring cadence.

```yaml
spec:
  schedule:
    crons:
      - "30 11 * * *"
    timezone: UTC
    endOn: "2026-12-31T23:59:59Z"
    concurrencyPolicy: Forbid
```

* `crons` is an array, not a single `cron` string.
* If `schedule` is omitted, the batch pipeline has no recurring cadence and must be triggered manually.
* `concurrencyPolicy` controls overlap handling for scheduled runs.
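
Because `crons` is an array, a single pipeline can carry more than one cadence. The fragment below is a minimal sketch of that shape; the two cron expressions are illustrative and not taken from a real deployment.

```yaml
spec:
  schedule:
    crons:
      - "0 6 * * *"    # daily at 06:00
      - "0 18 * * *"   # daily at 18:00
    timezone: UTC
    concurrencyPolicy: Forbid
```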

Runtime settings such as compute, log level, and resource sizing stay under `spec`:

```yaml
spec:
  compute: universe-compute
  logLevel: INFO
  resources:
    requests:
      cpu: "200m"
      memory: "256Mi"
    limits:
      cpu: "1000m"
      memory: "2Gi"
```

### 4. Persistent volumes and restricted runtimes

For large Lakehouse (Iceberg) loads, the Nilus runtime stages intermediate Parquet files on the local filesystem before committing them as an Iceberg snapshot. By default those files land on the pod's ephemeral filesystem, which is small, shared with the OS, and discarded when the pod restarts. For wide or high-volume tables this can cause disk exhaustion and pod eviction mid-run.

A DataOS Volume gives the pipeline durable, independently-sized scratch space decoupled from the pod's ephemeral disk. Attach one by declaring it under `spec.use.volumes` and pointing `DATAOS_PERSISTENT_DIR` at the same mount path:

```yaml
spec:
  use:
    volumes:
      - id: <volume-id>
        directory: /var/dataos/public/nilus_scratch
        readOnly: false
    projection:
      projections:
        envVars:
          - key: DATAOS_PERSISTENT_DIR
            template: "/var/dataos/public/nilus_scratch"
```

| Field                   | Required        | Description                                                                                   |
| ----------------------- | --------------- | --------------------------------------------------------------------------------------------- |
| `id`                    | Yes             | ID of an existing DataOS Volume resource.                                                     |
| `directory`             | Yes             | In-container mount path for the volume.                                                       |
| `readOnly`              | No              | Defaults to `false`. Use `true` only for read-only reference mounts, never for scratch space. |
| `DATAOS_PERSISTENT_DIR` | Yes (for spill) | Must match `directory` so staging files are written to the volume.                            |

Size the volume to at least a few times `DATA_WRITER__FILE_MAX_BYTES` multiplied by `LOAD__WORKERS`, so that all concurrent file writers have room. See [Tuning Large Lakehouse (Iceberg) Loads](/concepts/resources/nilus/pipeline-optimization/optimize-sink-datasets/optimize-lakehouse-iceberg-loads.md) for benchmark-backed configurations.
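
As a rough worked example of the sizing rule, the fragment below assumes both settings are supplied as environment variables through the same `envVars` projection used for `DATAOS_PERSISTENT_DIR`. That delivery mechanism, the 512 MiB file cap, the 4 workers, and the 3x headroom factor are all illustrative assumptions, not documented defaults.

```yaml
spec:
  use:
    projection:
      projections:
        envVars:
          # Assumed values, for illustration only.
          - key: DATA_WRITER__FILE_MAX_BYTES
            template: "536870912"   # 512 MiB per staged file
          - key: LOAD__WORKERS
            template: "4"           # 4 concurrent file writers
# Working set:      512 MiB x 4 workers = 2 GiB
# With 3x headroom: 3 x 2 GiB           = 6 GiB minimum volume size
```

Treat the result as a floor rather than a target, and check it against the benchmark-backed configurations on the tuning page.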

### 5. Secrets and repo settings

For direct connector URIs, project secrets under `spec.use.projection` and reference them through `{ENV_VAR}` placeholders in `source.address` or `sink.address`.

```yaml
spec:
  use:
    projection:
      secrets:
        - id: engineering:niluspgsecret
          contextAlias: pgsecret
      projections:
        envVars:
          - key: PG_USERNAME
            template: "{{ secrets['pgsecret'].username | base64_decode }}"
```

Use `spec.repo` only when the pipeline depends on a repo-backed extension:

```yaml
spec:
  repo:
    url: https://bitbucket.org/org/custom-connectors
    baseDir: connectors
    secretId: engineering:bitbucketsecret
```

### 6. Source

The `source` block defines where Nilus reads from. Use a direct connector URI or a `dataos://` depot address.

```yaml
spec:
  source:
    address: postgresql://{PG_USERNAME}:{PG_PASSWORD}@postgres.example.com:5432/postgres
    options:
      source_table: public.customer
      incremental_key: updated_at
```

#### Source fields

| Field     | Required | Description                                   |
| --------- | -------- | --------------------------------------------- |
| `address` | Yes      | Connector URI or `dataos://` depot reference. |

#### `source.options`

These are the batch-oriented `source.options` keys currently consumed by the Nilus runtime. Some are broadly useful across SQL-style batch connectors, while others are connector-specific tuning knobs.

| Option                | Required | What it does                                                                                                                                                                                                                                                                                                                                                              | Typical shape                                    |
| --------------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------ |
| `source_table`        | Yes      | Logical source object. Depending on the connector, this may be a SQL table, a file selector, a sheet name, an API object, or a report definition.                                                                                                                                                                                                                         | `public.customer`                                |
| `primary_key`         | No       | Logical row identifier used mainly with merge-style loads and downstream key-aware behavior.                                                                                                                                                                                                                                                                              | `id`                                             |
| `incremental_key`     | No       | Cursor field used for incremental extraction on connectors that support bounded or ordered reads.                                                                                                                                                                                                                                                                         | `updated_at`                                     |
| `interval_start`      | No       | Optional lower bound for bounded incremental extraction.                                                                                                                                                                                                                                                                                                                  | `"2026-01-01T00:00:00Z"`                         |
| `interval_end`        | No       | Optional upper bound for bounded incremental extraction.                                                                                                                                                                                                                                                                                                                  | `"2026-01-31T23:59:59Z"`                         |
| `type_hints`          | No       | Column type overrides that Nilus maps into the underlying column-hints contract.                                                                                                                                                                                                                                                                                          | `{ amount: decimal, created_at: timestamp }`     |
| `mask`                | No       | Column masking rules. Nilus expects an object whose values are algorithm strings such as `hash`, `redact`, or `partial:3`.                                                                                                                                                                                                                                                | `{ email: hash, phone: partial:3 }`              |
| `extract_parallelism` | No       | Number of extraction jobs to run in parallel for supported connectors.                                                                                                                                                                                                                                                                                                    | `5`                                              |
| `page_size`           | No       | Rows or items fetched per extraction page for paged connectors.                                                                                                                                                                                                                                                                                                           | `50000`                                          |
| `sql_limit`           | No       | Extraction row limit. This is mainly useful for testing, sampling, and EDA rather than production loads.                                                                                                                                                                                                                                                                  | `5000`                                           |
| `sql_exclude_columns` | No       | Columns to skip during extraction. The current runtime expects a comma-separated string.                                                                                                                                                                                                                                                                                  | `debug_payload,raw_html`                         |
| `yield_limit`         | No       | Caps how many extraction pages or batches the source yields before the run stops.                                                                                                                                                                                                                                                                                         | `10`                                             |
| `max_table_nesting`   | No       | Schema hint that caps how deeply nested fields are flattened. Nodes deeper than this level are loaded as a struct or JSON column instead of being expanded into their own table. Relevant for sources that emit nested records (MongoDB, JSON file stores, SaaS APIs with nested payloads). `0` keeps the runtime default and is appropriate for flat relational sources. | `0` (default): increase to flatten nested fields |

#### Source option notes

* Start with `source_table` only, then add `incremental_key` and `primary_key` when the connector and load strategy actually need them.
* `type_hints` and `mask` are object-valued options. Nilus turns them into the lower-level runtime contract for you.
* `extract_parallelism`, `page_size`, `sql_limit`, `sql_exclude_columns`, `yield_limit`, and `max_table_nesting` are best treated as tuning knobs, not mandatory pipeline fields.
* `max_table_nesting` only applies to sources that produce nested records. For flat relational sources, leave it at the default.
* Connector pages remain the source of truth for connector-specific `source_table` syntax and any additional connector-only options.
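
The fragment below sketches how several of these options might be combined, reusing the typical shapes from the table. The column names (`amount`, `created_at`, `email`, `phone`, `debug_payload`, `raw_html`) are hypothetical, and whether a given connector honors every option is an assumption to verify on its connector page.

```yaml
spec:
  source:
    address: postgresql://{PG_USERNAME}:{PG_PASSWORD}@postgres.example.com:5432/postgres
    options:
      source_table: public.customer
      primary_key: id
      incremental_key: updated_at
      # Optional bounds for a bounded incremental extraction
      interval_start: "2026-01-01T00:00:00Z"
      interval_end: "2026-01-31T23:59:59Z"
      # Object-valued options: column name -> type or masking algorithm
      type_hints:
        amount: decimal
        created_at: timestamp
      mask:
        email: hash
        phone: "partial:3"
      # Comma-separated string, not a YAML list
      sql_exclude_columns: debug_payload,raw_html
```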

#### Direct URI vs depot address

* Use a direct connector URI when you need to assemble the connection explicitly and provide credentials through `spec.use.projection`.
* Use `dataos://<depot>?purpose=<ro|rw>` when the connection should be inferred from a depot.
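
For comparison with the direct-URI examples on this page, here is a minimal sketch of a depot-backed source. `<source-depot>` is a placeholder for an existing depot, and using `purpose=ro` on the read side is an assumption based on the `<ro|rw>` pattern above.

```yaml
spec:
  source:
    # The depot supplies the connection, so no spec.use.projection block is shown.
    address: dataos://<source-depot>?purpose=ro
    options:
      source_table: public.customer
  sink:
    address: dataos://niluspgdepot?purpose=rw
    options:
      dest_table: testing_nilus.customer
      incremental_strategy: replace
```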

### 7. Sink

The `sink` block defines where Nilus writes the batch output.

```yaml
spec:
  sink:
    address: dataos://niluspgdepot?purpose=rw
    options:
      dest_table: testing_nilus.customer
      incremental_strategy: replace
```

#### Sink fields

| Field     | Required | Description                                     |
| --------- | -------- | ----------------------------------------------- |
| `address` | Yes      | Destination URI or `dataos://` depot reference. |

#### `sink.options`

These are the commonly used batch `sink.options` keys currently wired through the Nilus runtime.

| Option                 | Required | What it does                                                                                                                                           | Typical shape              |
| ---------------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ | -------------------------- |
| `dest_table`           | Yes      | Target object name in the destination.                                                                                                                 | `analytics.customer`       |
| `incremental_strategy` | Yes      | Write behavior across repeated runs. On the stable Nilus authoring surface, use `replace`, `append`, or `merge`.                                       | `replace`                  |
| `partition_by`         | No       | Destination layout hint for partitioned sinks. This is most relevant for lakehouse-style destinations and should follow the destination page examples. | structured partition rules |
| `cluster_by`           | No       | Optional clustering hint for destinations that implement clustering or sort-like organization.                                                         | `customer_id`              |
| `full_refresh`         | No       | Requests a full refresh of the underlying pipeline resource state before loading.                                                                      | `true`                     |
| `loader_file_size`     | No       | Controls how many rows Nilus writes per output file or loader batch.                                                                                   | `100000`                   |

#### Sink option notes

* `incremental_strategy` belongs under `sink.options`, even though it controls cross-run behavior for the whole pipeline.
* `partition_by` and `cluster_by` are destination-sensitive. Use them only when the destination page explicitly supports them and the access pattern justifies them.
* `full_refresh` and `incremental_strategy: replace` are not identical. `replace` governs write disposition, while `full_refresh` also resets the underlying pipeline resource state for the run.
* `loader_file_size` is a performance and file-layout knob. Start with the runtime default and tune gradually.

Destination-specific options such as region overrides, staging locations, loader format overrides, or warehouse-native settings should still be documented in the corresponding destination page.
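
The sketch below shows a merge-style load under the assumption that the destination supports `merge`; confirm this on the destination page. It pairs `primary_key` and `incremental_key` on the source side with `incremental_strategy: merge` on the sink side, and leaves `full_refresh` commented out to underline that it is a separate control.

```yaml
spec:
  source:
    address: postgresql://{PG_USERNAME}:{PG_PASSWORD}@postgres.example.com:5432/postgres
    options:
      source_table: public.customer
      primary_key: id               # row identifier used by merge-style loads
      incremental_key: updated_at   # cursor for incremental extraction
  sink:
    address: dataos://niluspgdepot?purpose=rw
    options:
      dest_table: testing_nilus.customer
      incremental_strategy: merge
      # full_refresh: true   # enable only to reset pipeline state before loading
```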

### 8. Execution model

Nilus validates the `spec` block and renders it into an executable workflow, so the authoring consequences are:

* Author the `nilus` resource in `v1alpha`
* Keep the operational intent in `spec`
* Treat this page as the primary reference for batch pipeline configuration

## Validation notes

* Referenced resources are validated at apply time. `dataos-ctl resource apply` fails fast if the `compute`, source depot, or sink depot does not exist in the tenant, with an error like `invalid resource dependency, not found: depot:<name>`. Create the dependency before applying. See [Common Errors](/concepts/resources/nilus/troubleshooting/common-errors.md).
* Use `type: nilus` with `spec.type: batch` for batch pipeline authoring.
* Put recurring schedule settings under `spec.schedule.crons`, not in a legacy top-level workflow schedule block.
* Use `spec.use.projection` for direct connector credentials; use `dataos://...purpose=` when a depot should supply credentials automatically.
* Keep `source_table` semantics connector-aware: `schema.table` for SQL sources, named objects for SaaS sources, and resource-specific identifiers for connectors such as sheets or streams.
* Treat `extract_parallelism`, `page_size`, `sql_limit`, `sql_exclude_columns`, `yield_limit`, `max_table_nesting`, `partition_by`, `cluster_by`, `full_refresh`, and `loader_file_size` as optional tuning knobs, not baseline required fields.
* The Nilus runtime accepts some additional internal values and destination-specific options, but this page documents the stable batch pipeline authoring surface users should rely on by default.

## Related docs

* [Batch Sample Configs](/concepts/resources/nilus/batch/sample-configs.md)
* [Secrets and Projections](/concepts/resources/nilus/concepts/secrets-and-projections.md)
* [Understanding CDC Pipeline Config](/concepts/resources/nilus/cdc/service-config.md)
* [Understanding Stream Pipeline Config](/concepts/resources/nilus/stream.md)

