> For the complete documentation index, see [llms.txt](https://v2.dataos.info/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://v2.dataos.info/concepts/resources/nilus/cdc/cdc-sources/mongodb.md).

# MongoDB

MongoDB is supported as a CDC source for capturing inserts, updates, and deletes at document granularity. Nilus's CDC engine reads the replica set's change stream (built on top of the oplog) and translates each operation into a structured change event. Nilus persists the resume token through its REST-backed offset store so a restart picks up exactly where the previous run left off.

Nilus connects to MongoDB through a DataOS depot resolved with pipeline `type: cdc`. The depot keeps connection string, TLS material, and credentials outside the pipeline manifest.

## Connectivity

Nilus expects a replica-set or sharded-cluster connection string. Standalone servers are not supported because the change stream API requires a replica set.

| Form         | Address example           | Notes                                                                |
| ------------ | ------------------------- | -------------------------------------------------------------------- |
| DataOS depot | `dataos://my-mongo-depot` | Connection string, TLS, and credentials are resolved from the depot. |

Defaults applied by Nilus:

* `topic.prefix = cdc` (override per pipeline; the prefix is also used as the sink table prefix)
* Each change event is emitted with the new document state and timestamp metadata; deletes preserve the original `_id`.

### Depot-backed setup

Define a MongoDB **depot** that lists the replica-set members and a **secret** that holds the connection credentials. Nilus resolves the connection string, TLS material, and credentials from the depot at runtime.

**Secret:** the MongoDB user credentials:

```yaml
name: nilusmongocdcsecret
version: v2alpha
type: secret
description: MongoDB credentials for the CDC depot
layer: user
secret:
  type: key-value
  data:
    username: "<mongo-username>"
    password: "<mongo-password>"
```

**Depot:** replica-set members under `spec.nodes`, driver options under `spec.params`:

```yaml
name: nilusmongocdcdepot
version: v2alpha
type: depot
tags:
  - mongodb
description: MongoDB CDC source depot
spec:
  type: mongodb
  external: true
  spec:
    subprotocol: mongodb
    nodes:
      - mongodb-replicas.example.com:27017
      - mongodb-replicas.example.com:27018
      - mongodb-replicas.example.com:27019
    params:
      authSource: admin
      replicaSet: rs0
  secrets:
    - id: "<workspace>:nilusmongocdcsecret"
      purpose: rw
```

The pipeline then references only the depot: `address: dataos://nilusmongocdcdepot?purpose=rw`. For TLS clusters, Nilus maps the depot's TLS params and secret material onto the underlying CDC connector properties automatically (see [Behavior and capabilities](#behavior-and-capabilities)).

## Requirements

> *These requirements must be satisfied at the database before Nilus can run a MongoDB CDC pipeline.*

### Replica set or sharded cluster

* The MongoDB deployment must be a replica set, even if it has a single voting member. Standalone servers are not supported.
* Nilus reads from the change stream of the primary; a sharded cluster is supported and the connector follows shard primaries automatically.
* The hostnames and ports declared in the replica set configuration must match the ones reachable from Nilus. If they differ (for example because of a NAT boundary), reconfigure the replica set or use SRV records that resolve correctly from inside the Nilus runtime.

### Privileges and roles

The MongoDB user that Nilus connects as must have:

* Read access to every database and collection that the pipeline captures.
* Read access to the `local` database (oplog sits at `local.oplog.rs`) and to the `admin` database for `hello`/`replSetGetStatus`/`buildInfo` health calls.
* Read access to `config.shards` if the deployment is sharded.

A minimal role grant looks like this:

```javascript
db.getSiblingDB("admin").createUser({
  user: "nilus_cdc",
  pwd:  "<strong-password>",
  roles: [
    { role: "read",            db: "your_app_db" },
    { role: "read",            db: "local" },
    { role: "read",            db: "config" },
    { role: "readAnyDatabase", db: "admin" },
    { role: "clusterMonitor",  db: "admin" }
  ]
});
```

### TLS and authentication

* If TLS is required, the CA / truststore / keystore material referenced by the URI or depot must be available to the Nilus runtime.
* For SCRAM authentication, set `authSource` (typically `admin`) on the URI or pass it through the depot.
* For X.509 authentication, confirm the client certificate is available to the JVM and that `db.runCommand({connectionStatus: 1})` returns the expected `authenticatedUsers`.

### Pre-image support (optional but recommended)

If your downstream needs the *before* state of an updated document, enable the pre-image feature on the captured collection:

```javascript
db.runCommand({
  collMod: "<collection>",
  changeStreamPreAndPostImages: { enabled: true }
});
```

Without pre-images, change events still include the new document but the `before` field is `null` for updates.

> ⚠️ The MongoDB host used in the Nilus pipeline must be the same one declared in the replica set initiation configuration. A mismatch causes the connector to fail discovery silently and stop emitting events.

## Core concepts

1. **Replica set required**
   * MongoDB CDC reads from the change stream, which is layered on top of the replica set oplog.
   * Standalone servers do not have an oplog and are not supported.
   * In a sharded cluster Nilus follows each shard's primary independently.
2. **Change stream resume token**
   * Each event carries an opaque resume token that uniquely identifies its position in the change stream.
   * Nilus persists the token through its REST offset store, so a restart resumes from the last acknowledged position rather than re-reading the oplog from scratch.
   * If the resume token is older than the oldest available oplog entry, the connector falls back to a fresh snapshot of the captured collections.
3. **Schema-less documents, dynamic schema in the sink**
   * MongoDB collections are schema-less, so Nilus infers the sink schema from the documents it reads.
   * The sink table is created from the schema of the first document seen for that collection. New fields surface in subsequent runs through schema evolution.
   * The default `strategy = flatten` produces row-shaped records suitable for analytical sinks; `strategy = changelog` preserves the full CDC envelope.
4. **Oplog retention vs. connector lag**
   * The oplog is a capped collection. While the connector is running, MongoDB only retains entries up to its oldest connected reader.
   * If the connector pauses, lags, or is stopped for long enough that its required oplog entries roll off, the next run is forced to take a fresh snapshot.
   * Size the oplog (`--oplogSize`, or `replSetResizeOplog`) so that worst-case downtime fits inside the retained window.
5. **Snapshot and stream phases**
   * On first start, Nilus snapshots every captured collection so the sink starts from a consistent baseline.
   * After the snapshot completes, the connector switches to streaming mode and tails the change stream.
   * On restart, the previous resume token is reused so the snapshot phase is skipped.

> Operational deep dives:
>
> * [Handling MongoDB Error 286 in CDC Pipelines](/concepts/resources/nilus/troubleshooting/mongodb-cdc-error-286.md)
> * [Handling MongoDB Warning BufferingChangeStreamCursor in CDC Pipelines](/concepts/resources/nilus/troubleshooting/mongodb-cdc-bufferingchangestreamcursor.md)

### Change event shape

```json
{
  "before": null,
  "after": "{\"_id\":\"66f...\",\"name\":\"acme\",\"updated_at\":1700000000000}",
  "patch": null,
  "filter": null,
  "source": {
    "version": "3.4.2.Final",
    "connector": "mongodb",
    "name": "cdc",
    "ts_ms": 1700000000123,
    "snapshot": "false",
    "db": "retail",
    "rs": "rs0",
    "collection": "products",
    "ord": 1
  },
  "op": "c",
  "ts_ms": 1700000000234
}
```

`op` values: `r` (read during snapshot), `c` (insert), `u` (update), `d` (delete). With `transforms.unwrap` the envelope is flattened to the document body and `_op`, `_collection`, `_db`, and `_source.ts_ms` are appended as columns.

## Sample Nilus config

The following example uses a DataOS depot for connectivity and writes change events to a Lakehouse sink in `flatten` mode (one row per change).

<details>

<summary>Nilus YAML, MongoDB CDC → Lakehouse</summary>

```yaml
name: ncdc-mongodb-to-lakehouse
version: v1alpha
type: nilus
tags:
  - nilus-cdc
description: Nilus CDC pipeline for MongoDB → DataOS Lakehouse
spec:
  type: cdc
  compute: query-default
  logLevel: INFO
  source:
    address: dataos://prod-mongo-depot
    options:
      strategy: flatten
      max_table_nesting: "0"
    cdc:
      collection.include.list: retail.products,retail.orders
      topic.prefix: retail_cdc
      transforms.unwrap.array.encoding: array
      heartbeat.interval.ms: 60000
      max.batch.size: 2048
      max.queue.size: 8192
  sink:
    address: dataos://analytics-lakehouse
    options:
      dest_table: retail_cdc_products
      incremental_strategy: append
```

</details>

The depot referenced as `dataos://my-mongo-depot` resolves connection details (replica-set members, TLS, credentials) from its DataOS definition; no inline secret projection is needed in the manifest above.

## CDC options

These are the most useful CDC connector properties for MongoDB. They go under `source.cdc` in the Nilus manifest.

| Option                             | Default                      | Description                                                                                                                                                                                                                                                         |
| ---------------------------------- | ---------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `collection.include.list`          | *No default*                 | Comma-separated list of regular expressions or literal `database.collection` identifiers to capture. When set, every other collection is ignored.                                                                                                                   |
| `collection.exclude.list`          | *No default*                 | Inverse of `collection.include.list`. Mutually exclusive with the include list.                                                                                                                                                                                     |
| `database.include.list`            | *No default*                 | Optional database-level filter. Useful when capturing every collection in a database.                                                                                                                                                                               |
| `database.exclude.list`            | *No default*                 | Inverse of `database.include.list`.                                                                                                                                                                                                                                 |
| `field.exclude.list`               | *No default*                 | Comma-separated list of fully qualified field names (`db.coll.field.subfield`) to drop from change events. Wildcards are supported in `db` and `coll`.                                                                                                              |
| `topic.prefix`                     | `cdc` *(Nilus default)*      | Logical namespace for this pipeline. Must be unique across CDC pipelines. Nilus also uses it as the sink table prefix.                                                                                                                                              |
| `snapshot.mode`                    | `initial`                    | Snapshot behavior on first start. `initial` snapshots only when no offset is recorded; `always` re-snapshots every start; `no_data` snapshots structure only; `initial_only` snapshots and stops; `when_needed` snapshots when the resume token is no longer valid. |
| `capture.mode`                     | `change_streams_update_full` | Source of update events. Options include `change_streams`, `change_streams_update_full`, `change_streams_with_pre_image`, and `change_streams_update_full_with_pre_image`. The `pre_image` modes require pre-image support enabled on the collection.               |
| `heartbeat.interval.ms`            | `0` (off)                    | Periodic heartbeats keep the resume token advancing on idle pipelines. Recommended: `60000`.                                                                                                                                                                        |
| `max.batch.size`                   | `2048`                       | Maximum number of change events processed per batch.                                                                                                                                                                                                                |
| `max.queue.size`                   | `8192`                       | Maximum buffered change events before back-pressure kicks in.                                                                                                                                                                                                       |
| `transforms.unwrap.array.encoding` | `document`                   | `document` keeps arrays as nested documents; `array` flattens arrays into structured columns. Use `array` when writing to a tabular sink.                                                                                                                           |
| `mongodb.ssl.enabled`              | `false`                      | Enables TLS toward MongoDB. Combine with `mongodb.ssl.invalid.hostname.allowed` for self-signed environments.                                                                                                                                                       |
| `mongodb.poll.interval.ms`         | `30000`                      | How often the connector checks the cluster topology (replica set / shard membership).                                                                                                                                                                               |

## Behavior and capabilities

* **Pipeline mode**: MongoDB CDC runs as a `cdc` source.
* Treat `topic.prefix` as a stable identity. Once a pipeline is in production, changing it forces a fresh snapshot.
* Always size the oplog so that planned maintenance, weekend stalls, and worst-case lag fit inside its retention window.
* Enable pre-image support before turning on `*_with_pre_image` capture modes; otherwise updates surface with `before = null`.
* Prefer narrow `collection.include.list` patterns over broad database captures so unrelated collections do not snapshot.
* Use `heartbeat.interval.ms = 60000` (or smaller) for low-traffic collections so the resume token still advances.
* For DataOS depots, Nilus maps Mongo TLS and Java SSL settings from depot params and secrets into the underlying CDC connector properties automatically, no extra `java_options` block is needed for typical deployments.

## Troubleshooting

* **Symptom:** No events arrive even though documents are changing.
  * **Cause:** Connector configured against a standalone server, replica set members unreachable, or `collection.include.list` does not match any namespace.
  * **Recovery:** Confirm `db.hello().setName` returns the replica set name, validate connectivity to every replica set member from the Nilus runtime, and double-check the include list against `db.adminCommand({ listCollections: 1 })`.
* **Symptom:** Pipeline starts a fresh snapshot every restart.
  * **Cause:** The persisted resume token is older than the oldest oplog entry, usually because the oplog is too small for the connector lag.
  * **Recovery:** Resize the oplog with `replSetResizeOplog`, then restart. Consider raising `heartbeat.interval.ms` cadence and lowering `max.batch.size` so the resume token advances more frequently on low-traffic collections.
* **Symptom:** `BufferingChangeStreamCursor: Unable to acquire buffer lock, buffer queue is likely full` warnings repeating in the log.
  * **Cause:** Sink writes are slower than the change stream read rate; the buffer fills before the connector flushes a batch.
  * **Recovery:** Tune `max.batch.size`, `max.queue.size`, and the sink concurrency. See [Handling MongoDB Warning BufferingChangeStreamCursor in CDC Pipelines](/concepts/resources/nilus/troubleshooting/mongodb-cdc-bufferingchangestreamcursor.md).
* **Symptom:** `Error 286` (oplog history rolled off) on restart.
  * **Cause:** The connector was offline long enough that the resume token's oplog window expired.
  * **Recovery:** Allow Nilus to take a fresh snapshot, then resize the oplog to accommodate downtime going forward. See [Handling MongoDB Error 286 in CDC Pipelines](/concepts/resources/nilus/troubleshooting/mongodb-cdc-error-286.md).
* **Symptom:** Updates show `before = null` in the change events.
  * **Cause:** Pre-image support is not enabled on the source collection.
  * **Recovery:** Enable pre-images with `collMod` (see Requirements above), then redeploy the pipeline with `capture.mode` set to a `*_with_pre_image` variant.

## Related docs

* [Understanding CDC Pipeline Config](/concepts/resources/nilus/cdc/service-config.md)
* [CDC Sample Configs](/concepts/resources/nilus/cdc/sample-configs.md)
* [Understanding Change Data Capture](/concepts/resources/nilus/cdc.md)
* **DataOS Lakehouse destinations**: see the [AWS-backed DataOS Lakehouse](/concepts/resources/nilus/destinations/dataos-lakehouse/aws-backed.md), [Azure-backed DataOS Lakehouse](/concepts/resources/nilus/destinations/dataos-lakehouse/azure-backed.md), or [GCP-backed DataOS Lakehouse](/concepts/resources/nilus/destinations/dataos-lakehouse/gcp-backed.md) variants for writing CDC change events into a Lakehouse.
* [MongoDB (Batch)](/concepts/resources/nilus/batch/batch-sources/mongodb.md)


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter, and the optional `goal` query parameter:

```
GET https://v2.dataos.info/concepts/resources/nilus/cdc/cdc-sources/mongodb.md?ask=<question>&goal=<endgoal>
```

`ask` is the immediate question: it should be specific, self-contained, and written in natural language.
`goal` is optional and describes the broader end goal you are ultimately trying to accomplish on behalf of the user. GitBook uses it to tailor the answer towards what is most useful for that goal.

The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
