
# Exposed Prometheus Metrics

Nilus exposes runtime and manager metrics that help you answer five questions: is the pipeline running, is it slow, did it process records, is resource use healthy, and is the manager healthy?

## Is the pipeline running?

| Metric            | Type  | Labels        | Use it for                                                                  |
| ----------------- | ----- | ------------- | --------------------------------------------------------------------------- |
| `pipeline_status` | Gauge | `resource_id` | Latest run outcome. `1` indicates success and `0` indicates failure.        |
| `pipeline_start`  | Gauge | `resource_id` | Pipeline start timestamp. Use it to identify when the latest run began.     |
| `pipeline_end`    | Gauge | `resource_id` | Pipeline end timestamp. Use it to identify when the latest run completed.   |
| `nilus_start`     | Gauge | `resource_id` | Runtime start timestamp. Useful when investigating restarts.                |
| `nilus_end`       | Gauge | `resource_id` | Runtime end timestamp. Useful when confirming whether a run exited cleanly. |

## Is it slow?

| Metric                  | Type    | Labels | Use it for                                                                                                |
| ----------------------- | ------- | ------ | --------------------------------------------------------------------------------------------------------- |
| `duration_sec`          | Counter | none   | Total pipeline duration in seconds. Compare this against similar input volumes and schedule windows.      |
| `step_duration_seconds` | Counter | `step` | Stage-level duration. Use it to identify whether extraction, normalization, or loading is the bottleneck. |

Prometheus client libraries may expose counter series with a `_total` suffix (for example, `duration_sec_total` instead of `duration_sec`). When building alerts or dashboards, confirm the exact series names available in your Prometheus target.
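
The suffix check can be automated when you generate dashboards or alert rules. The following is a minimal sketch, not part of Nilus: the helper names `candidate_series_names` and `resolve_series_name` are hypothetical, and the `available` set is assumed to be the list of series names you collected from your own Prometheus target.

```python
from typing import Iterable, List, Optional

SUFFIX = "_total"


def candidate_series_names(metric: str) -> List[str]:
    """Return the documented name first, then its `_total` variant (or the bare form)."""
    if metric.endswith(SUFFIX):
        return [metric, metric[: -len(SUFFIX)]]
    return [metric, metric + SUFFIX]


def resolve_series_name(metric: str, available: Iterable[str]) -> Optional[str]:
    """Pick whichever candidate name is actually exposed, or None if neither is."""
    names = set(available)
    for candidate in candidate_series_names(metric):
        if candidate in names:
            return candidate
    return None


# Hypothetical series names, as they might appear on a target.
exposed = {"duration_sec_total", "pipeline_status"}
print(resolve_series_name("duration_sec", exposed))
```

Returning `None` rather than guessing keeps a missing series visible, so a dashboard generator can fail early instead of silently querying a name that does not exist.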

## Did it process records?

| Metric                   | Type    | Labels          | Use it for                                                                                                  |
| ------------------------ | ------- | --------------- | ----------------------------------------------------------------------------------------------------------- |
| `records_processed`      | Counter | none            | Total records processed during the run. Compare with expected source and destination counts.                |
| `step_records_processed` | Counter | `step`, `table` | Records, files, or jobs processed by stage and table. Use it to identify where volume changed unexpectedly. |

## Is resource use healthy?

| Metric             | Type  | Labels | Use it for                                      |
| ------------------ | ----- | ------ | ----------------------------------------------- |
| `cpu_percent`      | Gauge | none   | Maximum CPU percentage observed during the run. |
| `memory_mb`        | Gauge | none   | Maximum memory usage observed during the run.   |
| `step_cpu_percent` | Gauge | `step` | Stage-level CPU pressure.                       |
| `step_memory_mb`   | Gauge | `step` | Stage-level memory pressure.                    |

High CPU or memory is not automatically a failure. Investigate when resource use grows while throughput falls, or when runs approach configured resource limits.
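
The rule above can be expressed as a small check over two consecutive runs. This is only an illustrative sketch: the `RunSample` type, the `should_investigate` helper, and the 90% headroom threshold are assumptions chosen for the example, not Nilus behavior or defaults. Throughput is derived here as `records_processed / duration_sec`.

```python
from dataclasses import dataclass


@dataclass
class RunSample:
    """Values read from one run's metrics (hypothetical container)."""
    memory_mb: float
    records_processed: float
    duration_sec: float


def throughput(run: RunSample) -> float:
    """Records per second; 0.0 when the duration is missing or zero."""
    if run.duration_sec <= 0:
        return 0.0
    return run.records_processed / run.duration_sec


def should_investigate(prev: RunSample, curr: RunSample,
                       memory_limit_mb: float, headroom: float = 0.9) -> bool:
    """Flag a run that nears its memory limit, or uses more memory for less throughput."""
    near_limit = curr.memory_mb >= headroom * memory_limit_mb
    diverging = (curr.memory_mb > prev.memory_mb
                 and throughput(curr) < throughput(prev))
    return near_limit or diverging


previous = RunSample(memory_mb=500, records_processed=1000, duration_sec=10)
current = RunSample(memory_mb=700, records_processed=800, duration_sec=10)
print(should_investigate(previous, current, memory_limit_mb=2000))
```

The same shape applies to `cpu_percent` or to the `step_*` variants if you compare one stage at a time.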

## Is the manager healthy?

| Metric                          | Type      | Labels                         | Use it for                                            |
| ------------------------------- | --------- | ------------------------------ | ----------------------------------------------------- |
| `http_requests_total`           | Counter   | `method`, `endpoint`, `status` | Request volume and status mix for Nilus Manager APIs. |
| `http_request_duration_seconds` | Histogram | `method`, `endpoint`           | Request latency for Nilus Manager APIs.               |
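
One way to read the status mix from `http_requests_total` is as a server-error ratio. The sketch below assumes the `status` label carries the HTTP status code as a string such as `"200"` or `"503"`; confirm the actual label values on your target. The `server_error_ratio` helper and the sample counts are hypothetical.

```python
from typing import Mapping


def server_error_ratio(counts_by_status: Mapping[str, float]) -> float:
    """Share of requests whose status starts with '5'; 0.0 when there is no traffic."""
    total = sum(counts_by_status.values())
    if total == 0:
        return 0.0
    errors = sum(count for status, count in counts_by_status.items()
                 if status.startswith("5"))
    return errors / total


# Hypothetical request counts per `status` label value over some window.
sample = {"200": 95, "500": 5}
print(server_error_ratio(sample))
```

Feed it counts from a fixed time window rather than raw lifetime counter values, so that old traffic does not mask a recent error spike.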

## Data plane labels

Every Nilus runtime metric (and the Debezium-based CDC metrics) carries a consistent set of data plane labels so you can filter and aggregate across tenants and deployments:

| Label                    | Value                                                                 | Use it for                                                        |
| ------------------------ | --------------------------------------------------------------------- | ----------------------------------------------------------------- |
| `dataplane_name`         | Logical data plane name, e.g. `heliosdev` (not the environment FQDN). | Filtering metrics by data plane in Grafana label selectors.       |
| `dataplane_type`         | Data plane type.                                                      | Grouping by deployment class.                                     |
| `dataplane_network_type` | Data plane network type.                                              | Distinguishing network topologies.                                |
| `dataos_fqdn`            | The environment hostname/FQDN, e.g. `heliosdev-060426.dataos.cloud`.  | Identifying the specific environment behind a logical data plane. |

{% hint style="info" %}
`dataplane_name` now holds the **logical** data plane identifier. Earlier builds populated it with the environment FQDN. If you have Grafana queries or alerts that matched on a hostname-shaped `dataplane_name`, switch them to match `dataos_fqdn` for the FQDN and `dataplane_name` for the logical name.
{% endhint %}
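
If you maintain many queries, the switch described in the hint can be scripted. The sketch below is an assumption-laden illustration: it treats any value containing a dot as hostname-shaped, which is a heuristic rather than a rule defined by Nilus, and the helper names `migrate_dataplane_matcher` and `selector` are hypothetical.

```python
from typing import Dict, Tuple


def migrate_dataplane_matcher(value: str) -> Tuple[str, str]:
    """Route a value once matched on `dataplane_name` to the label that now holds it."""
    # Heuristic (assumption): FQDNs contain dots, logical names do not.
    label = "dataos_fqdn" if "." in value else "dataplane_name"
    return label, value


def selector(metric: str, matchers: Dict[str, str]) -> str:
    """Build a Prometheus selector with equality matchers, sorted for stable output."""
    def escape(text: str) -> str:
        # Escape backslashes and double quotes inside the quoted label value.
        return text.replace("\\", "\\\\").replace('"', '\\"')

    body = ",".join(f'{name}="{escape(val)}"' for name, val in sorted(matchers.items()))
    return f"{metric}{{{body}}}"


label, value = migrate_dataplane_matcher("heliosdev-060426.dataos.cloud")
print(selector("pipeline_status", {label: value}))
```

Review the rewritten selectors by hand before deploying them, since a logical data plane name that happens to contain a dot would be misclassified by this heuristic.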

## Pushgateway and scrape notes

* Nilus runtime metrics are pushed with labels such as `stack_name`, `resource_id`, `tenant_name`, and the data plane labels above (`dataplane_name`, `dataplane_type`, `dataplane_network_type`, `dataos_fqdn`).
* Nilus Manager metrics use labels such as `instance`, `tenant`, `resource_id`, and `user`.
* Pushgateway may also expose series such as `push_time_seconds` and `push_failure_time_seconds`; those describe metric push health rather than pipeline business behavior.
* Use freshness alongside status. A successful but stale metric series can hide a pipeline that has not run recently.
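
Combining status with freshness, as the last note suggests, could look like the sketch below. It assumes `pipeline_end` is a Unix timestamp in seconds, which this page does not state, and the `pipeline_health` helper and the one-hour threshold are hypothetical choices for the example.

```python
import time


def pipeline_health(status: float, end_ts: float, now: float, max_age_s: float) -> str:
    """Classify the latest run as 'failed', 'stale', or 'ok'."""
    if status != 1:
        # pipeline_status: 1 indicates success, 0 indicates failure.
        return "failed"
    if now - end_ts > max_age_s:
        # The last run succeeded, but it finished too long ago to trust.
        return "stale"
    return "ok"


# Hypothetical values: a successful run that ended two hours ago, checked
# against a one-hour freshness window.
now = time.time()
print(pipeline_health(status=1, end_ts=now - 7200, now=now, max_age_s=3600))
```

Set `max_age_s` from the pipeline's schedule, for example slightly more than one schedule interval, so that a single late run does not raise a false alarm.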

## Related docs

* [Observability](/concepts/resources/nilus/observability.md)
* [Grafana Dashboards](/concepts/resources/nilus/observability/grafana-dashboard.md)
* [Pipeline Optimization](/concepts/resources/nilus/pipeline-optimization.md)

