# Grafana Dashboards

Nilus ships two Grafana dashboards that turn raw pipeline telemetry into answers you can act on: **did my pipeline run, is my data fresh, how much landed, and if something broke, where do I look next**. They take you from "is my data current?" to "here is the pipeline, the run, and the next check" in a few clicks.

Use Grafana for live health and trends. Use resource logs for the exact failure message and the final remediation detail.

## The two dashboards

| Dashboard                | Who it is for                                           | How you scope it                      |
| ------------------------ | ------------------------------------------------------- | ------------------------------------- |
| **Nilus Pipelines**      | Pipeline owners and consumers (data engineers/analysts) | **Tenant + Pipeline (`resource_id`)** |
| **Nilus Fleet Overview** | Operators and tenant admins                             | Tenant, owner, dataplane, pipeline    |

**Nilus Pipelines** is the everyday view. Pick your **Tenant**, then add the **Pipeline (`resource_id`)** you want to observe. It answers: did it run, is the data fresh, how many rows landed, how long did it take, and, for Change Data Capture (CDC) pipelines, is the service connected and caught up. It is scoped by *pipeline* rather than by who deployed it, so you can watch any pipeline whose output you depend on.

**Nilus Fleet Overview** is the fleet view for operators and admins. It covers every pipeline across tenants: **workflows** (batch and metadata), **services** (CDC), and **workers** (stream, including system-tenant workers such as the CloudEvents/NATS pipelines that feed the lakehouse and Quickwit). It shows what needs attention, inventory, ownership, last-run reliability, resource footprint, and CDC health. Group by `user_name` to answer governance questions like "show me everything this user runs."

## How to read these dashboards

Both dashboards rely on one fact about the data: Nilus pipelines push metrics while they run, and the **last value is retained between runs**. The dashboards show each pipeline's **last-known status** as-is. A batch pipeline that last ran successfully a week ago keeps showing OK until its next run.

* **Healthy / Failing** count pipelines by their last-known run status. Read them together with **Telemetry age** to know whether that status is recent or historical.
* **Data Freshness** tells you how current the data feeding your tables actually is.
* The **Selected Pipeline** detail section on Nilus Pipelines fills in only when you pick a *single* pipeline. With multiple or All selected it stays empty by design, to keep the per-pipeline tiles readable.
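The rule of reading status together with telemetry age can be sketched in a few lines of Python. This is an illustration only, not how the dashboards are implemented: the `PipelineSnapshot` type, the `describe` helper, the sample `resource_id`, and the one-hour cut-off are all assumptions made for the example.

```python
from dataclasses import dataclass

# Assumption: one hour is an illustrative cut-off, not a Nilus default.
# Pick a threshold that matches the pipeline's schedule.
RECENT_SECONDS = 3600.0


@dataclass
class PipelineSnapshot:
    """Hypothetical container for the two values read off the dashboard."""
    resource_id: str
    last_status_ok: bool          # last-known run status
    telemetry_age_seconds: float  # time since the last metric push


def describe(snapshot: PipelineSnapshot, recent_seconds: float = RECENT_SECONDS) -> str:
    """Qualify the last-known status by how old its telemetry is."""
    status = "OK" if snapshot.last_status_ok else "FAILED"
    if snapshot.telemetry_age_seconds <= recent_seconds:
        recency = "recent"
    else:
        recency = "historical"
    return f"{status} ({recency})"


# A batch pipeline that last succeeded a week ago still reads OK,
# but the age shows that the status is historical.
week_old = PipelineSnapshot("workflow:orders", True, 7 * 24 * 3600)
print(describe(week_old))
```

The point of the sketch is that neither value is meaningful alone: a retained OK says nothing about today unless the telemetry age is small.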

### Everyday flow (Nilus Pipelines)

1. Open **Nilus Pipelines** and select your **Tenant**, then the **Pipeline(s)** you want to observe.
2. Read the **Pipelines In Scope** table: last status, telemetry age, records, and duration for each, with a **Deployed by** column showing the owner.
3. Check **Data Freshness (oldest)** to know how current your input data is.
4. To confirm the rows you expect actually landed, read **Records by Stage / Table**.
5. If a pipeline failed, open its resource logs for the exact error. The dashboard narrows *where* to look.

## CDC service health

CDC services run continuously and push metrics every few seconds. The **CDC Service Health** table and tiles count only **live** services: those that have pushed telemetry within the last hour. Services you deleted or stopped testing drop off instead of showing a stale `connected=yes` forever.

Alongside connection state and source lag, the table surfaces **committed transactions** and **events seen**. These are the closest available progress signals, because Nilus does not emit raw connector offsets (Log Sequence Number (LSN) or consumer offsets) to metrics. For batch and worker pipelines, the equivalent progress signal is **records processed**, shown in the volume panels.
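The one-hour liveness rule described above amounts to a simple filter on the time of each service's last push. The sketch below assumes a hypothetical mapping from `resource_id` to last-push time in epoch seconds; the function name and sample IDs are illustrative and do not come from Nilus.

```python
LIVE_WINDOW_SECONDS = 3600  # "within the last hour", as described in the text


def live_services(last_push_by_service: dict, now: float) -> list:
    """Return the IDs of services whose last push falls inside the live window."""
    return sorted(
        resource_id
        for resource_id, last_push in last_push_by_service.items()
        if now - last_push <= LIVE_WINDOW_SECONDS
    )


# Hypothetical data: one service pushed 10 minutes ago, one stopped a day ago.
now = 1_700_000_000
pushes = {
    "service:orders-cdc": now - 600,
    "service:old-test-cdc": now - 86_400,
}
print(live_services(pushes, now))
```

With this rule, a stopped service disappears from the count once its last push is more than an hour old, even though its retained `connected` value never changes.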

## Required setup

1. Confirm Nilus pushes runtime metrics to Pushgateway.
2. Confirm Prometheus scrapes the Pushgateway and Nilus Manager metrics server.
3. Confirm metrics carry labels such as `stack_name`, `resource_id`, `tenant_name`, `user_name`, and the four data plane identity labels: `dataplane_name` (logical name, not FQDN), `dataplane_type`, `dataplane_network_type`, and `dataos_fqdn` (environment FQDN). If Grafana label selectors previously matched on a hostname-shaped `dataplane_name`, update them to use `dataos_fqdn` for the FQDN and `dataplane_name` for the logical name. See [Exposed Prometheus Metrics](/concepts/resources/nilus/observability/prometheus-metrics.md#data-plane-labels) for the full label reference.
4. Import both dashboard JSON files into Grafana. Select the Prometheus data source. Set the tenant and pipeline filters before you read the panels.

{% file src="/files/iBaHCPrOO7TFt2ZgL3lq" %}

{% file src="/files/V5IEhwMdau67YHqP1dxB" %}
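For the label change in step 3, the sketch below shows one way to move a hostname-shaped `dataplane_name` matcher over to `dataos_fqdn` and render the result as a PromQL-style label selector. The helper names are hypothetical, the sample label values are invented, and treating "contains a dot" as hostname-shaped is an assumption made for illustration.

```python
def migrate_matchers(matchers: dict) -> dict:
    """Move a hostname-shaped dataplane_name value to dataos_fqdn.

    Assumption: a value containing a dot is treated as an FQDN.
    """
    migrated = dict(matchers)
    value = migrated.get("dataplane_name", "")
    if "." in value and "dataos_fqdn" not in migrated:
        migrated["dataos_fqdn"] = migrated.pop("dataplane_name")
    return migrated


def to_selector(matchers: dict) -> str:
    """Render equality matchers as a PromQL label selector string."""
    parts = []
    for label, value in sorted(matchers.items()):
        # Escape backslashes and double quotes inside the label value.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        parts.append(f'{label}="{escaped}"')
    return "{" + ", ".join(parts) + "}"


# Hypothetical selector that used to match on an FQDN in dataplane_name.
old = {"tenant_name": "acme", "dataplane_name": "dp.example.com"}
print(to_selector(migrate_matchers(old)))
```

A matcher that already uses a logical `dataplane_name` (no dot) passes through unchanged, so the helper is safe to apply to every saved selector.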

## What to trust, and what to treat with care

* **Last-known status** is a point-in-time snapshot of the most recent run, not a historical success rate. The Fleet Overview **Last-Run Pass Rate** tile is labelled accordingly. Always read it next to **Telemetry age** to know whether the run was recent.
* **High CPU or memory** is a question, not a failure. It matters most paired with low throughput, long duration, or repeated failures. Use it for sizing.
* **Telemetry age** tells you whether the other panels describe a recent run or a historical one. Always read it alongside status.
* **CDC lag** ignores Debezium's `-1` idle sentinel. A connected service with no active change stream reports no lag, not negative lag.
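The handling of the `-1` idle sentinel can be sketched as follows. This is an illustration of the idea, not the dashboard's actual query: the function name and the sample values are assumptions, and no unit is implied for the lag values.

```python
from typing import Iterable, Optional

IDLE_SENTINEL = -1  # reported when there is no active change stream


def max_lag(samples: Iterable[float]) -> Optional[float]:
    """Return the largest real lag value, or None if every sample is idle."""
    real = [s for s in samples if s != IDLE_SENTINEL]
    return max(real) if real else None


print(max_lag([-1, 120, 45]))  # sentinel is skipped, real values are compared
print(max_lag([-1, -1]))       # idle service: no lag rather than negative lag
```

Returning `None` instead of `-1` keeps an idle service from looking as if it were ahead of its source.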

## Caveats

* A pipeline that crashed hard may not emit a final status. Pair status with telemetry age and resource logs.
* Pipeline types are identified by the `resource_id` prefix: `workflow:` (batch and metadata), `service:` (CDC), and `worker:` (stream).
* Schema and data-quality details are not available from metrics today. The dashboards report row counts, not column-level changes.
* For exact per-run history beyond the last retained value, use the Nilus Manager pipeline records rather than Grafana.
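The prefix convention from the caveats above is easy to apply in scripts that post-process metric labels. The sketch below maps a `resource_id` to its pipeline type; the function name and the sample IDs are hypothetical, and the format of the part after the colon is not assumed.

```python
from typing import Optional

# Prefix-to-type mapping as listed in the caveats.
PIPELINE_KINDS = {
    "workflow": "batch and metadata",
    "service": "CDC",
    "worker": "stream",
}


def pipeline_kind(resource_id: str) -> Optional[str]:
    """Return the pipeline type for a resource_id, or None if the prefix is unknown."""
    prefix, separator, _rest = resource_id.partition(":")
    if not separator:
        return None  # no "prefix:" part at all
    return PIPELINE_KINDS.get(prefix)


for rid in ("workflow:orders-sync", "service:orders-cdc", "worker:events"):
    print(rid, "->", pipeline_kind(rid))
```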

## Related docs

* [Observability](/concepts/resources/nilus/observability.md)
* [Exposed Prometheus Metrics](/concepts/resources/nilus/observability/prometheus-metrics.md)
* [Checking Logs](/concepts/resources/nilus/troubleshooting/checking-logs.md)

