There are two kinds of observability dashboards: the kind that gets opened during an incident, and the kind that doesn’t. This post is about building the first kind.
I’ve published a sample of the dashboard design on my portfolio; here I’ll walk through the design choices and what we deliberately left out.
The data sources
- Prometheus for service-level metrics (latency, error rate, saturation, queue depths).
- CloudWatch for AWS-managed services (RDS, ALB, NAT, Lambda).
- GuardDuty for security findings, surfaced inline so SREs see them, not in a separate tool.
- CloudTrail for “what changed?” — the single most important question in any incident.
- Synthetic checks (k6 + a tiny in-house probe) for the user’s perspective, not the system’s.
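For context, this is roughly how the first two sources plug into a single Grafana instance as provisioned data sources. A minimal sketch, assuming a standard in-cluster Prometheus and the built-in CloudWatch plugin; the names, URL, and region are placeholders, and how GuardDuty and CloudTrail are surfaced depends on your setup, so they are omitted here.

```yaml
# provisioning/datasources/observability.yaml (hypothetical path)
apiVersion: 1
datasources:
  # Service-level metrics scraped in-cluster
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus.monitoring.svc:9090   # assumed in-cluster address
    isDefault: true
  # AWS-managed services: RDS, ALB, NAT, Lambda
  - name: CloudWatch
    type: cloudwatch
    jsonData:
      authType: default         # use the default AWS credential chain
      defaultRegion: eu-west-1   # placeholder region
```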
The layout (top to bottom)
- Customer KPIs. Successful checkouts per minute, signups per minute, p95 page load. The numbers a non-engineer can read.
- Golden signals per service. Latency, traffic, errors, saturation. One row per critical service. Colour by SLO burn (query sketch after this list).
- Infra spine. EKS cluster health, RDS connections, NAT throughput, ALB 5xx rate. Cross-region side-by-side.
- Security & change. Active GuardDuty findings, count of CloudTrail events from privileged roles in the last hour, IAM policy changes today.
- Synthetics. “Can a real user log in and click buy right now?” A green/red answer for each critical journey.
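To make the top and bottom rows concrete, here are the kinds of queries that sit behind them. A hedged sketch in PromQL, assuming hypothetical metric names (checkout_success_total, http_requests_total, probe_success) and a 99.9% availability SLO; your metric names and targets will differ.

```promql
# Customer KPI: successful checkouts per minute (metric name assumed)
sum(rate(checkout_success_total[5m])) * 60

# SLO burn used to colour a golden-signals row: observed error ratio divided
# by the error budget (0.1% for a 99.9% SLO). Above 1 means the budget is
# being spent faster than it is earned.
(
  sum(rate(http_requests_total{job="checkout", code=~"5.."}[1h]))
  /
  sum(rate(http_requests_total{job="checkout"}[1h]))
) / 0.001

# Synthetics row: 1 = green, 0 = red, one stat panel per critical journey
min by (journey) (probe_success{job="synthetics"})
```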
Every panel has a link out: to the runbook for the alert it represents, to the service’s deeper dashboard, to the relevant CloudTrail filter, to the on-call rotation. The dashboard is a launchpad, not a destination.
If your incident commander has to leave Grafana to figure out what changed in the last hour, your dashboard is missing the most important panel.
Multi-cluster & multi-region
Production and DR run side-by-side in every infra panel. During a quarterly DR drill we don’t open a different dashboard — we watch the DR side go from idle to live on the same screen. Side-by-side panels are also how we caught the AMI drift bug we couldn’t reproduce in a tabletop exercise.
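The side-by-side view comes from labelling, not from duplicate panels. A sketch, assuming each Prometheus attaches an external cluster label and that prod and dr are the label values; adjust to whatever your federation or remote-write setup actually emits.

```promql
# One query, one panel, two series: production and DR drawn next to each other.
sum by (cluster) (
  rate(http_requests_total{code=~"5..", cluster=~"prod|dr"}[5m])
)
```

The same effect can also be had in Grafana by repeating a row over a cluster template variable, which keeps prod and DR visually aligned panel for panel.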
What we deliberately left out
- JVM internals. Belongs on the per-service dashboard, not the front page.
- Per-pod CPU. Same. The front page is “is the user okay?”, not “is this pod okay?”
- Cost. Important, but not an incident signal. Lives on a separate FinOps board.
- Vanity metrics. If a panel hasn’t informed an action in 90 days, we delete it. Dashboards rot if you don’t prune them.
Alerting policy
- Alert on user impact, not on causes. “p95 checkout > 3s for 5 minutes” is an alert. “Pod CPU > 80%” is a metric.
- SLO-burn-rate alerting for slow burns; raw thresholds for fast burns (rule sketch after this list).
- One owner per alert. Page goes to the team that can actually fix it. Routing alerts to a generic on-call queue is how alert fatigue starts.
- Every alert links to its dashboard panel and its runbook. If it doesn’t, it’s a bug.
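Here is what the first two bullets look like as Prometheus rules. A sketch, not our production config: it assumes a 99.9% SLO over a 30-day window, hypothetical recording rules (checkout:error_ratio:rate1h, checkout:error_ratio:rate5m), and placeholder dashboard and runbook URLs for the links the last bullet demands.

```yaml
groups:
  - name: checkout-alerts
    rules:
      # Fast burn: raw threshold on user impact, pages the team that owns checkout.
      - alert: CheckoutP95LatencyHigh
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(http_request_duration_seconds_bucket{job="checkout"}[5m]))
          ) > 3
        for: 5m
        labels:
          severity: page
          team: checkout
        annotations:
          summary: "p95 checkout latency above 3s for 5 minutes"
          dashboard: https://grafana.example.com/d/front-page      # placeholder
          runbook: https://runbooks.example.com/checkout/latency   # placeholder

      # Slow burn: error budget consumed 14.4x faster than sustainable
      # (2% of a 30-day budget in one hour), confirmed by a short window
      # so the alert resolves quickly once the burn stops.
      - alert: CheckoutErrorBudgetBurn
        expr: |
          (checkout:error_ratio:rate1h / 0.001) > 14.4
          and
          (checkout:error_ratio:rate5m / 0.001) > 14.4
        labels:
          severity: page
          team: checkout
        annotations:
          summary: "Checkout is burning its 30-day error budget 14.4x too fast"
          dashboard: https://grafana.example.com/d/front-page       # placeholder
          runbook: https://runbooks.example.com/checkout/slo-burn   # placeholder
```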
A great dashboard is opinionated. It says “these five things matter; everything else is a click away.” Dashboards that try to show everything end up showing nothing.
Further reading
- Grafana unified dashboard sample
- Tabletop to live drill: ransomware DR
- 10-minute deploys: CI/CD on 300+ services
If your incident commander still alt-tabs between five tools to answer “what just changed?”, talk to us.