
One pane of glass: a unified Grafana dashboard for prod, DR and customer KPIs

5 data sources, 1 view
99.95% uptime tracked
Side-by-side prod & DR
0 vanity panels

There are two kinds of observability dashboards: the kind that gets opened in an incident and the kind that doesn’t. This is about building the first kind.

I’ve published a sample of the dashboard design on my portfolio; this post is about the design choices and what we deliberately left out.

The data sources

  • Prometheus for service-level metrics (latency, error rate, saturation, queue depths).
  • CloudWatch for AWS-managed services (RDS, ALB, NAT, Lambda).
  • GuardDuty for security findings, surfaced inline so SREs see them on this dashboard rather than in a separate tool.
  • CloudTrail for “what changed?” — the single most important question in any incident.
  • Synthetic checks (k6 + a tiny in-house probe) for the user’s perspective, not the system’s.
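
For context, Prometheus and CloudWatch are plain Grafana datasources; GuardDuty and CloudTrail usually reach the dashboard indirectly (for example via CloudWatch or Athena). Here is a minimal sketch of how the first two could be provisioned, written as a small Python script that emits the YAML Grafana's provisioning loader expects. It assumes PyYAML is installed, and every name, URL and region is illustrative rather than our real config.

  # sketch_datasources.py: emit a Grafana datasource provisioning file.
  # Assumes PyYAML; all names, URLs and regions below are placeholders.
  import yaml

  provisioning = {
      "apiVersion": 1,
      "datasources": [
          {
              "name": "Prometheus (prod)",                    # illustrative name
              "type": "prometheus",
              "access": "proxy",
              "url": "http://prometheus.prod.internal:9090",  # placeholder URL
              "isDefault": True,
          },
          {
              "name": "CloudWatch (prod)",                    # illustrative name
              "type": "cloudwatch",
              "jsonData": {
                  "authType": "default",                      # instance role / default credential chain
                  "defaultRegion": "eu-west-1",                # placeholder region
              },
          },
      ],
  }

  with open("unified-datasources.yaml", "w") as f:
      yaml.safe_dump(provisioning, f, sort_keys=False)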

The layout (top to bottom)

  1. Customer KPIs. Successful checkouts per minute, signups per minute, p95 page load. The numbers a non-engineer can read.
  2. Golden signals per service. Latency, traffic, errors, saturation. One row per critical service. Colour by SLO burn.
  3. Infra spine. EKS cluster health, RDS connections, NAT throughput, ALB 5xx rate. Cross-region side-by-side.
  4. Security & change. Active GuardDuty findings, count of CloudTrail events from privileged roles in the last hour, IAM policy changes today.
  5. Synthetics. “Can a real user log in and click buy right now?” A green/red answer for each critical journey.
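
To make the shape concrete, here is a rough sketch of that top-to-bottom layout in Grafana's dashboard JSON model, built with a small Python helper. The row titles mirror the list above; everything else (queries, datasources, per-service panels) is deliberately left as placeholders rather than the real dashboard.

  # sketch_layout.py: the dashboard's row skeleton, customer KPIs first.
  # Child panels, queries and datasources are placeholders, not the real board.
  import json

  def row(title, collapsed=False):
      # A Grafana "row" panel; service-level panels get appended to "panels".
      return {"type": "row", "title": title, "collapsed": collapsed, "panels": []}

  dashboard = {
      "title": "One pane of glass",
      "panels": [
          row("1. Customer KPIs"),               # checkouts/min, signups/min, p95 page load
          row("2. Golden signals per service"),  # latency, traffic, errors, saturation
          row("3. Infra spine"),                 # EKS, RDS, NAT, ALB, prod vs DR side by side
          row("4. Security & change"),           # GuardDuty findings, CloudTrail activity, IAM changes
          row("5. Synthetics"),                  # green/red per critical user journey
      ],
  }

  print(json.dumps(dashboard, indent=2))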

Every panel has a link out: to the runbook for the alert it represents, to the service’s deeper dashboard, to the relevant CloudTrail filter, to the on-call rotation. The dashboard is a launchpad, not a destination.
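
In the JSON model those exits are just each panel's links array. A hedged example of what one panel's link set could look like, with placeholder URLs standing in for the real runbook, deep-dive dashboard, CloudTrail filter and rota:

  # Illustrative panel links; every URL below is a placeholder.
  checkout_latency_panel_links = [
      {"title": "Runbook: slow checkout",       "url": "https://wiki.example.com/runbooks/checkout-latency"},
      {"title": "Checkout service deep dive",   "url": "/d/checkout-service"},
      {"title": "CloudTrail: changes, last 1h", "url": "https://console.aws.amazon.com/cloudtrail/home#/events"},
      {"title": "Who is on call",               "url": "https://example.pagerduty.com/schedules"},
  ]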

If your incident commander has to leave Grafana to figure out what changed in the last hour, your dashboard is missing the most important panel.
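
One way to feed that panel, sketched with boto3: pull the last hour of write events from CloudTrail and count them by event name. This is illustrative plumbing rather than the panel's actual query; the same idea could equally be a CloudWatch Logs Insights query over the trail.

  # sketch_what_changed.py: non-read-only CloudTrail events from the last hour.
  # Uses boto3's lookup_events; credentials, region and output format are illustrative.
  from collections import Counter
  from datetime import datetime, timedelta, timezone

  import boto3

  cloudtrail = boto3.client("cloudtrail")
  now = datetime.now(timezone.utc)

  pages = cloudtrail.get_paginator("lookup_events").paginate(
      LookupAttributes=[{"AttributeKey": "ReadOnly", "AttributeValue": "false"}],
      StartTime=now - timedelta(hours=1),
      EndTime=now,
  )

  by_event = Counter(e["EventName"] for page in pages for e in page["Events"])
  for name, count in by_event.most_common(10):
      print(f"{count:4d}  {name}")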

Multi-cluster & multi-region

Production and DR run side-by-side in every infra panel. During a quarterly DR drill we don’t open a different dashboard; we watch the DR side go from idle to live on the same screen. Side-by-side panels are also how we caught the AMI drift bug we couldn’t reproduce in tabletop exercises.
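
As a sketch of how one such panel could be wired: assuming both regions' Prometheus instances expose the same metric names, a single panel can query prod and DR through Grafana's Mixed datasource. The datasource UIDs and the metric below are placeholders, not our real setup.

  # sketch_side_by_side.py: one panel, two queries, prod and DR on the same graph.
  # Relies on Grafana's built-in "-- Mixed --" datasource; UIDs and the metric are placeholders.
  import json

  alb_5xx_panel = {
      "title": "ALB 5xx rate: prod vs DR",
      "type": "timeseries",
      "datasource": {"type": "datasource", "uid": "-- Mixed --"},
      "targets": [
          {
              "refId": "A",
              "datasource": {"type": "prometheus", "uid": "prom-prod"},  # placeholder UID
              "expr": 'sum(rate(alb_http_responses_total{code=~"5.."}[5m]))',
              "legendFormat": "prod",
          },
          {
              "refId": "B",
              "datasource": {"type": "prometheus", "uid": "prom-dr"},    # placeholder UID
              "expr": 'sum(rate(alb_http_responses_total{code=~"5.."}[5m]))',
              "legendFormat": "DR",
          },
      ],
  }

  print(json.dumps(alb_5xx_panel, indent=2))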

What we deliberately left out

  • JVM internals. Belongs on the per-service dashboard, not the front page.
  • Per-pod CPU. Same. The front page is “is the user okay?”, not “is this pod okay?”
  • Cost. Important, but not an incident signal. Lives on a separate FinOps board.
  • Vanity metrics. If a panel hasn’t informed an action in 90 days, we delete it. Dashboards rot if you don’t prune them.

Alerting policy

  • Alert on user impact, not on causes. “p95 checkout > 3s for 5 minutes” is an alert. “Pod CPU > 80%” is a metric.
  • SLO-burn-rate alerting for slow burns; raw thresholds for fast burns.
  • One owner per alert. Page goes to the team that can actually fix it. Routing alerts to a generic on-call queue is how alert fatigue starts.
  • Every alert links to its dashboard panel and its runbook. If it doesn’t, it’s a bug.
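
To make the burn-rate point concrete, here is a hedged sketch of a fast-burn paging rule for a 99.95% availability SLO in the multi-window, multi-burn-rate style, written as a Python snippet that emits a Prometheus rule file. The metric names, windows and URLs are illustrative; a 14.4x burn over 1h and 5m corresponds to spending roughly 2% of a 30-day error budget in one hour.

  # sketch_burn_rate.py: fast-burn paging rule for a 99.95% availability SLO.
  # Metric names are placeholders; assumes PyYAML for the rule-file output.
  import yaml

  SLO = 0.9995
  ERROR_BUDGET = 1 - SLO                          # 0.0005
  FAST_BURN = 14.4                                # page-worthy burn rate
  THRESHOLD = round(FAST_BURN * ERROR_BUDGET, 6)  # 0.0072

  def error_ratio(window):
      # Fraction of checkout requests that failed over the given window.
      return (
          f"sum(rate(checkout_requests_errors_total[{window}]))"
          f" / sum(rate(checkout_requests_total[{window}]))"
      )

  rules = {
      "groups": [{
          "name": "checkout-slo",
          "rules": [{
              "alert": "CheckoutErrorBudgetFastBurn",
              "expr": (
                  f"({error_ratio('1h')} > {THRESHOLD})"
                  f" and ({error_ratio('5m')} > {THRESHOLD})"
              ),
              "for": "2m",
              "labels": {"severity": "page", "owner": "checkout-team"},  # one owner per alert
              "annotations": {
                  "dashboard": "https://grafana.example.com/d/one-pane-of-glass",  # placeholder
                  "runbook": "https://wiki.example.com/runbooks/checkout-errors",  # placeholder
              },
          }],
      }],
  }

  print(yaml.safe_dump(rules, sort_keys=False))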

Takeaway

A great dashboard is opinionated. It says “these five things matter; everything else is a click away.” Dashboards that try to show everything end up showing nothing.

If your incident commander still alt-tabs between five tools to answer “what just changed?”, talk to us.
