There are two kinds of observability dashboards: the kind that gets opened during an incident, and the kind that doesn’t. This post is about building the first kind.
I’ve published a sample of the dashboard design on my portfolio; here I’ll walk through the design choices and what we deliberately left out.
The data sources
- Prometheus for service-level metrics (latency, error rate, saturation, queue depths).
- CloudWatch for AWS-managed services (RDS, ALB, NAT, Lambda).
- GuardDuty for security findings, surfaced inline so SREs see them, not in a separate tool.
- CloudTrail for “what changed?” — the single most important question in any incident.
- Synthetic checks (k6 + a tiny in-house probe) for the user’s perspective, not the system’s.
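For context, this is roughly how the first two sources plug into a single Grafana instance as provisioned data sources. A minimal sketch, assuming a standard in-cluster Prometheus and the built-in CloudWatch plugin; the names, URL, and region are placeholders, and how GuardDuty and CloudTrail are surfaced depends on your setup, so they are omitted here.

```yaml
# provisioning/datasources/observability.yaml (hypothetical path)
apiVersion: 1
datasources:
  # Service-level metrics scraped in-cluster
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus.monitoring.svc:9090   # assumed in-cluster address
    isDefault: true
  # AWS-managed services: RDS, ALB, NAT, Lambda
  - name: CloudWatch
    type: cloudwatch
    jsonData:
      authType: default         # use the default AWS credential chain
      defaultRegion: eu-west-1   # placeholder region
```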
The layout (top to bottom)
- Customer KPIs. Successful checkouts per minute, signups per minute, p95 page load. The numbers a non-engineer can read.
- Golden signals per service. Latency, traffic, errors, saturation. One row per critical service. Colour by SLO burn (query sketch after this list).
- Infra spine. EKS cluster health, RDS connections, NAT throughput, ALB 5xx rate. Cross-region side-by-side.
- Security & change. Active GuardDuty findings, count of CloudTrail events from privileged roles in the last hour, IAM policy changes today.
- Synthetics. “Can a real user log in and click buy right now?” A green/red answer for each critical journey.
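To make the top and bottom rows concrete, here are the kinds of queries that sit behind them. A hedged sketch in PromQL, assuming hypothetical metric names (checkout_success_total, http_requests_total, probe_success) and a 99.9% availability SLO; your metric names and targets will differ.

```promql
# Customer KPI: successful checkouts per minute (metric name assumed)
sum(rate(checkout_success_total[5m])) * 60

# SLO burn used to colour a golden-signals row: observed error ratio divided
# by the error budget (0.1% for a 99.9% SLO). Above 1 means the budget is
# being spent faster than it is earned.
(
  sum(rate(http_requests_total{job="checkout", code=~"5.."}[1h]))
  /
  sum(rate(http_requests_total{job="checkout"}[1h]))
) / 0.001

# Synthetics row: 1 = green, 0 = red, one stat panel per critical journey
min by (journey) (probe_success{job="synthetics"})
```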
Every panel has a link out: to the runbook for the alert it represents, to the service’s deeper dashboard, to the relevant CloudTrail filter, to the on-call rotation. The dashboard is a launchpad, not a destination.
If your incident commander has to leave Grafana to figure out what changed in the last hour, your dashboard is missing the most important panel.
Multi-cluster & multi-region
Production and DR run side-by-side in every infra panel. During a quarterly DR drill we don’t open a different dashboard — we watch the DR side go from idle to live on the same screen. Side-by-side panels are also how we caught the AMI drift bug we couldn’t reproduce in a tabletop exercise.
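The side-by-side view comes from labelling, not from duplicate panels. A sketch, assuming each Prometheus attaches an external cluster label and that prod and dr are the label values; adjust to whatever your federation or remote-write setup actually emits.

```promql
# One query, one panel, two series: production and DR drawn next to each other.
sum by (cluster) (
  rate(http_requests_total{code=~"5..", cluster=~"prod|dr"}[5m])
)
```

The same effect can also be had in Grafana by repeating a row over a cluster template variable, which keeps prod and DR visually aligned panel for panel.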
What we deliberately left out
- JVM internals. Belongs on the per-service dashboard, not the front page.
- Per-pod CPU. Same. The front page is “is the user okay?”, not “is this pod okay?”
- Cost. Important, but not an incident signal. Lives on a separate FinOps board.
- Vanity metrics. If a panel hasn’t informed an action in 90 days, we delete it. Dashboards rot if you don’t prune them.
Alerting policy
- Alert on user impact, not on causes. “p95 checkout > 3s for 5 minutes” is an alert. “Pod CPU > 80%” is a metric.
- SLO-burn-rate alerting for slow burns; raw thresholds for fast burns (rule sketch after this list).
- One owner per alert. Page goes to the team that can actually fix it. Routing alerts to a generic on-call queue is how alert fatigue starts.
- Every alert links to its dashboard panel and its runbook. If it doesn’t, it’s a bug.
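Here is what the first two bullets look like as Prometheus rules. A sketch, not our production config: it assumes a 99.9% SLO over a 30-day window, hypothetical recording rules (checkout:error_ratio:rate1h, checkout:error_ratio:rate5m), and placeholder dashboard and runbook URLs for the links the last bullet demands.

```yaml
groups:
  - name: checkout-alerts
    rules:
      # Fast burn: raw threshold on user impact, pages the team that owns checkout.
      - alert: CheckoutP95LatencyHigh
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(http_request_duration_seconds_bucket{job="checkout"}[5m]))
          ) > 3
        for: 5m
        labels:
          severity: page
          team: checkout
        annotations:
          summary: "p95 checkout latency above 3s for 5 minutes"
          dashboard: https://grafana.example.com/d/front-page      # placeholder
          runbook: https://runbooks.example.com/checkout/latency   # placeholder

      # Slow burn: error budget consumed 14.4x faster than sustainable
      # (2% of a 30-day budget in one hour), confirmed by a short window
      # so the alert resolves quickly once the burn stops.
      - alert: CheckoutErrorBudgetBurn
        expr: |
          (checkout:error_ratio:rate1h / 0.001) > 14.4
          and
          (checkout:error_ratio:rate5m / 0.001) > 14.4
        labels:
          severity: page
          team: checkout
        annotations:
          summary: "Checkout is burning its 30-day error budget 14.4x too fast"
          dashboard: https://grafana.example.com/d/front-page       # placeholder
          runbook: https://runbooks.example.com/checkout/slo-burn   # placeholder
```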
A great dashboard is opinionated. It says “these five things matter; everything else is a click away.” Dashboards that try to show everything end up showing nothing.
Further reading
- Grafana unified dashboard sample
- Tabletop to live drill: ransomware DR
- 10-minute deploys: CI/CD on 300+ services
If your incident commander still alt-tabs between five tools to answer “what just changed?”, talk to us.