DEV Community

Mrinal Narang

Dashboard Design for Incident Response

Most dashboards answer one question: Is everything okay?

During an incident, nobody's asking that.

The real question: What broke, where, and what changed?

Most dashboards fail at incidents because they were built for monitoring, not troubleshooting.

The Problem

A typical dashboard shows CPU, memory, disk, network, request counts, and uptime.

Useful for routine checks.

During an outage? Just noise.

You're not looking for reassurance. You're looking for evidence.

Two Different Jobs

Most teams put everything on one dashboard. That's a compromise that serves neither job well.

Monitoring dashboard: Is the platform healthy? Are SLAs being met? Are resources being used efficiently?

Incident dashboard: What failed? When? What changed? Where do I look next?

Same tools, different purposes.

What Works During an Outage

Error rate front and center. 5xx errors, exceptions, failed transactions. Failures tell the story faster than CPU metrics.
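As a rough illustration, a per-service error rate can be computed from raw request records. This is a minimal sketch, assuming each record is a hypothetical (service, status_code) pair; a real panel would query its metrics backend instead.

```python
from collections import Counter

def error_rate_by_service(requests):
    """Fraction of 5xx responses per service.

    `requests` is an iterable of (service, status_code) pairs (assumed shape).
    """
    totals = Counter()
    failures = Counter()
    for service, status in requests:
        totals[service] += 1
        if 500 <= status <= 599:
            failures[service] += 1
    return {svc: failures[svc] / totals[svc] for svc in totals}

def rank_by_error_rate(requests):
    """Services sorted worst-first, so the failing one tops the panel."""
    rates = error_rate_by_service(requests)
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)
```

Sorting worst-first puts the failing service at the top of the panel instead of leaving it buried in an alphabetical list.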

Timeline on the graph. Mark deployments, infrastructure changes, scaling events. Most incidents start right after something changed. Make this visible at a glance.
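The same idea works as a lookup: given the incident start time, list what changed just before it. A minimal sketch, assuming change events are hypothetical dicts with a "time" field and that a 30-minute lookback is a sensible default (tune it to your release cadence).

```python
from datetime import datetime, timedelta

def changes_before(incident_start, events, window=timedelta(minutes=30)):
    """Change events that landed in the window before the incident, newest first.

    `events` is a list of dicts with at least a "time" datetime (assumed shape).
    """
    earliest = incident_start - window
    recent = [e for e in events if earliest <= e["time"] <= incident_start]
    return sorted(recent, key=lambda e: e["time"], reverse=True)
```

Newest-first ordering puts the change closest to the incident at the top, which is usually the first one worth checking.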

Dependency health. A healthy app talking to a dead database is not healthy. Dependencies often point to root cause faster than app metrics.
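One way to make this concrete is to roll dependency status up into the app's own status. A minimal sketch, assuming a hypothetical three-level status ("ok", "degraded", "down") and a hand-maintained dependency map; services with no reported status are treated as "ok" here, which is itself an assumption.

```python
SEVERITY = {"ok": 0, "degraded": 1, "down": 2}

def effective_status(service, own_status, depends_on, _seen=None):
    """Worst status among a service and everything it depends on, transitively."""
    seen = _seen if _seen is not None else set()
    if service in seen:
        # Already visited: guards against cycles and repeated work.
        return "ok"
    seen.add(service)
    worst = own_status.get(service, "ok")  # unknown services default to "ok"
    for dep in depends_on.get(service, []):
        dep_status = effective_status(dep, own_status, depends_on, seen)
        if SEVERITY[dep_status] > SEVERITY[worst]:
            worst = dep_status
    return worst
```

With this roll-up, an app that reports "ok" while its database is "down" shows as "down" on the panel.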

Golden signals. Latency, traffic, errors, saturation. These beat hundreds of infrastructure metrics.
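For reference, all four signals can be derived from a short window of request samples plus one capacity reading. This is a sketch under stated assumptions: the sample fields ("latency_ms", "status"), the nearest-rank p95, and the in-use/total capacity pair are illustrative choices, not a standard.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def golden_signals(samples, window_seconds, capacity_in_use, capacity_total):
    """Latency, traffic, errors, saturation for one time window.

    `samples` must be non-empty; each is a dict with "latency_ms" and "status".
    """
    latencies = [s["latency_ms"] for s in samples]
    errors = sum(1 for s in samples if s["status"] >= 500)
    return {
        "latency_p95_ms": percentile(latencies, 95),
        "traffic_rps": len(samples) / window_seconds,
        "error_ratio": errors / len(samples),
        "saturation": capacity_in_use / capacity_total,
    }
```

Four numbers per service fit in a single row, which is the point: a small set you can read under pressure.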

Logs visible. Top exceptions, error spikes, failed endpoints. Reduce tab-switching during incidents.

Service map. Which services depend on the failing one? Visual dependency maps answer this instantly.
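Behind a service map, the question "who depends on the failing service?" is a graph traversal. A minimal sketch, assuming a hypothetical `depends_on` mapping from each service to the services it calls:

```python
from collections import deque

def impacted_by(failing, depends_on):
    """Services that depend on `failing`, directly or transitively."""
    # Invert the map: for each service, who calls it?
    dependents = {}
    for svc, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(svc)

    impacted, queue = set(), deque([failing])
    while queue:
        current = queue.popleft()
        for svc in dependents.get(current, ()):
            if svc not in impacted and svc != failing:
                impacted.add(svc)
                queue.append(svc)
    return impacted
```

For example, with `{"web": ["api"], "api": ["db"], "reports": ["db"]}`, a failing `db` impacts `api`, `web`, and `reports`.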

Alert state. Which alerts fired? Which started first? The first alert usually points closer to root cause than the hundredth.
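Ordering alerts by when they fired is simple to sketch. This assumes each alert is a hypothetical dict with a "fired_at" datetime; real alerting tools expose this under their own field names.

```python
from datetime import datetime

def alerts_in_firing_order(alerts):
    """Alerts sorted oldest-first, so the likely origin leads the list."""
    return sorted(alerts, key=lambda a: a["fired_at"])

def first_alert(alerts):
    """The earliest alert, or None if nothing has fired."""
    return min(alerts, key=lambda a: a["fired_at"]) if alerts else None
```

Oldest-first is the opposite of most alert feeds, which show the newest alert on top and push the origin off the screen.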

The Test

For every panel: How does this help me resolve the incident faster?

If the answer isn't obvious, remove it.

Example

Take an EKS outage. Don't lead with cluster CPU and memory.

Show:

  • Failed requests by service
  • Pod restarts
  • Readiness failures
  • Recent deployments
  • HPA scaling events
  • Dependency latency
  • Top exceptions
  • Queue backlogs

One tells you the cluster exists. The other helps you fix it.
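Two of those panels, pod restarts and readiness failures, can be approximated from parsed `kubectl get pods -o json` output. A rough sketch: the function name and output shape are made up for illustration, and it reads only the `metadata.name` and `status.containerStatuses` fields. Note that containers in completed pods also report as not ready, so treat the result as a starting point.

```python
def pod_trouble(pod_list):
    """Pods with restarts or unready containers, most restarts first.

    `pod_list` is the dict obtained by parsing `kubectl get pods -o json`.
    """
    rows = []
    for pod in pod_list.get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses", [])
        restarts = sum(c.get("restartCount", 0) for c in statuses)
        not_ready = [c["name"] for c in statuses if not c.get("ready", False)]
        if restarts or not_ready:
            rows.append({
                "pod": pod["metadata"]["name"],
                "restarts": restarts,
                "not_ready": not_ready,
            })
    return sorted(rows, key=lambda r: r["restarts"], reverse=True)
```

Healthy pods are dropped from the result on purpose: the panel should list only what needs attention.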

The Point

Monitoring dashboards tell you something broke.

Incident dashboards help you figure out why.

During an outage, only the second one matters.


#DevOps #SRE #Monitoring #Kubernetes #IncidentResponse #Dashboard
