Mrinal Narang

Posted on Jun 24

Dashboard Design for Incident Response


Most dashboards answer one question: Is everything okay?

During an incident, nobody's asking that.

The real question: What broke, where, and what changed?

Most dashboards fail at incidents because they were built for monitoring, not troubleshooting.

The Problem

A typical dashboard shows CPU, memory, disk, network, requests, uptime.

Useful for routine checks.

During an outage? Just noise.

You're not looking for reassurance. You're looking for evidence.

Two Different Jobs

Most teams put everything on one dashboard. The result is a compromise that serves neither job well.

Monitoring dashboard: Is the platform healthy? SLAs being met? Resources used correctly?

Incident dashboard: What failed? When? What changed? Where do I look next?

Same tools, different purposes.

What Works During an Outage

Error rate front and center. 5xx errors, exceptions, failed transactions. Failures tell the story faster than CPU metrics.
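As a rough sketch of what that panel computes, here is a per-service 5xx rate sorted worst-first. The `(service, status_code)` input shape and the sample values are assumptions for illustration; in practice the numbers would come from your metrics backend.

```python
from collections import defaultdict


def error_rate_by_service(requests):
    """Return {service: fraction of requests that ended in a 5xx status}.

    `requests` is an iterable of (service, status_code) pairs. This shape is
    an assumption for the sketch, not any particular tool's format.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for service, status in requests:
        totals[service] += 1
        if 500 <= status <= 599:
            failures[service] += 1
    return {service: failures[service] / totals[service] for service in totals}


# Made-up sample data.
sample = [("checkout", 200), ("checkout", 503), ("checkout", 500), ("search", 200)]
rates = error_rate_by_service(sample)

# Worst offender first, so the failing service sits at the top of the panel.
worst_first = sorted(rates.items(), key=lambda item: item[1], reverse=True)
```

Sorting by rate rather than raw count keeps a low-traffic but fully broken service from hiding behind a busy healthy one.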

Timeline on the graph. Mark deployments, infrastructure changes, scaling events. Most incidents start right after something changed. Make that change visible at a glance.
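One way to picture the "what changed?" lookup: given an incident start time, list the change events that landed shortly before it. The event fields (`at`, `kind`, `what`), the 30-minute window, and the sample events are all hypothetical.

```python
from datetime import datetime, timedelta


def changes_before(incident_start, changes, window=timedelta(minutes=30)):
    """Return changes that landed within `window` before the incident, newest first."""
    recent = [
        change for change in changes
        if incident_start - window <= change["at"] <= incident_start
    ]
    return sorted(recent, key=lambda change: change["at"], reverse=True)


# Made-up change log mixing deployments and scaling events.
changes = [
    {"at": datetime(2024, 1, 15, 9, 0), "kind": "deploy", "what": "checkout v42"},
    {"at": datetime(2024, 1, 15, 13, 50), "kind": "scaling", "what": "api scaled to 12 pods"},
    {"at": datetime(2024, 1, 15, 13, 58), "kind": "deploy", "what": "payments v7"},
]

suspects = changes_before(datetime(2024, 1, 15, 14, 0), changes)
```

The morning deploy drops out of the list, which is the point: only changes close to the incident compete for attention.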

Dependency health. A healthy app talking to a dead database is not healthy. Dependencies often point to root cause faster than app metrics.
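A minimal sketch of that idea: a service's effective status is the worst of its own status and its dependencies' statuses. The three status names and their ordering are assumptions picked for the example.

```python
# Higher number = worse. The status vocabulary is an assumption for this sketch.
SEVERITY = {"ok": 0, "degraded": 1, "down": 2}


def effective_status(app_status, dependency_statuses):
    """Return the worst status among the app itself and its dependencies."""
    worst_dependency = max(
        dependency_statuses.values(), key=SEVERITY.__getitem__, default="ok"
    )
    return max(app_status, worst_dependency, key=SEVERITY.__getitem__)


# The app reports healthy, but its database is down.
status = effective_status("ok", {"database": "down", "cache": "ok"})
```

Here `status` comes out as `"down"`, even though the app's own checks pass.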

Golden signals. Latency, traffic, errors, saturation. These four tell you more than hundreds of raw infrastructure metrics.

Logs visible. Top exceptions, error spikes, failed endpoints. Reduce tab-switching during incidents.
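A "top exceptions" panel boils down to a frequency count. The sketch below assumes structured log records with an `exception` field; both the field name and the sample records are made up.

```python
from collections import Counter


def top_exceptions(records, n=3):
    """Return the n most frequent exception types as (name, count) pairs."""
    counts = Counter(
        record["exception"] for record in records if record.get("exception")
    )
    return counts.most_common(n)


# Made-up structured log records; None means the request did not raise.
records = [
    {"endpoint": "/pay", "exception": "TimeoutError"},
    {"endpoint": "/pay", "exception": "TimeoutError"},
    {"endpoint": "/cart", "exception": "KeyError"},
    {"endpoint": "/health", "exception": None},
    {"endpoint": "/pay", "exception": "TimeoutError"},
]

top = top_exceptions(records)
```

Grouping the same records by `endpoint` instead would give the failed-endpoints view.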

Service map. Which services depend on the failing one? Visual dependency maps answer this instantly.
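Under the hood, that question is a walk over the dependency graph in reverse. A small sketch, assuming the map is available as a dict of "service: services it calls" (the service names are hypothetical):

```python
from collections import deque


def impacted_by(failing, depends_on):
    """Return every service that directly or transitively depends on `failing`."""
    # Invert the graph: for each service, who calls it?
    callers = {}
    for service, dependencies in depends_on.items():
        for dependency in dependencies:
            callers.setdefault(dependency, set()).add(service)

    # Breadth-first walk upstream from the failing service.
    impacted = set()
    queue = deque([failing])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, ()):
            if caller != failing and caller not in impacted:
                impacted.add(caller)
                queue.append(caller)
    return impacted


# Made-up service map.
service_map = {
    "web": ["checkout", "search"],
    "checkout": ["payments", "db"],
    "search": ["index"],
    "payments": ["db"],
}

blast_radius = impacted_by("db", service_map)
```

With `db` failing, the blast radius is `checkout`, `payments`, and `web`, while `search` stays out of it.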

Alert state. Which alerts fired? Which started first? The first alert usually points closer to the root cause than the hundredth.
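Ordering the panel by first-fired time is a small thing to implement. A sketch, assuming each alert carries a `state` and a `fired_at` timestamp (hypothetical field names and sample alerts):

```python
from datetime import datetime


def in_firing_order(alerts):
    """Return currently firing alerts sorted by when they first fired, earliest first."""
    firing = [alert for alert in alerts if alert["state"] == "firing"]
    return sorted(firing, key=lambda alert: alert["fired_at"])


# Made-up alerts.
alerts = [
    {"name": "HighLatencyWeb", "state": "firing", "fired_at": datetime(2024, 1, 15, 14, 3)},
    {"name": "DbConnectionsExhausted", "state": "firing", "fired_at": datetime(2024, 1, 15, 14, 0)},
    {"name": "DiskAlmostFull", "state": "resolved", "fired_at": datetime(2024, 1, 15, 9, 0)},
]

ordered = in_firing_order(alerts)
first_alert = ordered[0]["name"]
```

The latency alert is the loud one, but the database alert fired three minutes earlier and lands at the top.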

The Test

For every panel: How does this help me resolve the incident faster?

If the answer isn't obvious, remove it.

Example

Take an EKS outage. Don't lead with cluster CPU and memory.

Show:

Failed requests by service
Pod restarts
Readiness failures
Recent deployments
HPA scaling events
Dependency latency
Top exceptions
Queue backlogs

Cluster CPU and memory tell you the cluster exists. The list above helps you fix it.

The Point

Monitoring dashboards tell you something broke.

Incident dashboards help you figure out why.

During an outage, only the second one matters.
