Let me paint you a picture.
It's 3am. PagerDuty fires. You roll out of bed, open your laptop, and stare at a wall of dashboards. CPU looks fine. Memory looks fine. Error rate is... actually, what is that spike? Is that new? Was it doing that last week? You open five more tabs. You check the logs. There are 847,000 log lines from the last hour alone.
By the time you find the actual problem, forty minutes have passed. The issue was a single misconfigured timeout on a downstream service. It was logged. It was technically visible the entire time.
You just couldn't see it.
This is modern observability in most engineering organizations — not a lack of data, but a catastrophic excess of it, organized in ways that serve tools rather than humans. Let's talk about what's actually broken, and what fixing it looks like in practice.
Problem #1: Your Metrics Are Measuring the Wrong Layer
Here's something counterintuitive: most teams have more metrics than they need and fewer insights than they think.
The average Kubernetes cluster emits thousands of metrics per minute. Node CPU. Pod restarts. Container memory limits. Network bytes in, network bytes out. It's comprehensive. It's also, for the most part, completely useless during an incident — because none of it answers the question your users are implicitly asking, which is: is the product working?
The metrics that matter are embarrassingly simple:
The four signals that actually tell you something:
- Latency → how long are requests taking?
- Error rate → how many are failing?
- Throughput → how much traffic are we serving?
- Saturation → how close are we to the limit?
Google named these the Four Golden Signals in its SRE book, published nearly a decade ago (the book calls throughput "traffic"). Most teams have read it. Most teams still have dashboards that bury these numbers under seventeen infrastructure panels about disk I/O on nodes that haven't caused an incident in three years.
Start from what users experience. Work backwards to infrastructure. Not the other way around.
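To make that concrete, here is a minimal sketch of computing all four signals from one window of request records. The Request shape, the choice to count only 5xx responses as errors, and the in_flight / capacity inputs for saturation are assumptions for illustration, not a prescribed schema.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code (assumed field)


def golden_signals(requests, window_seconds, in_flight, capacity):
    """Summarize one time window of requests into the four signals."""
    n = len(requests)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile: rank = ceil(0.95 * n), in integer math.
    rank = (95 * n + 99) // 100
    p95 = durations[rank - 1] if n else 0.0
    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": p95,
        "error_rate": errors / n if n else 0.0,
        "throughput_rps": n / window_seconds,
        "saturation": in_flight / capacity,
    }
```

Four numbers per service, per window. If the top row of a dashboard shows anything else, it is probably showing the wrong layer.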
Problem #2: Logs Are a Landfill With a Search Box
Nobody designed their logging strategy. What actually happened is this: a developer added a log line in 2019, it was useful once, and now every service emits a structured JSON blob on every function call because someone read that structured logging was best practice and went slightly overboard.
The result is a system where finding signal requires knowing exactly what you're looking for — which means logs are only useful after you already have a hypothesis. That's backwards. Logs should generate hypotheses, not confirm them.
Three things that actually fix this:
Correlation IDs, everywhere, without exception. Every request that enters your system gets a unique ID. That ID travels through every service call, every queue message, every background job spawned by that original request. When something breaks, you pull one ID and see the entire story. Without this, debugging a distributed system is like solving a murder mystery where all the witnesses are in different countries and don't speak the same language.
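A rough sketch of what this can look like inside a single Python service, using the standard library's contextvars and logging modules. The X-Correlation-ID header name and the handle_request / outbound_headers helpers are hypothetical; use whatever convention your stack already has.

```python
import contextvars
import logging
import uuid

# Holds the ID for whatever request the current task is serving.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Stamp every log record with the current correlation ID."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True


def outbound_headers():
    """Headers to attach to any downstream call or queue message."""
    return {"X-Correlation-ID": correlation_id.get()}


def handle_request(headers, logger):
    # Reuse the caller's ID if one arrived; otherwise mint a new one.
    cid = headers.get("X-Correlation-ID") or uuid.uuid4().hex
    token = correlation_id.set(cid)
    try:
        logger.info("request received")
        return outbound_headers()
    finally:
        # Restore the previous value so IDs never leak between requests.
        correlation_id.reset(token)
```

Attach the filter to your handler and add %(correlation_id)s to the log format, and every line a request produces carries the same ID without anyone having to remember to pass it along.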
Log levels that actually mean something. If everything is INFO, nothing is INFO. Establish a standard and enforce it in code review. DEBUG is for development. INFO is for state transitions a human might genuinely care about. WARN is for handled anomalies. ERROR is for things that need human attention. FATAL means wake someone up. Teams that log everything at INFO by default in production are generating noise at industrial scale.
Retention policies that match reality. You do not need millisecond-queryable access to logs from eight months ago. Hot storage for seven days. Warm for thirty. Archive everything else to object storage at a fraction of the cost, retrievable if you ever actually need it. Most teams never need it.
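The tiering rule is simple enough to write down. This is a sketch only: it reads "warm for thirty" as "up to thirty days old", and the storage_tier function and tier names are made up for illustration. In practice the rule usually lives in your log platform's lifecycle configuration rather than in application code.

```python
from datetime import datetime, timedelta, timezone

HOT_DAYS = 7    # fast, queryable storage
WARM_DAYS = 30  # assumption: measured from log time, not from the end of "hot"


def storage_tier(log_time, now):
    """Pick a storage tier based on how old a log entry is."""
    age = now - log_time
    if age <= timedelta(days=HOT_DAYS):
        return "hot"
    if age <= timedelta(days=WARM_DAYS):
        return "warm"
    return "archive"  # cheap object storage, slow to retrieve
```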
Problem #3: Alerts Are a Cry-Wolf Machine
Ask any senior engineer what they do when PagerDuty fires on a Saturday afternoon. If they're being honest, they'll tell you: they check if it's one of those alerts, and if it is, they acknowledge it and go back to what they were doing.
That's not laziness. That's rational adaptation to a broken system.
Alert fatigue is not a people problem. It is an instrumentation problem. When alerts fire on conditions that don't require human action, humans learn to ignore alerts. When humans learn to ignore alerts, the one alert that genuinely requires action gets ignored too. This is how you end up with a P1 incident that technically fired six alerts before anyone noticed.
The fix requires some uncomfortable honesty about your current alert configuration:
For every alert that exists, answer two questions:
1. When this fires at 2am, what specific action should the on-call take?
2. In the last 90 days, did this alert lead to that action, or was it silenced?
If you can't answer question one, delete the alert.
If the answer to question two is "silenced," delete the alert.
Alerts should be opinionated. They should fire when a human needs to do something specific, right now, that cannot wait until morning. Everything else — degraded performance, elevated error rates below SLO thresholds, resource utilization trending in a bad direction — belongs in a dashboard that gets reviewed during business hours, not in someone's ear at 3am.
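The two-question audit above is mechanical enough to script. Here is a minimal sketch, assuming you can export each alert's documented action and its 90-day firing history; the AlertRecord fields are hypothetical and will not match any particular tool's export format.

```python
from dataclasses import dataclass


@dataclass
class AlertRecord:
    name: str
    runbook_action: str  # empty if nobody can say what on-call should do
    fired: int           # times it fired in the last 90 days
    acted_on: int        # firings that led to the documented action


def audit(alerts):
    """Split alerts into (keep, delete) using the two questions."""
    keep, delete = [], []
    for a in alerts:
        # Question 1: is there a specific action for the on-call?
        no_action = not a.runbook_action.strip()
        # Question 2: it fired, and every time it was silenced or ignored.
        only_silenced = a.fired > 0 and a.acted_on == 0
        # Alerts that never fired have no evidence against them, so they stay.
        (delete if no_action or only_silenced else keep).append(a.name)
    return keep, delete
```

The delete list will be longer than you expect. That is the point.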
Problem #4: Tracing Is the Tool Everyone Deploys and Nobody Uses
Distributed tracing is, in theory, the solution to the "I have logs from twelve services and I can't tell what happened" problem. You instrument your services, traces flow through Jaeger or Tempo or X-Ray, and suddenly the entire request lifecycle is visible as a single coherent picture.
In practice, most teams have tracing deployed and barely use it.
The reason is almost always one of two things. Either sampling is too aggressive — you're capturing 1% of traces, so the specific request that broke is almost never in the sample — or the traces exist but nobody built the habit of looking at them during incidents because the dashboards came first and tracing came later as an afterthought.
Sampling strategy matters more than most documentation admits:
Head-based sampling (decide at request start):
→ Simple, low overhead, but you'll miss rare errors
Tail-based sampling (decide after request completes):
→ Captures errors and slow requests reliably
→ Higher infrastructure cost, but worth it for production
Pragmatic middle ground:
→ Sample 100% of errors and requests above latency threshold
→ Sample 1-5% of everything else
If your tracing setup isn't capturing every error at 100%, you have a gap precisely where you need visibility most.
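As a sketch, the pragmatic middle ground comes down to one decision function. Note that it needs the request's outcome, so it has to run after the request completes, which makes it a form of tail-based sampling. The 500 ms threshold, the 5% baseline, and the keep_trace name are assumptions for illustration; real deployments usually express this as sampling policy in the tracing pipeline rather than in application code.

```python
import random

LATENCY_THRESHOLD_MS = 500.0  # assumed threshold; tune to your SLO
BASELINE_RATE = 0.05          # keep 5% of unremarkable traces


def keep_trace(is_error, duration_ms, rng=random.random):
    """Decide, after the request finishes, whether to keep its trace."""
    # Always keep the traces you will actually want during an incident.
    if is_error or duration_ms > LATENCY_THRESHOLD_MS:
        return True
    # Otherwise keep a small random sample as a baseline for comparison.
    return rng() < BASELINE_RATE
```

The rng parameter exists only so the random draw can be replaced with a fixed value when testing.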
Problem #5: Your Observability Stack Has No Owner
Here's the organizational failure underneath all of the technical ones.
Observability infrastructure gets built by whoever needs it first. A backend team instruments their services. A platform team sets up Prometheus. Someone configures Grafana dashboards over a long weekend. A contractor installed the APM agent eighteen months ago, and nobody is quite sure how it's configured.
The result is a system with no coherent design, no standards, and no single person who understands all of it. When something breaks at 2am, the question isn't just "what is wrong with the application?" It's also "can I even trust what my monitoring is telling me right now?"
Observability is infrastructure. It needs to be treated like infrastructure: owned, maintained, improved deliberately, and subject to the same reliability standards as the systems it monitors. A monitoring system that goes down during an incident is worse than no monitoring system at all — it creates false confidence at exactly the wrong moment.
Assign ownership. Define standards. Review the stack the same way you review application architecture. This is not glamorous work. It is, however, the difference between an on-call rotation that functions and one that slowly grinds your best engineers into people who are very good at ignoring alerts.
The Number That Actually Matters
Raw MTTR — mean time to recovery — is the metric most organizations track for incident management. It's fine as far as it goes.
The number worth obsessing over is mean time to understanding.
Recovery is often fast once you know what's wrong. It's the knowing that takes forty minutes at 3am. Every investment in observability (correlation IDs, meaningful alerts, proper tracing, log hygiene) is ultimately an investment in compressing the gap between "something is wrong" and "I know exactly what and why."
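If you want to track it, the arithmetic is the same as MTTR with a different end timestamp. A minimal sketch, assuming your incident records carry three timestamps; the Incident fields here are hypothetical, and "understood_at" in particular is something most teams would have to start recording by hand.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    detected_at: datetime
    understood_at: datetime  # when the cause was identified (assumed field)
    recovered_at: datetime


def mean_delta(pairs):
    """Average the elapsed time across (start, end) pairs."""
    deltas = [end - start for start, end in pairs]
    return sum(deltas, timedelta()) / len(deltas)


def mean_time_to_understanding(incidents):
    return mean_delta((i.detected_at, i.understood_at) for i in incidents)


def mean_time_to_recovery(incidents):
    return mean_delta((i.detected_at, i.recovered_at) for i in incidents)
```

Put the two numbers side by side. When most of the recovery time is spent before anyone understands the problem, that is where the observability work pays off.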
The goal isn't more dashboards. It isn't more data. It's less time staring at information that doesn't answer the question you're actually asking.
Your system is telling you things constantly. The question is whether you've built the conditions to actually hear it.