Engineering
How We Reduced False Positives by 99%
Michael Ross
2025-01-05
6 min read
Alert fatigue is the single biggest threat to effective incident response. If your phone buzzes every night for a "blip" that resolves itself in 10 seconds, you stop checking. Then, when the real outage hits, you sleep right through it.
The "View from One Spot" Problem
Traditional monitoring checks from a single location. If a router in Virginia has a hiccup, your monitor in Virginia thinks your server in Frankfurt is down. It's not. The server is fine; only the network path between the monitor and the server is broken.
Our Solution: The Quorum Protocol
We implemented a consensus algorithm for all downtime alerts:
- Phase 1: Detection. Primary region detects a failure (timeout, 5xx, etc.).
- Phase 2: Verification. Primary immediately requests verification from 3 other random global regions.
- Phase 3: Consensus. If at least 2 of those 3 regions confirm the failure, the alert is triggered. Otherwise, the failure is treated as a local network problem and no alert is sent.
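The three phases above can be sketched in a few lines of Python. This is a minimal illustration, not our production code: the function name `should_alert`, the `probe` callback, and the region names are all hypothetical, and the only values taken from the protocol are the 3 verifiers and the quorum of 2.

```python
import random
from typing import Callable, Optional, Sequence

# Hypothetical probe signature: (region, target) -> True if the target looks down
# from that region. A real probe would make an HTTP or TCP check from that region.
Probe = Callable[[str, str], bool]


def should_alert(
    primary: str,
    regions: Sequence[str],
    target: str,
    probe: Probe,
    verifiers: int = 3,  # "3 other random global regions"
    quorum: int = 2,     # "at least 2 other regions confirm"
    rng: Optional[random.Random] = None,
) -> bool:
    """Return True only if the primary's failure is confirmed by a quorum."""
    rng = rng or random.Random()

    # Phase 1: Detection. No failure seen by the primary means no alert.
    if not probe(primary, target):
        return False

    # Phase 2: Verification. Pick random regions other than the primary.
    others = [r for r in regions if r != primary]
    chosen = rng.sample(others, k=min(verifiers, len(others)))

    # Phase 3: Consensus. Count how many verifiers also see the failure.
    confirmations = sum(1 for region in chosen if probe(region, target))
    return confirmations >= quorum


regions = ["us-east", "eu-central", "ap-south", "sa-east", "us-west"]

# A path problem: only the primary region sees the target as down.
path_blip = lambda region, target: region == "us-east"
# A real outage: every region sees the target as down.
real_outage = lambda region, target: True

blip_alert = should_alert("us-east", regions, "https://example.com", path_blip)
outage_alert = should_alert("us-east", regions, "https://example.com", real_outage)
print(blip_alert, outage_alert)  # False True
```

Passing the probe in as a callback keeps the decision logic separate from the network code, which makes the quorum rule easy to test with fake probes like the two above.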
The Results
Since rolling this out, customer-reported false positives have dropped by 99.2%. We handle the network jitter so you don't have to.
The verification step adds milliseconds to the alert time, but saves hours of sleep. We think that's a trade worth making.
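One way to keep that added delay small is to run the verification checks at the same time rather than one after another, so the wait is roughly the slowest single check instead of the sum of all of them. The sketch below assumes this approach for illustration only; `fake_probe`, `verify_concurrently`, the region names, and the delay values are made-up placeholders, not measurements.

```python
import asyncio
import time
from typing import Dict, Tuple


async def fake_probe(region: str, delay_s: float) -> bool:
    # Stand-in for a real cross-region check: waits, then reports "down".
    await asyncio.sleep(delay_s)
    return True


async def verify_concurrently(delays: Dict[str, float]) -> Tuple[int, float]:
    """Run all verification probes at once; return (confirmations, seconds)."""
    start = time.perf_counter()
    results = await asyncio.gather(
        *(fake_probe(region, delay) for region, delay in delays.items())
    )
    elapsed = time.perf_counter() - start
    return sum(results), elapsed


# Placeholder latencies in seconds, chosen only for the demonstration.
confirmations, elapsed = asyncio.run(
    verify_concurrently({"eu-central": 0.05, "ap-south": 0.08, "sa-east": 0.12})
)
print(confirmations, f"{elapsed:.2f}s")
```

With these placeholder values the total wait is close to the slowest probe (0.12 s), not the 0.25 s a sequential loop would take.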