Anomaly alerts that don't cry wolf at 3am
Per-metric Z-scores, quiet hours, and on-call routing: how to build Slack-native anomaly detection that operators actually trust.
There’s a special kind of damage you do to an analytics function when your alerts page someone at 3 am for a thing that isn’t real. After the second one, the recipient mutes the channel. After the third, they stop trusting any number from your warehouse at all.
So before we ship anomaly alerts, we should be honest about what we want them to do: catch real things, quietly, and never wake someone up for a noise spike at 3 am. Easier to write than to build.
The three rules I follow
1. Per-metric thresholds, not one global Z-score
Different metrics have different shapes. Order count has weekly seasonality. Latency has hourly seasonality. Daily revenue is bursty around marketing sends. A single Z-score formula will be wrong on at least one of these.
import numpy as np

def z_for(metric: str, value: float) -> float:
    # recent(), remove_seasonality(), and metric_period() are this pipeline's
    # own helpers: pull 28 days of history, strip that metric's seasonality.
    series = recent(metric, lookback_days=28)
    deseasonalized = remove_seasonality(series, period=metric_period(metric))
    mu, sigma = np.mean(deseasonalized), np.std(deseasonalized)
    return (value - mu) / sigma if sigma > 0 else 0.0  # guard flat series
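Most of the work hides in metric_period. Here is a sketch of the table that could back it; the metric names and sample rates are hypothetical, and the period is expressed in samples, so it depends on the sample rate as much as on the cycle:

# Hypothetical seasonality table backing metric_period().
METRIC_PERIODS = {
    "order_count": 7 * 24,      # weekly cycle, hourly samples
    "api_p95_latency": 24,      # daily cycle, hourly samples
    "daily_revenue": 7,         # weekly cycle, daily samples
}

def metric_period(metric: str) -> int:
    return METRIC_PERIODS.get(metric, 24)  # assume a daily cycle by default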
2. Quiet hours, by metric, by audience
Operational alerts can fire 24/7. Finance alerts wait until 8 am local. Product alerts wait until standup. None of this is technical — it’s social.
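The policy still has to live in code somewhere. A minimal sketch, assuming a table keyed by (metric, audience); the names, windows, and timezone here are hypothetical:

from datetime import datetime, time
from zoneinfo import ZoneInfo

# Hypothetical delivery windows per (metric, audience). Anything not
# listed is treated as operational and may fire 24/7.
DELIVERY_WINDOWS = {
    ("daily_revenue", "finance"): (time(8, 0), time(18, 0)),
    ("signup_funnel", "product"): (time(10, 0), time(18, 0)),  # after standup
}

def may_deliver(metric: str, audience: str, tz: str = "America/New_York") -> bool:
    window = DELIVERY_WINDOWS.get((metric, audience))
    if window is None:
        return True  # no quiet hours configured: deliver immediately
    start, end = window
    return start <= datetime.now(ZoneInfo(tz)).time() < end

The companion decision is what happens to an alert that fires inside quiet hours: queue it for the window's open or drop it. Queueing is usually what the audience expects.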
3. Routing through on-call, not a free-for-all channel
A #data-alerts channel that no one owns is a channel no one reads. Pager-style routing (alert → on-call → ack → escalate) is what turns alerts into something operators trust.
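The loop itself is small. A sketch, where on_call(), page(), acked(), and fallback_owner() are hypothetical stand-ins for your rota store and Slack client:

import time

ACK_TIMEOUT_S = 10 * 60  # escalate if nobody acks within 10 minutes

def route(alert_id: str) -> None:
    # on_call() yields the escalation chain in order: primary, secondary, ...
    for engineer in on_call():
        page(engineer, alert_id)
        deadline = time.monotonic() + ACK_TIMEOUT_S
        while time.monotonic() < deadline:
            if acked(alert_id):
                return  # someone owns it; stop escalating
            time.sleep(30)
    page(fallback_owner(), alert_id)  # chain exhausted: last resort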
What changed when I did this
False-positive rate dropped below 5%. Time-to-acknowledge dropped from “sometime next morning” to under 10 minutes during business hours. The number of times someone said “is the alert real?” before acting dropped to roughly zero.
That last metric is the one that matters.