
Anomaly alerts that don't cry wolf at 3am

Per-metric Z-scores, quiet hours, and on-call routing — how to build Slack-native anomaly detection that operators actually trust.

There’s a special kind of damage you do to an analytics function when your alerts page someone at 3 am for a thing that isn’t real. After the second one, the recipient mutes the channel. After the third, they stop trusting any number from your warehouse at all.

So before we ship anomaly alerts, we should be honest about what we want them to do: catch real things, quietly, and never wake someone up for a noise spike at 3 am. Easier to write than to build.

The three rules I follow

1. Per-metric thresholds, not one global Z-score

Different metrics have different shapes. Order count has weekly seasonality. Latency has hourly seasonality. Daily revenue is bursty around marketing sends. A single Z-score formula will be wrong on at least one of these.

def z_for(metric: str, value: float) -> float:
    series = recent(metric, lookback_days=28)        # trailing window of observations
    deseasonalized = remove_seasonality(series, period=metric_period(metric))
    mu, sigma = stats(deseasonalized)                # mean and standard deviation
    if sigma == 0:
        return 0.0  # a flat series can't be anomalous by this test
    return (value - mu) / sigma
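The helpers `z_for` leans on are left undefined above. Here is one minimal way to fill in `remove_seasonality` and `stats`, assuming an additive seasonal model where each position in the cycle (hour-of-day, day-of-week) has its own baseline; `recent` and `metric_period` depend on your warehouse, so they're omitted:

```python
from statistics import mean, stdev

def remove_seasonality(series: list[float], period: int) -> list[float]:
    """Subtract the mean of each seasonal position (e.g. day-of-week)
    from every observation, leaving residuals centred around zero."""
    buckets: dict[int, list[float]] = {}
    for i, v in enumerate(series):
        buckets.setdefault(i % period, []).append(v)
    means = {pos: mean(vs) for pos, vs in buckets.items()}
    return [v - means[i % period] for i, v in enumerate(series)]

def stats(series: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation of the residual series."""
    return mean(series), stdev(series)
```

A series that is pure seasonality — say, the same weekly shape repeated for four weeks — deseasonalizes to all-zero residuals, which is exactly why the `sigma == 0` guard in `z_for` matters.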

2. Quiet hours, by metric, by audience

Operational alerts can fire 24/7. Finance alerts wait until 8 am local. Product alerts wait until standup. None of this is technical — it’s social.
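In code, rule 2 is just a lookup before delivery. Everything in this sketch — the audience names, the hours, the `should_deliver` function — is illustrative, not from any real alerting library:

```python
from datetime import datetime, time

# Hypothetical per-audience quiet-hour windows (local time).
# No entry means the alert can fire 24/7 (the operational case).
QUIET_HOURS: dict[str, tuple[time, time]] = {
    "finance": (time(0, 0), time(8, 0)),   # hold until 8 am local
    "product": (time(0, 0), time(9, 30)),  # hold until standup
}

def should_deliver(audience: str, now: datetime) -> bool:
    """Return True if an alert for this audience may go out now."""
    window = QUIET_HOURS.get(audience)
    if window is None:
        return True  # no quiet hours configured: deliver immediately
    start, end = window
    return not (start <= now.time() < end)
```

Alerts suppressed during quiet hours should be queued, not dropped, so the finance team still sees them at 8 am.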

3. Routing through on-call, not a free-for-all channel

A #data-alerts channel that no one owns is a channel no one reads. Pager-style routing — alert → on-call → ack → escalate — is what turns alerts into something operators trust.

What changed when I did this

False-positive rate dropped under 5%. Time-to-acknowledge dropped from “sometime next morning” to under 10 minutes during business hours. The number of times someone said “is the alert real?” before acting dropped to roughly zero.

That last metric is the one that matters.