Alerting hygiene — ntfy refactored into four channels
Context
The cluster's notifications (health, supervision, incidents, bus alerts, dashboards) all flowed through a single ntfy channel. One night, a flapping service triggered a storm of dozens of notifications within a few minutes. With the phone buzzing on loop, signal became indistinguishable from noise.
Constraint
Three requirements:
- Never again endure a storm triggered by an identical repeating message.
- Be able to tell apart the register of a message just from the sound — infra coughing, database wobbling, AI pipeline waking up, dashboard animating.
- Keep a single stack: no piling up of new tools.
Decision
Four semantic topics, one ntfy engine:
- dashboard — user-facing events, interesting to see, no urgency.
- mln-infra — node state, restarts, healthchecks.
- mln-db — databases, integrity, latency.
- mln-ia — vision pipelines, Ollama, queues.
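To keep the split declarative, the mapping can live in one small routing table that producers share. A minimal sketch in Python; only the four topic names come from the design, the base URL and register keys are placeholders:

```python
# Hypothetical routing table: the register an event belongs to decides
# which ntfy topic it goes to. Only the topic names are from the design;
# the base URL and register keys are placeholders.
NTFY_BASE = "https://ntfy.example.lan"

TOPICS = {
    "dashboard": "dashboard",  # user-facing events, no urgency
    "infra": "mln-infra",      # node state, restarts, healthchecks
    "db": "mln-db",            # databases, integrity, latency
    "ia": "mln-ia",            # vision pipelines, Ollama, queues
}

def topic_url(register: str) -> str:
    """Publish URL for a given register, e.g. topic_url("infra")."""
    return f"{NTFY_BASE}/{TOPICS[register]}"
```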
On the server: a dedicated mln-services user with one token per register; each service can publish only to its authorized channel.
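Client-side, each service then holds nothing but its own topic and its own token; publishing stays a plain HTTP POST that the server-side access control accepts or rejects. A minimal sketch, assuming the values are injected per service through environment variables (variable names and host are illustrative):

```python
# One service, one topic, one token. NTFY_URL / NTFY_TOPIC / NTFY_TOKEN are
# assumed to come from the service's environment (e.g. its systemd unit).
import os
import requests

NTFY_URL = os.environ.get("NTFY_URL", "https://ntfy.example.lan")
NTFY_TOPIC = os.environ["NTFY_TOPIC"]   # e.g. "mln-infra"
NTFY_TOKEN = os.environ["NTFY_TOKEN"]   # token allowed to write this topic only

def notify(title: str, body: str, priority: str = "default") -> None:
    """Publish one notification on the service's authorized topic."""
    resp = requests.post(
        f"{NTFY_URL}/{NTFY_TOPIC}",
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {NTFY_TOKEN}",
            "Title": title,
            "Priority": priority,
        },
        timeout=5,
    )
    # A 403 here means the token tried to step outside its register,
    # which is exactly what the per-topic authorization is there to catch.
    resp.raise_for_status()
```

The Title, Priority and Authorization: Bearer headers are standard ntfy publish headers; the scoping itself lives entirely in the server's access control.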
On the dedup side: a persisted cache on the producer. Before publishing, the service hashes (topic, title, body, 5-min window). If the hash was already published in the window, the send is suppressed. Persisted, because services restart and used to lose their in-RAM cache — that's precisely what had caused the original storm.
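The dedup itself is small enough to live in the producer. A minimal sketch of the idea, assuming SQLite as the persisted cache; the file path is a placeholder, only the 5-minute window comes from the design, and bucketing time into fixed windows is one simple reading of it:

```python
# Persisted dedup: suppress a send if the same (topic, title, body) was
# already published in the current 5-minute window. SQLite survives service
# restarts, unlike the in-RAM cache that let the original storm through.
import hashlib
import sqlite3
import time

WINDOW_S = 300                                # 5-minute window
DB_PATH = "/var/lib/mln/ntfy-dedup.sqlite"    # hypothetical location

def _db() -> sqlite3.Connection:
    conn = sqlite3.connect(DB_PATH)
    conn.execute("CREATE TABLE IF NOT EXISTS seen (h TEXT PRIMARY KEY, ts REAL)")
    return conn

def should_send(topic: str, title: str, body: str) -> bool:
    """Return False if an identical message already went out in this window."""
    now = time.time()
    bucket = int(now // WINDOW_S)             # same bucket, same window
    key = f"{topic}\x00{title}\x00{body}\x00{bucket}".encode("utf-8")
    h = hashlib.sha256(key).hexdigest()
    with _db() as conn:
        # Prune entries older than one window so the table stays tiny.
        conn.execute("DELETE FROM seen WHERE ts < ?", (now - WINDOW_S,))
        try:
            conn.execute("INSERT INTO seen (h, ts) VALUES (?, ?)", (h, now))
        except sqlite3.IntegrityError:
            return False                      # duplicate in this window: suppress
    return True
```

A producer then guards every publish with should_send(...) before calling its notify helper; a crash and restart in the middle of a storm no longer resets the counter, because the hashes are on disk.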
Measurement
Before: a measured peak of 47 notifications in 12 minutes for a single incident. After: at most 3 distinct notifications observed for a comparable incident, then silence as long as the root cause doesn't change.
What remains
On-call doesn't get woken up by noise anymore. The audible distinction per channel lets you gauge in two seconds whether to get up or not. The persisted-dedup pattern has spread to other services that talk to external webhooks — same benefit everywhere.