Alerting hygiene — ntfy refactored into four channels
Context
The cluster's notifications (health, supervision, incidents, bus alerts, dashboards) all flowed through a single ntfy channel. One night, a flapping service triggered a storm of dozens of notifications within a few minutes. With the phone buzzing on loop, signal became indistinguishable from noise.
Constraint
Three requirements:
- Never again endure a storm triggered by an identical repeating message.
- Be able to tell apart the register of a message just from the sound — infra coughing, database wobbling, AI pipeline waking up, dashboard animating.
- Keep a single stack: no piling up of new tools.
Decision
Four semantic topics, one ntfy engine:
- dashboard — user-facing events, interesting to see, no urgency.
- mln-infra — node state, restarts, healthchecks.
- mln-db — databases, integrity, latency.
- mln-ia — vision pipelines, Ollama, queues.
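To keep the split declarative, the mapping can live in one small routing table that producers share. A minimal sketch in Python; only the four topic names come from the design, the base URL and register keys are placeholders:

```python
# Hypothetical routing table: the register an event belongs to decides
# which ntfy topic it goes to. Only the topic names are from the design;
# the base URL and register keys are placeholders.
NTFY_BASE = "https://ntfy.example.lan"

TOPICS = {
    "dashboard": "dashboard",  # user-facing events, no urgency
    "infra": "mln-infra",      # node state, restarts, healthchecks
    "db": "mln-db",            # databases, integrity, latency
    "ia": "mln-ia",            # vision pipelines, Ollama, queues
}

def topic_url(register: str) -> str:
    """Publish URL for a given register, e.g. topic_url("infra")."""
    return f"{NTFY_BASE}/{TOPICS[register]}"
```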
On the server: a dedicated mln-services user with one token per register; each service can publish only to its authorized channel.
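Client-side, each service then holds nothing but its own topic and its own token; publishing stays a plain HTTP POST that the server-side access control accepts or rejects. A minimal sketch, assuming the values are injected per service through environment variables (variable names and host are illustrative):

```python
# One service, one topic, one token. NTFY_URL / NTFY_TOPIC / NTFY_TOKEN are
# assumed to come from the service's environment (e.g. its systemd unit).
import os
import requests

NTFY_URL = os.environ.get("NTFY_URL", "https://ntfy.example.lan")
NTFY_TOPIC = os.environ["NTFY_TOPIC"]   # e.g. "mln-infra"
NTFY_TOKEN = os.environ["NTFY_TOKEN"]   # token allowed to write this topic only

def notify(title: str, body: str, priority: str = "default") -> None:
    """Publish one notification on the service's authorized topic."""
    resp = requests.post(
        f"{NTFY_URL}/{NTFY_TOPIC}",
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {NTFY_TOKEN}",
            "Title": title,
            "Priority": priority,
        },
        timeout=5,
    )
    # A 403 here means the token tried to step outside its register,
    # which is exactly what the per-topic authorization is there to catch.
    resp.raise_for_status()
```

The Title, Priority and Authorization: Bearer headers are standard ntfy publish headers; the scoping itself lives entirely in the server's access control.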
On the dedup side: a persisted cache on the producer. Before publishing, the service hashes (topic, title, body, 5-min window). If the hash was already published in the window, the send is suppressed. Persisted, because services restart and used to lose their in-RAM cache — that's precisely what had caused the original storm.
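The dedup itself is small enough to live in the producer. A minimal sketch of the idea, assuming SQLite as the persisted cache; the file path is a placeholder, only the 5-minute window comes from the design, and bucketing time into fixed windows is one simple reading of it:

```python
# Persisted dedup: suppress a send if the same (topic, title, body) was
# already published in the current 5-minute window. SQLite survives service
# restarts, unlike the in-RAM cache that let the original storm through.
import hashlib
import sqlite3
import time

WINDOW_S = 300                                # 5-minute window
DB_PATH = "/var/lib/mln/ntfy-dedup.sqlite"    # hypothetical location

def _db() -> sqlite3.Connection:
    conn = sqlite3.connect(DB_PATH)
    conn.execute("CREATE TABLE IF NOT EXISTS seen (h TEXT PRIMARY KEY, ts REAL)")
    return conn

def should_send(topic: str, title: str, body: str) -> bool:
    """Return False if an identical message already went out in this window."""
    now = time.time()
    bucket = int(now // WINDOW_S)             # same bucket, same window
    key = f"{topic}\x00{title}\x00{body}\x00{bucket}".encode("utf-8")
    h = hashlib.sha256(key).hexdigest()
    with _db() as conn:
        # Prune entries older than one window so the table stays tiny.
        conn.execute("DELETE FROM seen WHERE ts < ?", (now - WINDOW_S,))
        try:
            conn.execute("INSERT INTO seen (h, ts) VALUES (?, ?)", (h, now))
        except sqlite3.IntegrityError:
            return False                      # duplicate in this window: suppress
    return True
```

A producer then guards every publish with should_send(...) before calling its notify helper; a crash and restart in the middle of a storm no longer resets the counter, because the hashes are on disk.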
Measurement
Before: a measured peak of 47 notifications in 12 minutes for a single incident. After: at most 3 distinct notifications observed for a comparable incident, then silence as long as the root cause doesn't change.
What remains
On-call doesn't get woken up by noise anymore. The audible distinction per channel lets you gauge in two seconds whether to get up or not. The persisted-dedup pattern has spread to other services that talk to external webhooks — same benefit everywhere.