SWIM: Failure Detection at Scale

100 nodes in a cluster. Every node needs to know when another node fails. If every node heartbeats every other node, that’s 9,900 heartbeat streams. At scale, this becomes the majority of your network traffic.

How SWIM Works#

SWIM (Scalable Weakly-consistent Infection-style Membership) uses gossip-based dissemination with indirect probing. Instead of every node monitoring every other node, each node monitors a small random subset. When a node suspects another has failed, it asks a few other nodes to probe the suspect on its behalf. If none of them can reach it either, the node is declared failed.

Indirect probing is the key insight. A single missed heartbeat could be a network glitch between two specific nodes. If three independent nodes can’t reach the suspect, it’s more likely the suspect is actually down.

graph TD A[Node A suspects Node D is down] --> B[Node A pings Node D - no response] B --> C[Node A asks Nodes B and C to probe Node D] C --> D{B or C can reach D?} D -->|Yes| E[Node D alive - was a network glitch] D -->|No| F[Node D marked suspect] F --> G[Gossip: broadcast suspect status] G --> H[No rebuttal within timeout: Node D declared failed] style A fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff style B fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff style C fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff style D fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff style E fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff style F fill:#000000,stroke:#ff0000,stroke-width:2px,color:#fff style G fill:#000000,stroke:#ff0000,stroke-width:2px,color:#fff style H fill:#000000,stroke:#ff0000,stroke-width:2px,color:#fff

Suspicion State#

SWIM doesn’t immediately declare failure. A suspected node has a window to refute the suspicion. If it’s alive and receiving gossip, it broadcasts its own “I’m alive” message, and the suspicion is cancelled. This prevents false positives during GC pauses or momentary network hiccups.

The suspicion timeout is a trade-off. Short timeout means faster failure detection but more false positives. Long timeout means accurate detection but slower response.

Gossip Dissemination#

Membership changes propagate via gossip. When Node D is declared failed, a few nodes broadcast that fact to a few others. Each recipient forwards to a few more. Within O(log N) rounds, all nodes know. Compare this to O(N^2) heartbeat overhead in full-mesh approaches.

Cassandra uses a version of SWIM for cluster membership. HashiCorp’s Serf library, which backs Consul, is also SWIM-based.

At Oracle#

Our telecom cluster didn’t use SWIM, but we had a similar problem with leader election heartbeats. Every node monitored the leader directly. When we scaled from 5 nodes to 12, the leader received 11 heartbeat streams and was spending noticeable CPU on responses under load. Switching to a tree-based topology (each node monitors its parent, not the root) reduced leader load. SWIM solves this more elegantly by making monitoring local and spreading failure information via gossip.

What I’m Learning#

False positives in failure detection cause cascading problems. You route traffic away from a node that was just slow, it recovers, now it’s idle while the remaining nodes are overloaded. Tuning detection sensitivity is as important as having detection at all.

What failure detection mechanism does your cluster use, and have you hit false positive failures?