SWIM: Failure Detection at Scale
100 nodes in a cluster. Every node needs to know when another node fails. If every node heartbeats every other node, that’s 9,900 heartbeat streams. At scale, this becomes the majority of your network traffic.
How SWIM Works SWIM (Scalable Weakly-consistent Infection-style Membership) uses gossip-based dissemination with indirect probing. Instead of every node monitoring every other node, each node monitors a small random subset. When a node suspects another has failed, it asks a few other nodes to probe the suspect on its behalf.