A single Redis instance holds your lock. Redis crashes and restarts (or fails over to a replica that never saw the key), and the lock entry is gone. But your client received “acquired” before the crash and is happily running. Another client acquires the same lock on the recovered instance. Two lock holders. That is the fundamental flaw of the single-instance Redis lock.
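
The vulnerable pattern itself is one atomic command. A minimal sketch, assuming a redis-py-style client (the `acquire_single` helper name is mine, not a library API):

```python
import uuid

def acquire_single(client, key, ttl_ms):
    """Single-instance lock: one atomic SET with NX and PX.
    `client` is assumed to expose redis-py's set(key, value, nx=..., px=...)."""
    token = uuid.uuid4().hex
    # NX: only set if the key does not already exist; PX: expire after ttl_ms.
    if client.set(key, token, nx=True, px=ttl_ms):
        return token  # we hold the lock until release or TTL expiry
    return None       # someone else holds it
```

The random token matters: releasing should compare-and-delete against it, so a slow client cannot remove a lock that has since expired and been re-acquired by someone else.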

Quorum Locking#

Redlock is Redis creator Antirez’s answer. Instead of one Redis, use N independent instances (typically 5). To acquire the lock:

  1. Record the current time.
  2. Try to acquire the lock on each instance sequentially with the same key and a random value, using a short per-instance timeout so one unreachable instance doesn’t stall the whole round.
  3. Count how many instances returned success.
  4. If you got N/2+1 (majority quorum) and the total acquisition time is less than the lock’s TTL, the lock is valid.

The validity time matters: valid_time = TTL - (t2 - t1) - clock_drift_margin, where t1 is the time recorded in step 1 and t2 is the time when the last instance responded. If acquiring all instances took 28 seconds of your 30-second TTL, you effectively hold a 2-second lock. Probably not worth using.
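The steps above, including the validity calculation, can be sketched as follows. This assumes redis-py-style clients; `acquire_redlock` and `release_redlock` are hypothetical helper names, and an error or timeout on any one instance simply counts as a missed vote:

```python
import time
import uuid

CLOCK_DRIFT_FACTOR = 0.01  # drift allowance proportional to TTL (assumption)

RELEASE_SCRIPT = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
else
    return 0
end
"""

def acquire_redlock(clients, key, ttl_ms):
    token = uuid.uuid4().hex        # random value identifying this holder
    quorum = len(clients) // 2 + 1  # N/2+1, the majority quorum
    start = time.monotonic()        # step 1: record the current time
    acquired = 0
    for c in clients:               # step 2: try each instance in turn
        try:
            if c.set(key, token, nx=True, px=ttl_ms):
                acquired += 1       # step 3: count successes
        except Exception:
            pass                    # unreachable instance = missed vote
    elapsed_ms = (time.monotonic() - start) * 1000
    drift_ms = ttl_ms * CLOCK_DRIFT_FACTOR + 2
    validity_ms = ttl_ms - elapsed_ms - drift_ms
    if acquired >= quorum and validity_ms > 0:  # step 4: quorum + time left
        return token, validity_ms
    release_redlock(clients, key, token)  # clean up a partial acquisition
    return None, 0

def release_redlock(clients, key, token):
    # Compare-and-delete via Lua, so we never remove a lock someone else holds.
    for c in clients:
        try:
            c.eval(RELEASE_SCRIPT, 1, key, token)
        except Exception:
            pass
```

Note the failure path: even an unsuccessful round may have set the key on a minority of instances, so those must be released, or other clients are blocked until the TTL expires.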

```mermaid
%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#000000','primaryTextColor':'#00ff00','primaryBorderColor':'#00ff00','lineColor':'#00ff00','secondaryColor':'#000000','tertiaryColor':'#000000','noteBkgColor':'#000000','noteBorderColor':'#00ff00','noteTextColor':'#00ff00'}}}%%
sequenceDiagram
    autonumber
    participant C as Client
    participant R1 as Redis 1
    participant R2 as Redis 2
    participant R3 as Redis 3
    C->>R1: SET lock val NX PX 30000
    R1-->>C: OK
    C->>R2: SET lock val NX PX 30000
    R2-->>C: OK
    C->>R3: SET lock val NX PX 30000
    R3-->>C: nil (already held)
    Note over C: 2/3 acquired within TTL, lock valid
```

The Debate#

Martin Kleppmann argued Redlock is still unsafe: a GC pause or clock skew can cause a lock holder to run past its validity window without knowing. Antirez’s rebuttal: Redlock is designed for “efficiency” locks (avoiding duplicate work), not “correctness” locks (preventing data corruption). For correctness, use fencing tokens.
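Fencing tokens move the final check to the resource itself: the lock service hands out a monotonically increasing token with each grant, and the storage layer rejects writes carrying a token older than one it has already seen. A sketch of the resource-side check (the `FencedStore` class is hypothetical, and it assumes the lock service provides monotonic tokens):

```python
class FencedStore:
    """A store that enforces fencing tokens on every write."""

    def __init__(self):
        self.highest_token = -1
        self.data = {}

    def write(self, token, key, value):
        # A client that paused (GC, network) and resumed after its lock
        # expired will arrive with a stale token and be rejected here,
        # even though it still believes it holds the lock.
        if token < self.highest_token:
            raise PermissionError(f"stale fencing token {token}")
        self.highest_token = token
        self.data[key] = value
```

This is why Kleppmann's critique bites: plain Redlock hands out random values, not monotonic tokens, so the resource has no way to order two holders that both believe they are current.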

For most distributed job scheduling scenarios, Redlock is good enough. If you need stronger guarantees, a consensus-based coordinator like ZooKeeper is the right tool.

At Salesforce#

We had a single Redis lock for distributing cron job execution across instances. During a Redis failover (primary died, replica promoted within a few seconds), the replica hadn’t replicated the lock key yet. Two job instances both got “acquired.” The job ran twice, sending duplicate notifications to roughly 40,000 users. Redlock across 3 independent Redis instances would have made this far less likely. That incident moved “distributed locking strategy” from “good idea” to “actual roadmap item.”

What I’m Learning#

Redlock trades some complexity for resilience against single-instance failure. Whether you need that trade-off depends on your tolerance for duplicate work.

Have you used Redlock in production, or do you rely on single-instance Redis locks and accept the edge case risk?