Distributed Locks: When One Process Must Win
You have a cron job that sends daily reports. You deploy to 5 servers. Now you have 5 cron jobs sending 5 reports.
You need a lock. Only one server should run the job.
On a single machine, this is easy. synchronized in Java. threading.Lock in Python. But your servers don’t share memory. They need to agree on who holds the lock.
Welcome to distributed locking. It’s harder than it looks.
The Naive Approach
Redis has SETNX (set if not exists). Simple lock:
```
SETNX job:daily-report "server-1"
EXPIRE job:daily-report 60
```
If the key doesn’t exist, you got the lock. Run your job. Delete the key when done.
```
DEL job:daily-report
```
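The same flow in Python, sketched against a tiny in-memory stand-in for Redis so the logic is self-contained. (With a real server you'd use redis-py's `r.set(key, value, nx=True, ex=60)`, which does the SETNX and the EXPIRE in one atomic step.)

```python
import time

# In-memory stand-in for Redis (SETNX / EXPIRE / DEL only),
# just to make the naive lock's logic visible without a server.
class FakeRedis:
    def __init__(self):
        self.store = {}  # key -> (value, expires_at)

    def setnx(self, key, value):
        entry = self.store.get(key)
        if entry and entry[1] > time.monotonic():
            return False  # key exists and hasn't expired
        self.store[key] = (value, float("inf"))
        return True

    def expire(self, key, seconds):
        value, _ = self.store[key]
        self.store[key] = (value, time.monotonic() + seconds)

    def delete(self, key):
        self.store.pop(key, None)

r = FakeRedis()

def try_acquire(server_id):
    if r.setnx("job:daily-report", server_id):
        r.expire("job:daily-report", 60)
        return True
    return False
```

Only the first server to call `try_acquire` wins; everyone else sees `False` until the key is deleted or expires.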
What could go wrong?
Everything
Problem 1: Crash before delete
Server-1 gets the lock. Server-1 crashes. The key sits there until expiry. Other servers wait 60 seconds doing nothing.
Problem 2: Slow job
Server-1 gets the lock with 60-second expiry. The job takes 90 seconds. At second 60, the lock expires. Server-2 grabs it. Now both are running.
Problem 3: Delete someone else’s lock
Server-1’s lock expires. Server-2 acquires it. Server-1 finishes and calls DEL. Server-1 just deleted Server-2’s lock. Server-3 grabs it. Chaos.
Better: Check Before Delete
Store your identity in the lock. Only delete if it’s yours.
```lua
-- Lua script for atomic check-and-delete
if redis.call("GET", KEYS[1]) == ARGV[1] then
  return redis.call("DEL", KEYS[1])
else
  return 0
end
```
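The same compare-and-delete, sketched in Python against a plain dict. (With redis-py you'd load the Lua script once via `register_script` and invoke it with `keys=[...]` and `args=[...]`; the point of the script is that the GET and the DEL run as one atomic step on the server.)

```python
import uuid

locks = {}  # key -> holder's random value; stands in for the Redis key

def acquire(key):
    if key in locks:
        return None
    token = str(uuid.uuid4())  # your identity, stored in the lock (ARGV[1])
    locks[key] = token
    return token

def release(key, token):
    # Mirrors the Lua script: compare, then delete only on a match.
    # In real Redis these two steps MUST run as one atomic script;
    # a GET followed by a separate DEL reintroduces the race.
    if locks.get(key) == token:
        del locks[key]
        return 1
    return 0
```

A holder with a stale or wrong token gets `0` back and deletes nothing, which is exactly the Problem 3 fix.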
This prevents deleting someone else’s lock. But the slow job problem remains.
Redlock
Redis creator Antirez proposed Redlock. Instead of one Redis instance, use 5 independent instances.
To acquire a lock:
- Get the current time
- Try to acquire the lock on all 5 instances, with the same key and a random value
- The lock is acquired if you got it on a majority (3+) and the total time taken is less than the lock TTL
- If that fails, release the lock on all instances
The idea: even if 2 instances fail or have network issues, the majority still agree.
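Those four steps can be sketched as follows, with the five instances simulated as dicts. (A real client such as redlock-py adds per-instance timeouts, retries, and a clock-drift allowance on top of this skeleton.)

```python
import time
import uuid

# Five independent "Redis" instances, simulated as in-memory dicts.
instances = [dict() for _ in range(5)]

def redlock_acquire(key, ttl=10.0):
    token = str(uuid.uuid4())            # same random value everywhere
    start = time.monotonic()             # step 1: get current time
    acquired = 0
    for inst in instances:               # step 2: try every instance
        if key not in inst:              # SETNX against this instance
            inst[key] = token
            acquired += 1
    elapsed = time.monotonic() - start
    # Step 3: success needs a majority AND acquisition faster than the TTL.
    if acquired >= len(instances) // 2 + 1 and elapsed < ttl:
        return token
    for inst in instances:               # step 4: failed, release everywhere
        if inst.get(key) == token:
            del inst[key]
    return None
```

A second caller finds the key on all five instances, gets no majority, and comes back empty-handed.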
The controversy: Martin Kleppmann wrote a famous critique. Redlock assumes bounded clock drift and bounded network delays. In practice, those assumptions break: GC pauses, NTP jumps, and network partitions can all cause two processes to believe they hold the lock at the same time.
I’m not going to declare a winner. Both sides have valid points. The takeaway: distributed locks are fundamentally limited by the same impossibility we saw in the Two Generals Problem.
Fencing Tokens
Here’s what actually makes distributed locks safe: fencing tokens.
Every time a lock is acquired, generate a monotonically increasing token. The lock holder must include this token in all operations.
Lock acquired: token = 34
Lock acquired: token = 35 (new holder)
The resource being protected (database, API) rejects operations with old tokens.
Server-1 pauses. Its lock expires. Server-2 gets token 35. Server-1 wakes up, tries to write with token 34. Database rejects it.
The lock isn’t the safety mechanism. The token is.
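A minimal sketch of that idea: the lock service hands out strictly increasing tokens, and the protected resource (here a toy key-value store, names mine) rejects anything older than what it has already seen.

```python
import itertools

# The lock service's monotonic token source (starts at 34 to
# match the example above).
token_source = itertools.count(34)

def acquire_lock():
    return next(token_source)  # each grant gets a strictly larger token

class FencedStore:
    """The protected resource enforces safety, not the lock: any
    write carrying a token older than one already seen is rejected."""
    def __init__(self):
        self.highest_seen = -1
        self.data = {}

    def write(self, token, key, value):
        if token < self.highest_seen:
            raise PermissionError(f"stale fencing token {token}")
        self.highest_seen = token
        self.data[key] = value

store = FencedStore()
t1 = acquire_lock()   # server-1 gets token 34
t2 = acquire_lock()   # server-1's lock expired; server-2 gets token 35
store.write(t2, "report", "from-server-2")   # current holder: accepted
```

When the paused server-1 wakes up and writes with token 34, `FencedStore` raises, and the stale write never lands.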
When to Avoid Locks
Distributed locks are a last resort. Ask first:
Can you make it idempotent? If running twice produces the same result, maybe you don’t need a lock. Let all servers run, ignore duplicates.
Can you use a queue? Put jobs in a queue. One consumer processes each job. No lock needed.
Can you partition? Server-1 handles users A-M. Server-2 handles N-Z. No overlap, no lock.
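The partitioning idea, done with a stable hash instead of an alphabet split (a sketch; the server names are placeholders): every server computes the same owner for a given user, so each job runs exactly once with no coordination at all.

```python
import hashlib

SERVERS = ["server-1", "server-2"]

def owner(user_id: str) -> str:
    # A stable hash means every server independently agrees on
    # who handles which user. No coordination, no lock.
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]

def should_process(me: str, user_id: str) -> bool:
    return owner(user_id) == me
```

Each server calls `should_process(my_name, user_id)` and skips users it doesn't own; exactly one server processes each user.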
Locks add complexity, latency, and failure modes. Avoid them if you can.
What I’m Learning
I used to reach for locks whenever I needed “only one.” Now I treat them as a code smell.
The pattern that changed my thinking: if you need a lock, you probably need a fencing token too. The lock alone isn’t enough. And if you need fencing tokens, you’re building something complex. Is there a simpler design?
Sometimes the answer is still “yes, I need a lock.” Cron jobs, leader election, resource reservation. But I ask the question now instead of assuming.
Have you ever had a distributed lock fail in production? What broke?