Distributed Locks: When One Process Must Win
You have a cron job that sends daily reports. You deploy to 5 servers. Now you have 5 cron jobs sending 5 reports.
You need a lock. Only one server should run the job.
On a single machine, this is easy. synchronized in Java. threading.Lock in Python. But your servers don’t share memory. They need to agree on who holds the lock.
Welcome to distributed locking. It’s harder than it looks.
The Naive Approach
Redis has SETNX (set if not exists). Simple lock:
```
SETNX job:daily-report "server-1"
EXPIRE job:daily-report 60
```
If the key doesn’t exist, you got the lock. Run your job. Delete the key when done.
```
DEL job:daily-report
```
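The same flow in Python, sketched against a tiny in-memory stand-in for Redis so the logic is self-contained. (With a real server you'd use redis-py's `r.set(key, value, nx=True, ex=60)`, which does the SETNX and the EXPIRE in one atomic step.)

```python
import time

# In-memory stand-in for Redis (SETNX / EXPIRE / DEL only),
# just to make the naive lock's logic visible without a server.
class FakeRedis:
    def __init__(self):
        self.store = {}  # key -> (value, expires_at)

    def setnx(self, key, value):
        entry = self.store.get(key)
        if entry and entry[1] > time.monotonic():
            return False  # key exists and hasn't expired
        self.store[key] = (value, float("inf"))
        return True

    def expire(self, key, seconds):
        value, _ = self.store[key]
        self.store[key] = (value, time.monotonic() + seconds)

    def delete(self, key):
        self.store.pop(key, None)

r = FakeRedis()

def try_acquire(server_id):
    if r.setnx("job:daily-report", server_id):
        r.expire("job:daily-report", 60)
        return True
    return False
```

Only the first server to call `try_acquire` wins; everyone else sees `False` until the key is deleted or expires.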
What could go wrong?
Everything
Problem 1: Crash before delete
Server-1 gets the lock. Server-1 crashes. The key sits there until expiry. Other servers wait 60 seconds doing nothing.
Problem 2: Slow job
Server-1 gets the lock with 60-second expiry. The job takes 90 seconds. At second 60, the lock expires. Server-2 grabs it. Now both are running.
Problem 3: Delete someone else’s lock
Server-1’s lock expires. Server-2 acquires it. Server-1 finishes and calls DEL. Server-1 just deleted Server-2’s lock. Server-3 grabs it. Chaos.
Better: Check Before Delete
Store your identity in the lock. Only delete if it’s yours.
```lua
-- Lua script for atomic check-and-delete
if redis.call("GET", KEYS[1]) == ARGV[1] then
  return redis.call("DEL", KEYS[1])
else
  return 0
end
```
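The same compare-and-delete, sketched in Python against a plain dict. (With redis-py you'd load the Lua script once via `register_script` and invoke it with `keys=[...]` and `args=[...]`; the point of the script is that the GET and the DEL run as one atomic step on the server.)

```python
import uuid

locks = {}  # key -> holder's random value; stands in for the Redis key

def acquire(key):
    if key in locks:
        return None
    token = str(uuid.uuid4())  # your identity, stored in the lock (ARGV[1])
    locks[key] = token
    return token

def release(key, token):
    # Mirrors the Lua script: compare, then delete only on a match.
    # In real Redis these two steps MUST run as one atomic script;
    # a GET followed by a separate DEL reintroduces the race.
    if locks.get(key) == token:
        del locks[key]
        return 1
    return 0
```

A holder with a stale or wrong token gets `0` back and deletes nothing, which is exactly the Problem 3 fix.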
This prevents deleting someone else’s lock. But the slow job problem remains.
Redlock
Redis creator Antirez proposed Redlock. Instead of one Redis instance, use 5 independent instances.
To acquire a lock:
- Get the current time
- Try to acquire the lock on all 5 instances, with the same key and a random value
- The lock is acquired if you got it on a majority (3+) and the total time taken is less than the lock TTL
- If that fails, release the lock on all instances
The idea: even if 2 instances fail or have network issues, the majority still agree.
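Those four steps can be sketched as follows, with the five instances simulated as dicts. (A real client such as redlock-py adds per-instance timeouts, retries, and a clock-drift allowance on top of this skeleton.)

```python
import time
import uuid

# Five independent "Redis" instances, simulated as in-memory dicts.
instances = [dict() for _ in range(5)]

def redlock_acquire(key, ttl=10.0):
    token = str(uuid.uuid4())            # same random value everywhere
    start = time.monotonic()             # step 1: get current time
    acquired = 0
    for inst in instances:               # step 2: try every instance
        if key not in inst:              # SETNX against this instance
            inst[key] = token
            acquired += 1
    elapsed = time.monotonic() - start
    # Step 3: success needs a majority AND acquisition faster than the TTL.
    if acquired >= len(instances) // 2 + 1 and elapsed < ttl:
        return token
    for inst in instances:               # step 4: failed, release everywhere
        if inst.get(key) == token:
            del inst[key]
    return None
```

A second caller finds the key on all five instances, gets no majority, and comes back empty-handed.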
The controversy: Martin Kleppmann wrote a famous critique. Redlock assumes bounded clock drift and bounded network delays. In practice, those assumptions break: GC pauses, NTP jumps, and network partitions can all cause two processes to believe they hold the lock at the same time.
I’m not going to declare a winner. Both sides have valid points. The takeaway: distributed locks are fundamentally limited by the same impossibility we saw in the Two Generals Problem.
Fencing Tokens
Here’s what actually makes distributed locks safe: fencing tokens.
Every time a lock is acquired, generate a monotonically increasing token. The lock holder must include this token in all operations.
Lock acquired: token = 34
Lock acquired: token = 35 (new holder)
The resource being protected (database, API) rejects operations with old tokens.
Server-1 pauses. Its lock expires. Server-2 gets token 35. Server-1 wakes up, tries to write with token 34. Database rejects it.
The lock isn’t the safety mechanism. The token is.
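A minimal sketch of that idea: the lock service hands out strictly increasing tokens, and the protected resource (here a toy key-value store, names mine) rejects anything older than what it has already seen.

```python
import itertools

# The lock service's monotonic token source (starts at 34 to
# match the example above).
token_source = itertools.count(34)

def acquire_lock():
    return next(token_source)  # each grant gets a strictly larger token

class FencedStore:
    """The protected resource enforces safety, not the lock: any
    write carrying a token older than one already seen is rejected."""
    def __init__(self):
        self.highest_seen = -1
        self.data = {}

    def write(self, token, key, value):
        if token < self.highest_seen:
            raise PermissionError(f"stale fencing token {token}")
        self.highest_seen = token
        self.data[key] = value

store = FencedStore()
t1 = acquire_lock()   # server-1 gets token 34
t2 = acquire_lock()   # server-1's lock expired; server-2 gets token 35
store.write(t2, "report", "from-server-2")   # current holder: accepted
```

When the paused server-1 wakes up and writes with token 34, `FencedStore` raises, and the stale write never lands.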
When to Avoid Locks
Distributed locks are a last resort. Ask first:
Can you make it idempotent? If running twice produces the same result, maybe you don’t need a lock. Let all servers run, ignore duplicates.
Can you use a queue? Put jobs in a queue. One consumer processes each job. No lock needed.
Can you partition? Server-1 handles users A-M. Server-2 handles N-Z. No overlap, no lock.
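The partitioning idea, done with a stable hash instead of an alphabet split (a sketch; the server names are placeholders): every server computes the same owner for a given user, so each job runs exactly once with no coordination at all.

```python
import hashlib

SERVERS = ["server-1", "server-2"]

def owner(user_id: str) -> str:
    # A stable hash means every server independently agrees on
    # who handles which user. No coordination, no lock.
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]

def should_process(me: str, user_id: str) -> bool:
    return owner(user_id) == me
```

Each server calls `should_process(my_name, user_id)` and skips users it doesn't own; exactly one server processes each user.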
Locks add complexity, latency, and failure modes. Avoid them if you can.
What I’m Learning
I used to reach for locks whenever I needed “only one.” Now I treat them as a code smell.
The pattern that changed my thinking: if you need a lock, you probably need a fencing token too. The lock alone isn’t enough. And if you need fencing tokens, you’re building something complex. Is there a simpler design?
Sometimes the answer is still “yes, I need a lock.” Cron jobs, leader election, resource reservation. But I ask the question now instead of assuming.
Have you ever had a distributed lock fail in production? What broke?