You deploy your scheduled job across three instances for high availability. All three wake up at midnight and start the same batch process. Now you have triple the writes, conflicting updates, and a mess to clean up.

You need exactly one node to run the job. The others should wait and take over if it dies.

## Lease-Based Election

The simplest production approach: use a shared lock with a time limit (a lease).

```java
public boolean tryBecomeLeader(String jobName, String nodeId) {
    // SET key value NX EX 30: succeeds only if no one else holds the lease.
    Boolean acquired = redisTemplate.opsForValue()
        .setIfAbsent("leader:" + jobName, nodeId, Duration.ofSeconds(30));
    return Boolean.TRUE.equals(acquired);
}

// Renewal must be atomic: a plain GET followed by EXPIRE can race with
// expiry and extend a lease another node has just acquired.
private static final RedisScript<Long> RENEW = RedisScript.of(
    "if redis.call('GET', KEYS[1]) == ARGV[1] then " +
    "  return redis.call('EXPIRE', KEYS[1], ARGV[2]) " +
    "else return 0 end", Long.class);

public boolean renewLease(String jobName, String nodeId) {
    Long renewed = redisTemplate.execute(RENEW,
        List.of("leader:" + jobName), nodeId, "30");
    return Long.valueOf(1L).equals(renewed);
}
```

Node A calls SET NX and gets the lease. Nodes B and C try, fail, and back off. Node A renews the lease every 10 seconds, resetting the 30-second TTL. If Node A crashes, the lease expires after at most 30 seconds, and B and C compete for the new lease.
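The protocol above can be simulated without a real Redis. Here is a minimal in-memory sketch with a manually advanced clock; `LeaseTable` and its fields are hypothetical names, not part of any library:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory lease table illustrating acquire/renew/expire.
// Time is a plain field advanced by hand, so the failover is deterministic.
class LeaseTable {
    private final Map<String, String> holder = new HashMap<>();
    private final Map<String, Long> expiresAt = new HashMap<>();
    long now = 0; // milliseconds, advanced manually in this sketch

    // Succeeds only if no live lease exists for the job.
    boolean tryAcquire(String job, String nodeId, long ttlMs) {
        String current = holder.get(job);
        if (current != null && expiresAt.get(job) > now) return false; // held
        holder.put(job, nodeId);
        expiresAt.put(job, now + ttlMs);
        return true;
    }

    // Extends the lease only while the caller still holds a live lease.
    boolean renew(String job, String nodeId, long ttlMs) {
        if (!nodeId.equals(holder.get(job)) || expiresAt.get(job) <= now) return false;
        expiresAt.put(job, now + ttlMs);
        return true;
    }
}
```

Renewing at a third of the TTL (10 seconds against 30) gives the leader two chances to renew before losing the lease to a transient hiccup.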

```mermaid
%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#000000','primaryTextColor':'#00ff00','primaryBorderColor':'#00ff00','lineColor':'#00ff00','secondaryColor':'#000000','tertiaryColor':'#000000','noteBkgColor':'#000000','noteBorderColor':'#00ff00','noteTextColor':'#00ff00'}}}%%
sequenceDiagram
    autonumber
    participant A as Node A
    participant R as Redis
    participant B as Node B
    A->>R: SETNX leader:job nodeA TTL=30s
    R-->>A: OK (leader)
    B->>R: SETNX leader:job nodeB TTL=30s
    R-->>B: FAIL (not leader)
    Note over A: Processing...
    A->>R: EXPIRE leader:job 30s (renew)
    Note over A: Node A crashes
    Note over R: TTL expires after 30s
    B->>R: SETNX leader:job nodeB TTL=30s
    R-->>B: OK (new leader)
```

## The Split-Brain Problem

Network partition: Node A can’t reach Redis but is still running. Redis expires the lease. Node B acquires it. Now both A and B think they’re leader.

The fix: fencing tokens. Each time a lease is acquired, increment a monotonic counter. The leader includes this token in every write. Downstream systems reject writes with older tokens. Node A has token 5, Node B gets token 6. Any writes from A with token 5 are rejected.
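A minimal in-memory sketch of the idea, with hypothetical class names (in a real deployment the token would come from Redis `INCR` on acquire, and the rejection check would live in the downstream store):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical lock service: each acquisition returns a strictly larger token.
class FencedLockService {
    private final AtomicLong tokenCounter = new AtomicLong();

    long acquire() {
        return tokenCounter.incrementAndGet();
    }
}

// Hypothetical downstream store: rejects writes carrying a token older than
// any token it has already accepted.
class FencedStore {
    private long highestTokenSeen = 0;
    private final Map<String, String> data = new HashMap<>();

    synchronized boolean write(long token, String key, String value) {
        if (token < highestTokenSeen) return false; // stale leader, reject
        highestTokenSeen = token;
        data.put(key, value);
        return true;
    }
}
```

The guarantee depends on every downstream system enforcing the check; a single store that ignores tokens reopens the split-brain window.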

This connects to Raft consensus, which solves the same problem more rigorously. Raft’s leader election uses term numbers (essentially fencing tokens) and requires majority agreement. More guarantees, but heavier machinery than a Redis lease.

## When You Don’t Need a Leader

Not every job needs leader election. If the job is idempotent, running it on multiple nodes simultaneously might be fine (duplicated work but correct results). Leader election matters when duplicate execution causes problems: sending duplicate notifications, processing payments twice, writing conflicting state.
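For the cases where duplicates are dangerous but full leader election is overkill, an idempotency key can make duplicate execution harmless. A hedged sketch (the class and method names are illustrative; production code would persist the key set in a shared store, not process memory):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: dedupe on an idempotency key so that the same
// notification is sent at most once, no matter how many nodes run the job.
class NotificationSender {
    private final Set<String> sent = ConcurrentHashMap.newKeySet();

    // Runs the action only for the first caller with a given key.
    boolean sendOnce(String idempotencyKey, Runnable send) {
        if (!sent.add(idempotencyKey)) return false; // duplicate, no-op
        send.run();
        return true;
    }
}
```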

At Oracle, we had exactly this problem. Two NSSF instances both ran a config sync job at 2 AM. Both read the same source data, both wrote to the same target tables. Most of the time it was fine because the writes were the same. But occasionally they’d interleave and one instance would overwrite the other’s partial update. Added a Redis lease: only the lease holder starts the sync. The standby instance checks every 30 seconds and takes over if the lease expires. No more conflicting writes.

## What I’m Learning

Leader election is one of those problems that seems simple until you think about failure modes. Getting the lease is easy. Handling the case where the leader is partitioned but still alive is where fencing tokens and consensus protocols earn their complexity. For most applications, a Redis lease with a reasonable TTL is good enough. For critical systems, consider something backed by Raft or ZooKeeper.

How does your team handle single-leader guarantees for scheduled jobs?