Your batch job processes 100,000 records. At record 87,000 it crashes. OOM, network timeout, pod eviction. Without checkpointing, you restart from record 1. With checkpointing, you restart from record 86,000.

The difference between losing three hours of work and losing ten minutes.

## The Pattern

Periodically save your position to durable storage. On restart, read the last checkpoint and resume from there.

```java
public class CheckpointedProcessor {
    private static final long CHECKPOINT_INTERVAL = 1_000;

    private final String jobId;
    private final CheckpointRepository checkpointRepo;
    private final long totalRecords;

    public CheckpointedProcessor(String jobId, CheckpointRepository repo, long totalRecords) {
        this.jobId = jobId;
        this.checkpointRepo = repo;
        this.totalRecords = totalRecords;
    }

    public void run() {
        // Resume one past the last checkpoint; a fresh run starts at 0.
        long startFrom = checkpointRepo.getPosition(jobId).map(p -> p + 1).orElse(0L);

        for (long offset = startFrom; offset < totalRecords; offset++) {
            process(offset);

            if (offset % CHECKPOINT_INTERVAL == 0) {
                checkpointRepo.save(jobId, offset);
            }
        }
        checkpointRepo.delete(jobId); // job complete
    }

    private void process(long offset) { /* domain-specific work */ }
}
```

Simple. The complexity is in choosing CHECKPOINT_INTERVAL.
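The checkpointRepo can be anything durable: a database table, a key-value store, even a file. A minimal sketch of what it might look like (the `CheckpointRepository` interface and `InMemoryCheckpointRepository` names are my own; a real job would back this with a database table keyed by job ID, not an in-memory map):

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical interface behind the checkpointRepo in the example above.
interface CheckpointRepository {
    Optional<Long> getPosition(String jobId);
    void save(String jobId, long offset);
    void delete(String jobId);
}

// In-memory stand-in, useful for tests; production needs durable storage.
class InMemoryCheckpointRepository implements CheckpointRepository {
    private final Map<String, Long> positions = new ConcurrentHashMap<>();

    public Optional<Long> getPosition(String jobId) {
        return Optional.ofNullable(positions.get(jobId));
    }

    public void save(String jobId, long offset) {
        positions.put(jobId, offset);
    }

    public void delete(String jobId) {
        positions.remove(jobId);
    }
}
```

The interface is deliberately tiny: one read, one write, one cleanup. Everything else lives in the job loop.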

## The Frequency Trade-off

Checkpoint every record? Zero wasted work on crash, but you’re writing to the database on every iteration. That overhead can double your job’s runtime.

Checkpoint every 10,000 records? Minimal overhead, but a crash costs you up to 10,000 records of reprocessing.

The right interval depends on how expensive each record is to process and how often crashes happen. For most batch jobs, checkpointing every few minutes or every N thousand records is the sweet spot.
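One way to get both bounds is to trigger on whichever comes first, a record count or a wall-clock interval. A sketch of that policy (the `CheckpointPolicy` name and thresholds are illustrative, not from the original example):

```java
// Decides when to checkpoint: after N records or T elapsed millis, whichever comes first.
class CheckpointPolicy {
    private final long recordInterval;
    private final long timeIntervalMillis;
    private long recordsSinceCheckpoint = 0;
    private long lastCheckpointAt;

    CheckpointPolicy(long recordInterval, long timeIntervalMillis, long nowMillis) {
        this.recordInterval = recordInterval;
        this.timeIntervalMillis = timeIntervalMillis;
        this.lastCheckpointAt = nowMillis;
    }

    // Call once per processed record; true means "save a checkpoint now".
    boolean shouldCheckpoint(long nowMillis) {
        recordsSinceCheckpoint++;
        if (recordsSinceCheckpoint >= recordInterval
                || nowMillis - lastCheckpointAt >= timeIntervalMillis) {
            recordsSinceCheckpoint = 0;
            lastCheckpointAt = nowMillis;
            return true;
        }
        return false;
    }
}
```

The count bound caps reprocessing after a crash; the time bound keeps checkpoints fresh when records are slow to process.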

```mermaid
graph TD
    S[Start / Resume] --> R[Read Last Checkpoint]
    R --> P[Process Records]
    P --> C{Checkpoint Interval?}
    C -->|Yes| SAVE[Save Position to DB]
    C -->|No| P
    SAVE --> P
    P --> CR{Crash?}
    CR -->|Yes| S
    CR -->|No| DONE[Job Complete]
    style S fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style R fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style P fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style C fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style SAVE fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style CR fill:#000000,stroke:#ff0000,stroke-width:2px,color:#fff
    style DONE fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
```

## You’ve Already Seen Checkpointing

Kafka consumer offsets? Checkpointing. The consumer reads messages, processes them, and periodically commits the offset. On restart, it reads from the last committed offset, not from the beginning of the topic.

CDC log positions? Same pattern. The CDC reader tracks its position in the database’s change log. Crash and resume from the last saved position.

Even write-ahead logging is a form of checkpointing: the database saves its state so it can recover after a crash.

## The Exactly-Once Problem

You process record 87,000 and crash before saving the checkpoint. On restart, you reprocess 87,000. If processing has side effects (sending an email, updating a balance), you’ve done it twice. Checkpointing gives you at-least-once processing, not exactly-once. You still need idempotency to handle the overlap.
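One common way to make the overlap safe is an idempotency key: record which offsets have already produced their side effect, and skip replays. A minimal sketch (the `IdempotentProcessor` name is my own, and a real system would persist the key set transactionally alongside the side effect rather than in memory):

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.LongConsumer;

// Wraps a side-effecting action so offsets replayed after a crash become no-ops.
class IdempotentProcessor {
    private final Set<Long> processed = new HashSet<>(); // durable store in production
    private final LongConsumer sideEffect;

    IdempotentProcessor(LongConsumer sideEffect) {
        this.sideEffect = sideEffect;
    }

    // Returns true if the side effect ran, false if this offset was a replay.
    boolean process(long offset) {
        if (!processed.add(offset)) {
            return false; // already handled before the crash
        }
        sideEffect.accept(offset);
        return true;
    }
}
```

Checkpointing bounds *how much* you replay; idempotency makes the replay harmless. You usually need both.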

At Oracle, we had a bulk migration job that moved thousands of NSSF configs between environments. It ran for hours during maintenance windows. First time it crashed mid-run, we restarted from scratch and missed our window. Added checkpointing to MySQL: save the last processed config ID every 500 records. Next crash, we resumed in under a minute. The checkpoint table was one row per job. Took maybe 20 minutes to implement. Should have done it from the start.

## What I’m Learning

Checkpointing is one of those patterns that feels unnecessary until the first crash. Then it’s the only thing that matters. The implementation is usually trivial (save a number to a database). The hard part is remembering to build it in before you need it.

Do your batch jobs checkpoint their progress, or do they restart from scratch?