You have 1,000 tasks and 4 worker threads. Split evenly: 250 each. Sounds fair. But task sizes aren’t uniform. Thread 1 gets 250 tiny tasks and finishes in a second. Thread 3 gets 250 heavy tasks and takes a minute. Threads 1 and 2 sit idle while 3 and 4 grind.

Static partitioning assumes equal task sizes. Work stealing doesn’t.

How It Works#

Each worker has its own deque (double-ended queue). Workers push and pop tasks from the front of their own deque. When a worker’s deque is empty, it steals from the back of another worker’s deque.

public class WorkStealingPool {
    private final Deque<Runnable>[] queues;
    private final Thread[] workers;

    private void workerLoop(int workerId) {
        while (running) {
            Runnable task = queues[workerId].pollFirst(); // own queue, front
            if (task == null) {
                task = stealFromOther(workerId);           // other queue, back
            }
            if (task != null) {
                task.run();
            }
        }
    }

    private Runnable stealFromOther(int thiefId) {
        for (int i = 0; i < queues.length; i++) {
            if (i != thiefId) {
                Runnable stolen = queues[i].pollLast(); // steal from back
                if (stolen != null) return stolen;
            }
        }
        return null;
    }
}
graph TD W1["Worker 1: queue empty"] -->|Steals| W3B["Worker 3: back of queue"] W2["Worker 2: queue empty"] -->|Steals| W4B["Worker 4: back of queue"] W3["Worker 3: processing front"] --> W3B W4["Worker 4: processing front"] --> W4B style W1 fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff style W2 fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff style W3 fill:#000000,stroke:#ff0000,stroke-width:2px,color:#fff style W4 fill:#000000,stroke:#ff0000,stroke-width:2px,color:#fff style W3B fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff style W4B fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff

Why Front and Back?#

The owner takes from the front (LIFO). The thief steals from the back (FIFO). This minimizes contention: owner and thief rarely touch the same end of the deque. It also has a nice property: tasks at the back tend to be larger (they were added earlier, before being subdivided), so the thief gets a big chunk of work instead of a tiny task that’ll run out immediately.

Java’s ForkJoinPool#

You don’t need to build this yourself. Java’s ForkJoinPool implements work stealing. It’s what powers parallel streams and CompletableFuture.supplyAsync().

ForkJoinPool pool = new ForkJoinPool(4);
List<CompletableFuture<Void>> futures = tasks.stream()
    .map(task -> CompletableFuture.runAsync(task, pool))
    .toList();
CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();

The pool automatically balances load across workers. No manual partitioning needed.

When Not to Use It#

Work stealing adds overhead: stealing requires locking or CAS on another thread’s deque. If tasks are roughly equal in size, static partitioning is simpler and faster. Work stealing shines when task durations are unpredictable, which is most real-world workloads.

It’s also different from a centralized task queue where all workers pull from one place. Centralized queues create contention at the queue. Work stealing distributes the contention: each worker primarily uses its own queue, and stealing is the exception, not the norm.

At Oracle, we had a bulk validation job that split NSSF configs evenly across 4 threads. Some configs had 3 parameters, others had 50+. Two threads would finish in seconds while the others ran for minutes. Switched to a ForkJoinPool with work stealing. The fast threads automatically helped with the remaining heavy configs. Total job time dropped by about 40%. The code change was minimal: replace our ExecutorService with a ForkJoinPool.

What I’m Learning#

Work stealing is a reminder that static partitioning is an assumption, not a guarantee. Any time you split work “evenly” and observe uneven completion times, you have a load balancing problem that work stealing can solve. The beauty of ForkJoinPool is that you get this for free in Java without implementing the deque mechanics yourself.

Have you used work stealing in your systems, or do you stick with fixed partitioning?