State Machines: Making Distributed Workflows Predictable

You have a workflow: create, process, complete. You model it with boolean flags: isProcessed, isCompleted, isFailed. Then someone asks: can a record be both processed and failed? Your code says yes. Your business logic says no. Welcome to impossible states.

Explicit States, Explicit Transitions

Replace flags with a single state field and a set of valid transitions.

import java.util.Map;
import java.util.Set;

public enum WorkflowState {
    CREATED, PROCESSING, COMPLETED, FAILED;

    // Allowed targets for each source state; anything not listed is rejected.
    private static final Map<WorkflowState, Set<WorkflowState>> TRANSITIONS = Map.of(
        CREATED,    Set.of(PROCESSING, FAILED),
        PROCESSING, Set.of(COMPLETED, FAILED),
        COMPLETED,  Set.of(),       // terminal
        FAILED,     Set.of(CREATED)  // allow retry
    );

    // Returns the target state if the move is legal, otherwise throws.
    public WorkflowState transition(WorkflowState target) {
        if (!TRANSITIONS.get(this).contains(target)) {
            throw new IllegalStateException(this + " -> " + target + " not allowed");
        }
        return target;
    }
}

Now the code enforces business rules. CREATED can’t jump to COMPLETED. COMPLETED can’t go to FAILED. The compiler doesn’t catch this, but the runtime does, immediately and loudly.
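To see the rules in action, here is a small usage sketch. WorkItem and moveTo are hypothetical names I made up for illustration, and the enum is repeated only so the snippet compiles on its own.

```java
import java.util.Map;
import java.util.Set;

// Same enum as above, repeated so this snippet is self-contained.
enum WorkflowState {
    CREATED, PROCESSING, COMPLETED, FAILED;

    private static final Map<WorkflowState, Set<WorkflowState>> TRANSITIONS = Map.of(
        CREATED,    Set.of(PROCESSING, FAILED),
        PROCESSING, Set.of(COMPLETED, FAILED),
        COMPLETED,  Set.of(),
        FAILED,     Set.of(CREATED)
    );

    WorkflowState transition(WorkflowState target) {
        if (!TRANSITIONS.get(this).contains(target)) {
            throw new IllegalStateException(this + " -> " + target + " not allowed");
        }
        return target;
    }
}

// Hypothetical entity: its state can only change through transition().
class WorkItem {
    private WorkflowState state = WorkflowState.CREATED;

    WorkflowState state() {
        return state;
    }

    void moveTo(WorkflowState target) {
        // transition() throws before the assignment, so an illegal move
        // leaves the current state untouched.
        state = state.transition(target);
    }
}
```

Calling moveTo(PROCESSING) and then moveTo(COMPLETED) on a fresh WorkItem succeeds. Calling moveTo(COMPLETED) directly on a fresh one throws IllegalStateException and the item stays in CREATED.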

graph TD
    C[CREATED] --> P[PROCESSING]
    C --> F[FAILED]
    P --> CO[COMPLETED]
    P --> F
    F --> C
    style C fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style P fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style CO fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style F fill:#000000,stroke:#ff0000,stroke-width:2px,color:#fff

Recovery

When a service crashes mid-workflow, where does it pick up? With boolean flags, you’re guessing. With a state machine, you query: “show me everything in PROCESSING state that hasn’t been updated in 5 minutes.” Those are your stuck items. Move them to FAILED for investigation, and from there back to CREATED for retry. The transition table above has no direct PROCESSING -> CREATED edge, so a retry either goes through FAILED or needs that edge added explicitly.
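The stuck-item query can be sketched roughly like this. It is an in-memory stand-in, not a real persistence layer: WorkflowRow, StuckItemFinder, and the updatedAt field are assumptions for illustration, and in a real system the filter would likely be a database query on state and last-update time.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

enum State { CREATED, PROCESSING, COMPLETED, FAILED }

// Hypothetical persisted row: id, current state, last update time.
final class WorkflowRow {
    final String id;
    final State state;
    final Instant updatedAt;

    WorkflowRow(String id, State state, Instant updatedAt) {
        this.id = id;
        this.state = state;
        this.updatedAt = updatedAt;
    }
}

final class StuckItemFinder {
    // In-memory equivalent of a query like:
    //   WHERE state = 'PROCESSING' AND updated_at < (now - threshold)
    static List<WorkflowRow> findStuck(List<WorkflowRow> rows, Instant now, Duration threshold) {
        Instant cutoff = now.minus(threshold);
        return rows.stream()
            .filter(r -> r.state == State.PROCESSING)
            .filter(r -> r.updatedAt.isBefore(cutoff))
            .collect(Collectors.toList());
    }
}
```

Passing now in as a parameter, instead of calling Instant.now() inside, keeps the check deterministic and easy to test.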

This is different from event sourcing, which replays the full history to reconstruct state. A state machine just stores the current state and transitions forward. Simpler, but you lose the audit trail (unless you log transitions separately).
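If you do want the audit trail, logging transitions separately could look something like the sketch below. TransitionEntry and TransitionLog are hypothetical names, and the list is an in-memory placeholder for whatever append-only table or log you would actually use.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

enum State { CREATED, PROCESSING, COMPLETED, FAILED }

// Hypothetical audit entry: one immutable row per state change.
final class TransitionEntry {
    final String workflowId;
    final State from;
    final State to;
    final Instant at;

    TransitionEntry(String workflowId, State from, State to, Instant at) {
        this.workflowId = workflowId;
        this.from = from;
        this.to = to;
        this.at = at;
    }
}

// Append-only log kept alongside the current-state field.
final class TransitionLog {
    private final List<TransitionEntry> entries = new ArrayList<>();

    void record(String workflowId, State from, State to, Instant at) {
        entries.add(new TransitionEntry(workflowId, from, to, at));
    }

    // Entries for one workflow, in the order they were recorded.
    List<TransitionEntry> historyOf(String workflowId) {
        return entries.stream()
            .filter(e -> e.workflowId.equals(workflowId))
            .collect(Collectors.toList());
    }
}
```

Unlike event sourcing, the log here is never replayed to rebuild state. The current state still lives in its own field, and the log only answers "how did it get here?".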

State Machines and Sagas

A saga is a state machine with compensation. Each step has a forward action and a rollback action. The saga coordinator tracks which state the workflow is in and knows exactly which compensations to run on failure. Without explicit states, compensating “whatever happened” is error-prone.
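A minimal sketch of that idea, assuming synchronous steps in a single process: SagaStep, SagaState, and SagaCoordinator are names invented for this example, and a real coordinator would also persist its progress and handle compensations that fail.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical step: a forward action plus the compensation that undoes it.
final class SagaStep {
    final String name;
    final Runnable action;
    final Runnable compensation;

    SagaStep(String name, Runnable action, Runnable compensation) {
        this.name = name;
        this.action = action;
        this.compensation = compensation;
    }
}

enum SagaState { RUNNING, COMPLETED, COMPENSATED }

final class SagaCoordinator {
    private SagaState state = SagaState.RUNNING;

    SagaState state() {
        return state;
    }

    // Runs steps in order. On the first failure, compensates only the steps
    // that already completed, most recent first.
    SagaState run(List<SagaStep> steps) {
        Deque<SagaStep> completed = new ArrayDeque<>();
        for (SagaStep step : steps) {
            try {
                step.action.run();
                completed.push(step);
            } catch (RuntimeException e) {
                while (!completed.isEmpty()) {
                    completed.pop().compensation.run();
                }
                state = SagaState.COMPENSATED;
                return state;
            }
        }
        state = SagaState.COMPLETED;
        return state;
    }
}
```

The stack of completed steps is the explicit state here: the coordinator never has to guess what happened, it only undoes what it recorded as done. The step that failed is not compensated, on the assumption that a failed action left nothing behind.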

At Oracle, the NSSF registration workflow was originally a chain of if-else blocks checking isRegistered, isConfigured, isActive. Three booleans give eight possible combinations, and only four of them were valid. We refactored to an enum state machine with five explicit states. Bugs from impossible state combinations disappeared. The dead-letter queue (DLQ) entries dropped too, because we stopped processing records in invalid states.

What I’m Learning

State machines aren’t fancy. They’re just enums with rules. But they make workflows debuggable. When something fails, you know exactly where it was. When you add a new step, you know exactly which transitions to add. The alternative, scattered boolean checks and string comparisons, works until it doesn’t. And it usually stops working at 3 AM.

How do you model multi-step workflows in your system?