Distributed ID Generation: Snowflake and Friends

Single database, auto-increment primary key. ID 1, 2, 3, 4. Simple. Unique. Ordered.

Now shard across 4 databases. Each one auto-increments independently. Shard A generates 1, 2, 3. Shard B generates 1, 2, 3. You now have duplicate IDs across the system.

The Options#

UUID (v4): 128-bit random value. Collision probability is astronomically low. No coordination needed.

String id = UUID.randomUUID().toString();
// "f47ac10b-58cc-4372-a567-0e02b2c3d479"

Problem: 36 characters, not sortable by time, poor index performance in MySQL (random values fragment B-tree indexes).

Database sequences with offsets: Shard A starts at 1, increments by 4. Shard B starts at 2, increments by 4. No collisions. But adding a 5th shard requires changing all the increment values. Fragile.

Snowflake IDs: The sweet spot.

Snowflake#

Twitter created this. A 64-bit integer that encodes three things:

| 41 bits: timestamp (ms since epoch) | 10 bits: machine ID | 12 bits: sequence |

Timestamp: Sortable by time. Newer IDs are always larger.
Machine ID: Each generator gets a unique ID (0-1023). No coordination between machines.
Sequence: 4096 IDs per millisecond per machine. Overflow? Wait for the next millisecond.

public class SnowflakeGenerator {
    private final long machineId;
    private long lastTimestamp = -1;
    private long sequence = 0;

    public synchronized long nextId() {
        long timestamp = System.currentTimeMillis();

        if (timestamp == lastTimestamp) {
            sequence = (sequence + 1) & 0xFFF; // 12-bit mask
            if (sequence == 0) {
                while (timestamp <= lastTimestamp) {
                    timestamp = System.currentTimeMillis();
                }
            }
        } else {
            sequence = 0;
        }

        lastTimestamp = timestamp;
        return (timestamp << 22) | (machineId << 12) | sequence;
    }
}

graph TD S[Snowflake ID: 64 bits] --> T[41 bits: Timestamp] S --> M[10 bits: Machine ID] S --> Q[12 bits: Sequence] T --> TS[Sortable by time] M --> NC[No coordination needed] Q --> TH[4096 IDs/ms/machine] style S fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff style T fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff style M fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff style Q fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff style TS fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff style NC fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff style TH fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff

The Clock Problem#

Snowflake depends on the system clock. If the clock jumps backward (NTP correction, VM migration), you might generate duplicate IDs. Twitter’s implementation throws an exception on clock regression. Others use a monotonic clock or tolerate a small backward drift.

This connects back to why Lamport timestamps exist: physical clocks aren’t reliable in distributed systems. Snowflake accepts this risk because clock jumps are rare and the alternative (coordinated ID generation) is slower.

When to Use What#

Approach	Sortable	Size	Coordination	Best For
UUID v4	No	128 bits	None	Simple systems, low volume
DB sequence	Yes	64 bits	Per-shard	Single-shard writes
Snowflake	Yes	64 bits	Machine ID only	High-throughput distributed

At Oracle, we used database sequences initially. Worked fine until we sharded the notifications table. Switched to a Snowflake-style generator. The time-sortable property was a bonus: we could paginate by ID instead of needing a separate timestamp index.

What I’m Learning#

ID generation seems trivial until you distribute it. The constraints are contradictory: globally unique, locally fast, time-sortable, compact. Snowflake threads the needle by encoding enough information in 64 bits to avoid coordination while maintaining order.

The machine ID assignment is the one coordination point. Most teams use ZooKeeper, a config file, or just hardcode it per deployment. Not elegant, but it works.

What ID strategy does your system use? Have you hit the UUID index performance problem?