Base62 Encoding: Turning Numbers into Short Strings

You have an ID: 192847561038. Twelve digits. Ugly in a URL, hard to share, impossible to remember.

Same number in base62: 3ov6IcS. Seven characters. Clean, compact, URL-safe.

Base62 encoding is how systems turn large numeric IDs into short, human-friendly strings.

Why Base62

Base10 uses digits 0-9 (10 characters). Base16 (hex) uses 0-9 and a-f (16 characters). Base62 uses 0-9, a-z, and A-Z (62 characters). More characters per position means fewer positions needed.

Base62: 62^1 =                62 values in 1 char
Base62: 62^7 = 3,521,614,606,208 values in 7 chars

Seven characters of base62 give you 3.5 trillion unique values. That’s enough for most systems.
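To sanity-check those numbers, here is a minimal sketch of the capacity arithmetic. The class and method names (`Base62Capacity`, `capacity`, `charsNeeded`) are made up for illustration and are not part of any library.

```java
import java.math.BigInteger;

final class Base62Capacity {
    private static final BigInteger BASE = BigInteger.valueOf(62);

    // Number of distinct values that fit in `length` base62 characters: 62^length.
    static BigInteger capacity(int length) {
        return BASE.pow(length);
    }

    // Smallest number of base62 characters that can represent every ID in [0, maxId].
    static int charsNeeded(long maxId) {
        if (maxId < 0) {
            throw new IllegalArgumentException("maxId must be non-negative");
        }
        int length = 1;
        long remaining = maxId / 62;
        while (remaining > 0) {
            length++;
            remaining /= 62;
        }
        return length;
    }

    public static void main(String[] args) {
        for (int length = 1; length <= 7; length++) {
            System.out.printf("%d chars: %,d values%n", length, capacity(length));
        }
        // A full signed 64-bit ID needs more than seven characters.
        System.out.println(charsNeeded(Long.MAX_VALUE)); // prints 11
    }
}
```

The takeaway: seven characters cover IDs up to 3,521,614,606,207. Anything larger needs an eighth character, and a full 64-bit value needs eleven.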

Why not base64? Standard Base64 includes + and /, which need percent-encoding in URLs, and it pads with =. Base62 is purely alphanumeric, so it drops into paths, query strings, and filenames without escaping.
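A small sketch makes the escaping problem visible. It uses only the JDK's `java.util.Base64` and `java.net.URLEncoder` (the `Charset` overload needs Java 10 or later). The class name and the two sample bytes are arbitrary choices for the demo, picked so that both problem characters show up.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class Base64UrlDemo {
    // Two bytes chosen so that standard Base64 emits both '+' and '/'.
    static final byte[] SAMPLE = {(byte) 0xfb, (byte) 0xff};

    // Standard Base64 alphabet, with padding.
    static String standard() {
        return Base64.getEncoder().encodeToString(SAMPLE);
    }

    // What the standard output turns into once it is escaped for a URL.
    static String percentEncoded() {
        return URLEncoder.encode(standard(), StandardCharsets.UTF_8);
    }

    // The URL-safe Base64 variant swaps in '-' and '_' instead.
    static String urlSafe() {
        return Base64.getUrlEncoder().withoutPadding().encodeToString(SAMPLE);
    }

    public static void main(String[] args) {
        System.out.println(standard());       // prints +/8=
        System.out.println(percentEncoded()); // prints %2B%2F8%3D
        System.out.println(urlSafe());        // prints -_8
    }
}
```

The URL-safe variant avoids escaping, but it still uses - and _, so it is not purely alphanumeric the way base62 is.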

The Implementation

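public class Base62 {
    private static final String ALPHABET =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    private static final int BASE = ALPHABET.length(); // 62

    /** Encodes a non-negative number as a base62 string. */
    public static String encode(long number) {
        if (number < 0) {
            throw new IllegalArgumentException("number must be non-negative: " + number);
        }
        if (number == 0) return "0";
        StringBuilder sb = new StringBuilder();
        // Remainders come out least significant digit first, so reverse at the end.
        while (number > 0) {
            sb.append(ALPHABET.charAt((int) (number % BASE)));
            number /= BASE;
        }
        return sb.reverse().toString();
    }

    /** Decodes a base62 string back to its number. */
    public static long decode(String encoded) {
        long result = 0;
        for (char c : encoded.toCharArray()) {
            int digit = ALPHABET.indexOf(c);
            if (digit < 0) {
                throw new IllegalArgumentException("invalid base62 character: " + c);
            }
            result = result * BASE + digit;
        }
        return result;
    }
}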

Encode divides repeatedly by 62, mapping each remainder to a character. Decode walks the string left to right, multiplying the running total by 62 and adding each character's index. Deterministic, lossless, fast.
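To make the repeated division concrete, here is a sketch that records each remainder for the sample ID from the intro and then round-trips it. `Base62Trace` and `remainders` are illustrative names. The alphabet is the same digits, lowercase, uppercase ordering used above.

```java
import java.util.ArrayList;
import java.util.List;

final class Base62Trace {
    private static final String ALPHABET =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Remainders produced by repeated division by 62, least significant first.
    static List<Integer> remainders(long number) {
        if (number < 0) {
            throw new IllegalArgumentException("number must be non-negative");
        }
        List<Integer> out = new ArrayList<>();
        do {
            out.add((int) (number % 62));
            number /= 62;
        } while (number > 0);
        return out;
    }

    // Maps the remainders to characters, most significant digit first.
    static String encode(long number) {
        List<Integer> digits = remainders(number);
        StringBuilder sb = new StringBuilder();
        for (int i = digits.size() - 1; i >= 0; i--) {
            sb.append(ALPHABET.charAt(digits.get(i)));
        }
        return sb.toString();
    }

    // Rebuilds the number by multiplying by 62 and adding each digit.
    static long decode(String encoded) {
        long result = 0;
        for (char c : encoded.toCharArray()) {
            int digit = ALPHABET.indexOf(c);
            if (digit < 0) {
                throw new IllegalArgumentException("invalid base62 character: " + c);
            }
            result = result * 62 + digit;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(remainders(192847561038L)); // prints [54, 12, 44, 6, 31, 24, 3]
        System.out.println(encode(192847561038L));     // prints 3ov6IcS
        System.out.println(decode("3ov6IcS"));         // prints 192847561038
    }
}
```

Read the remainders right to left and look each one up in the alphabet: 3 is "3", 24 is "o", 31 is "v", and so on.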

ID-Based vs Hash-Based

Two approaches to generating short codes:

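ID-based: Generate a Snowflake ID, encode it in base62. Guaranteed unique because the ID is unique. Deterministic mapping. The code length follows the size of the ID: values below 62^7 (about 3.5 trillion) fit in 7 characters, while a full 64-bit ID can take up to 11.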

Hash-based: Hash the input content (MD5, SHA-256), take the first 7 characters. Problem: collisions. Two different inputs can produce the same 7-character prefix. You need collision detection.

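// Hash approach: collision risk
// DigestUtils comes from Apache Commons Codec (org.apache.commons.codec.digest.DigestUtils).
String hash = DigestUtils.md5Hex(content).substring(0, 7); // 7 hex chars = 16^7 possible codes
// What if this collides with an existing code?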
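How soon does that collision show up? A rough way to estimate it is the birthday-bound approximation n ≈ sqrt(2 · N · ln 2), which gives the number of codes at which the chance of at least one collision reaches about 50%. The sketch below assumes the truncated hash is uniformly distributed. `CollisionEstimate` and `codesForEvenOdds` are names invented for this example.

```java
final class CollisionEstimate {
    // Birthday-bound approximation: how many uniformly random codes can be drawn
    // from a space of `spaceSize` values before the probability of at least one
    // collision reaches roughly 50%.
    static double codesForEvenOdds(double spaceSize) {
        return Math.sqrt(2.0 * spaceSize * Math.log(2.0));
    }

    public static void main(String[] args) {
        // Seven hex characters, as in the truncated MD5 snippet: 16^7 codes.
        System.out.printf("hex, 7 chars:    %.0f%n", codesForEvenOdds(Math.pow(16, 7)));
        // Seven base62 characters: 62^7 codes.
        System.out.printf("base62, 7 chars: %.0f%n", codesForEvenOdds(Math.pow(62, 7)));
    }
}
```

Under that assumption, seven hex characters reach even odds after roughly nineteen thousand codes, and seven base62 characters after roughly two million. Either way, collisions are a matter of when, not if.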

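You could check for collisions with a Bloom filter before inserting: a negative answer means the code is definitely unused, while a positive one may be a false positive and still needs a database check. But the ID-based approach avoids this entirely.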
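For comparison, here is a minimal sketch of what collision handling looks like on the hash-based path. Everything in it is an assumption made for illustration: the `HashShortener` class, the in-memory `HashMap` standing in for the database table, the salt-and-retry strategy, and the limit of 10 attempts. It uses the JDK's SHA-256 rather than Commons Codec so that it runs without extra dependencies. A Bloom filter could sit in front of the `store` lookup, but it is left out here.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

final class HashShortener {
    private static final String ALPHABET =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    private static final long SPACE = 3_521_614_606_208L; // 62^7

    // Stand-in for the database table: short code -> original content.
    final Map<String, String> store = new HashMap<>();

    // Returns a 7-character code for the content, retrying with a new salt on collision.
    String shorten(String content) {
        for (int attempt = 0; attempt < 10; attempt++) {
            String code = codeFor(content, attempt);
            String existing = store.putIfAbsent(code, content);
            if (existing == null || existing.equals(content)) {
                return code;
            }
            // Different content already owns this code: try the next salt.
        }
        throw new IllegalStateException("no free code after 10 attempts");
    }

    // Hashes "attempt:content", folds the first 8 digest bytes into [0, 62^7),
    // and writes the result as exactly 7 base62 characters.
    static String codeFor(String content, int attempt) {
        byte[] digest = sha256(attempt + ":" + content);
        long value = 0;
        for (int i = 0; i < 8; i++) {
            value = (value << 8) | (digest[i] & 0xff);
        }
        value = Long.remainderUnsigned(value, SPACE);
        char[] out = new char[7];
        for (int i = 6; i >= 0; i--) {
            out[i] = ALPHABET.charAt((int) (value % 62));
            value /= 62;
        }
        return new String(out);
    }

    private static byte[] sha256(String s) {
        try {
            return MessageDigest.getInstance("SHA-256")
                .digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        HashShortener shortener = new HashShortener();
        System.out.println(shortener.shorten("https://example.com/some/long/path"));
    }
}
```

That is a retry loop, a lookup per attempt, and a failure mode to handle. The ID-based path needs none of it: encode the ID and you are done.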

graph TD
    ID[Snowflake ID: 192847561038] --> E[Base62 Encode]
    E --> S["3ov6IcS"]
    S --> D[Base62 Decode]
    D --> ID2[192847561038]
    H[Hash of Content] --> T[Truncate to 7 chars]
    T --> COL{Collision?}
    COL -->|Yes| RE[Rehash or append]
    COL -->|No| S2["x8Km4pQ"]
    style ID fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style E fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style S fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style D fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style ID2 fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style H fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style T fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style COL fill:#000000,stroke:#ff0000,stroke-width:2px,color:#fff
    style RE fill:#000000,stroke:#ff0000,stroke-width:2px,color:#fff
    style S2 fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff

Read Path Optimization

Short code comes in. Decode to numeric ID. Look up in database. The numeric ID is the primary key, so the lookup is an index seek. Fast.

But for high-traffic codes, the database call is unnecessary. Put a cache in front. Popular codes get served from Redis. The cache hit rate for this pattern is typically very high because a small percentage of codes get most of the traffic.
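The read path can be sketched as a cache-aside lookup. This is a simplified illustration, not a production setup: a `ConcurrentHashMap` stands in for Redis, a `LongFunction` stands in for the primary-key query, and the `ReadPath` class with its `resolve` method is a hypothetical name. Expiry and eviction are left out.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongFunction;

final class ReadPath {
    private static final String ALPHABET =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Stand-in for Redis: short code -> stored value.
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    // Stand-in for the database: numeric primary key -> stored value, if any.
    private final LongFunction<Optional<String>> database;

    ReadPath(LongFunction<Optional<String>> database) {
        this.database = database;
    }

    // Cache hit: return immediately. Cache miss: decode, look up by ID, fill the cache.
    Optional<String> resolve(String code) {
        String cached = cache.get(code);
        if (cached != null) {
            return Optional.of(cached);
        }
        long id = decode(code);
        Optional<String> row = database.apply(id);
        row.ifPresent(value -> cache.put(code, value));
        return row;
    }

    static long decode(String code) {
        long result = 0;
        for (char c : code.toCharArray()) {
            int digit = ALPHABET.indexOf(c);
            if (digit < 0) {
                throw new IllegalArgumentException("invalid base62 character: " + c);
            }
            result = result * 62 + digit;
        }
        return result;
    }

    public static void main(String[] args) {
        ReadPath path = new ReadPath(id ->
            id == 192847561038L ? Optional.of("https://example.com/landing") : Optional.empty());
        System.out.println(path.resolve("3ov6IcS")); // first call goes to the "database"
        System.out.println(path.resolve("3ov6IcS")); // second call is served from the cache
    }
}
```

Only found rows are cached here. Whether to cache misses as well is a separate decision that depends on how much traffic hits unknown codes.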

At Oracle, we used base62 for internal reference codes in our configuration system. Human-readable, compact enough to paste in Slack, and decodable back to the database ID without any lookup table. The encoding itself was never the bottleneck. It’s just math.

What I’m Learning

Base62 is one of those small building blocks that shows up everywhere: short codes, reference IDs, session tokens, invite links. The encoding is trivial. The design decisions around it (ID-based vs hash-based, code length, collision handling) are where the real thinking happens.

Seven characters gives you trillions of combinations. For most systems, that’s more than enough.

What encoding scheme do you use for user-facing identifiers?