Database Sharding: Splitting Data Across Machines
Single database hits limits. 1 billion users, 10 TB data. Can’t fit on one machine.
Split the data. But how you split determines whether queries are fast or slow.
Why Shard#
Vertical scaling limits: Can’t buy infinite RAM. 512GB is expensive, still not enough.
Query throughput: One server handles 10,000 queries/sec. Need 100,000? Need more servers.
Storage limits: Disk full. Need more space.
Solution: partition data across multiple servers (shards). Each shard handles subset of data.
Hash-Based Sharding#
Hash the partition key. Result determines which shard.
int shard = hash(userId) % numberOfShards;
Example: 3 shards, user_id as partition key.
- user_123: hash(123) % 3 = 0 → Shard 0
- user_456: hash(456) % 3 = 0 → Shard 0
- user_789: hash(789) % 3 = 1 → Shard 1
Pros:
- Even distribution (if hash function is good)
- No hotspots (celebrity user data spread out)
- Predictable shard location
Cons:
- Range queries hit all shards (get users 100-200? Check all shards)
- Rebalancing is expensive (add shard 4? Rehash everything)
- Related data scattered (user’s posts on different shards)
[Diagram: query for user 456 → hash(456) % 3 → Shard 0 (users 123, 456, 789); Shard 1 (users 234, 567) and Shard 2 (users 345, 678) are untouched → return user 456's data]
Hash-based sharding: hash determines shard, single-key queries are fast.
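The routing logic is only a few lines. A minimal sketch (class and shard names are illustrative, not a real library):

```java
import java.util.List;

// Minimal hash-based shard router. The shard list is hypothetical;
// in practice these would be connection pools or node addresses.
public class HashShardRouter {
    private final List<String> shards;

    public HashShardRouter(List<String> shards) {
        this.shards = shards;
    }

    // Math.floorMod avoids negative indices when the hash is negative.
    public String shardFor(long userId) {
        int index = Math.floorMod(Long.hashCode(userId), shards.size());
        return shards.get(index);
    }
}
```

The same key always maps to the same shard, so single-key lookups need no directory service. Note the catch baked into this code: `shards.size()` appears in the formula, so changing the shard count changes almost every key's destination.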
Range-Based Sharding#
Partition by key ranges.
Example: 3 shards, user_id ranges.
- Shard 0: user_id 1-1000000
- Shard 1: user_id 1000001-2000000
- Shard 2: user_id 2000001-3000000
Pros:
- Range queries efficient (get users 100-200? Hit Shard 0 only)
- Related data together (user 100’s posts likely on same shard)
- Easy to add shard (split range, no rehashing)
Cons:
- Hotspots (new users all go to last shard)
- Sequential IDs cause uneven load (signup spike hits one shard)
- Need to track range mappings
[Diagram: query for users 100-200 → range lookup → Shard 0 (range 1-1M, contains 100-200); Shard 1 (1M-2M) and Shard 2 (2M-3M) are untouched → return users 100-200]
Range-based sharding: range queries hit single shard.
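A range router is just an ordered map from each range's lower bound to its shard. A sketch using `TreeMap` (shard names and boundaries taken from the example above):

```java
import java.util.TreeMap;

// Range-based routing: map each range's lower bound to its shard.
public class RangeShardRouter {
    private final TreeMap<Long, String> ranges = new TreeMap<>();

    public RangeShardRouter() {
        ranges.put(1L, "shard0");          // user_id 1 - 1,000,000
        ranges.put(1_000_001L, "shard1");  // user_id 1,000,001 - 2,000,000
        ranges.put(2_000_001L, "shard2");  // user_id 2,000,001 - 3,000,000
    }

    // floorEntry finds the largest range start <= userId.
    public String shardFor(long userId) {
        return ranges.floorEntry(userId).getValue();
    }
}
```

A range query for users 100-200 resolves to a single shard (`floorEntry(100)` and `floorEntry(200)` both land on shard0), and adding a shard is just inserting a new boundary key into the map.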
Choosing Partition Key#
Good partition keys:
- High cardinality (many unique values)
- Evenly distributed
- Queried frequently
Bad partition keys:
- Low cardinality (status: active/inactive → 2 shards only)
- Timestamp (recent data always on one shard)
- Celebrity user (one shard gets all their traffic)
Example - E-commerce:
❌ Bad: shard by product_category (electronics shard overwhelmed during sales)
✅ Good: shard by order_id (evenly distributed)
❌ Bad: shard by created_at (recent orders all on one shard)
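The cardinality problem is easy to see empirically: hash a set of key values and count how many shards actually receive data. A sketch (values are illustrative):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Count how many of numShards a given set of key values can actually reach.
public class KeyCardinalityDemo {
    public static int shardsUsed(List<String> keys, int numShards) {
        Set<Integer> used = new HashSet<>();
        for (String key : keys) {
            used.add(Math.floorMod(key.hashCode(), numShards));
        }
        return used.size();
    }
}
```

With `status` as the key, only two distinct values exist, so at most two shards ever hold data, however many servers you buy. High-cardinality keys like `order_id` can reach every shard.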
Rebalancing#
Add new shard? Data needs to move.
Hash-based rebalancing:
Before: hash(key) % 3
After: hash(key) % 4
Almost all data moves. Expensive.
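You can measure the damage directly: compare each key's shard under modulo 3 and modulo 4 (a sketch):

```java
// Fraction of keys that change shards when modulo hashing
// goes from `before` shards to `after` shards.
public class RehashCostDemo {
    public static double fractionMoved(int numKeys, int before, int after) {
        int moved = 0;
        for (long key = 0; key < numKeys; key++) {
            int h = Long.hashCode(key);
            if (Math.floorMod(h, before) != Math.floorMod(h, after)) {
                moved++;
            }
        }
        return (double) moved / numKeys;
    }
}
```

Going from 3 to 4 shards moves about 75% of keys: a key keeps its shard only when its hash agrees modulo both 3 and 4, which happens for just 3 residues out of every 12.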
Better: Consistent hashing with virtual nodes (covered earlier). Add shard, only ~1/N data moves.
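A minimal consistent-hash ring with virtual nodes, to make the ~1/N claim concrete (a sketch; the virtual-node count is a tuning knob, and using `String.hashCode` is a simplification):

```java
import java.util.Map;
import java.util.TreeMap;

// Consistent hashing: each shard owns many points ("virtual nodes") on a
// ring; a key maps to the next point clockwise. Adding a shard only
// claims the keys that fall just before its own points.
public class ConsistentHashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();
    private final int virtualNodes;

    public ConsistentHashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    public void addShard(String shard) {
        for (int v = 0; v < virtualNodes; v++) {
            ring.put((shard + "#" + v).hashCode(), shard);
        }
    }

    public String shardFor(String key) {
        // ceilingEntry = first ring point at or after the key's hash;
        // wrap around to the first entry if we fall off the end.
        Map.Entry<Integer, String> entry = ring.ceilingEntry(key.hashCode());
        return entry != null ? entry.getValue() : ring.firstEntry().getValue();
    }
}
```

Adding a fourth shard inserts only that shard's virtual nodes, so roughly a quarter of keys move and the rest stay put, versus ~75% with plain modulo hashing.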
Range-based rebalancing: Split ranges when shard gets too large.
Before: Shard 0 (1-2M)
After: Shard 0 (1-1M), Shard 3 (1M-2M)
Only affected range moves.
Cross-Shard Queries#
Query needs data from multiple shards? Scatter-gather.
-- Find all orders for user_123.
-- If orders are sharded by order_id, these rows might be on any shard:
SELECT * FROM orders WHERE user_id = 123;
Application queries all shards, merges results. Slow.
Solutions:
- Denormalize (duplicate user data on each shard with their orders)
- Secondary index table (tracks which shards have data for each user)
- Redesign schema (shard by user_id instead)
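In application code, scatter-gather means fanning the query out to every shard and merging the results. A sketch (`queryShard` is a hypothetical stand-in for a real per-shard database call):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

// Scatter-gather: run the same query against every shard, merge results.
public class ScatterGather {
    public static <T> List<T> queryAll(List<String> shards,
                                       String userId,
                                       BiFunction<String, String, List<T>> queryShard) {
        List<T> merged = new ArrayList<>();
        for (String shard : shards) {   // one round trip per shard
            merged.addAll(queryShard.apply(shard, userId));
        }
        return merged;
    }
}
```

This sequential version pays the sum of all shard latencies; real systems fan out in parallel and still pay the slowest shard's latency, which is why re-sharding by user_id is often the better fix.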
What Complexity Looks Like#
Worked with a client considering sharding PostgreSQL. Single instance hitting limits (500GB, queries slowing).
Questions that came up:
- Shard by user_id or order_id? user_id keeps a user's data together but creates hotspots for power users.
- Hash or range? Hash for even distribution but range queries become scatter-gather.
- How to handle JOINs across shards? Denormalize or move to application layer?
- What about transactions spanning shards? Not supported, need distributed transactions or saga pattern.
Decided: Not ready for sharding complexity yet. Scaled vertically to 1TB, added read replicas, partitioned tables within single database. Bought 2 more years before needing sharding.
Sharding is powerful but adds massive complexity. Don’t do it until vertical scaling exhausted.
How do you decide when to shard?