User opens the app. Show the nearest 10 coffee shops. Sounds simple until you realize ’nearest’ means computing distance against millions of locations in under 100ms.
Posts for: #System-Design
Quadtrees: When Fixed Grids Aren’t Enough
Manhattan has 50,000 restaurants. Rural Wyoming has 3 per county. A fixed-size grid wastes cells on empty space and overloads dense areas. Quadtrees adapt.
Geohashing: Turning Coordinates into Searchable Strings
Your user is at latitude 37.7749, longitude -122.4194. Your database has 10 million locations. A full table scan comparing every coordinate pair isn’t going to work.
Work Stealing: Dynamic Load Balancing Without a Coordinator
You split work evenly across 4 threads. Two finish in 10ms, two take 10 seconds. Half your CPU sits idle while the other half grinds. Work stealing fixes this.
Delayed Message Delivery: Execute This in 30 Minutes
Send a reminder in 24 hours. Retry this job in 5 minutes. Expire this hold at midnight. Delayed execution is everywhere, and Thread.sleep isn’t the answer.
Leader Election: Picking One Node to Rule
Three nodes, one job. Without leader election, all three run it simultaneously. With leader election, exactly one does the work while the others stand by.
MapReduce: Processing Data That Won’t Fit on One Machine
Your dataset is 10TB. One machine can’t hold it, let alone process it. MapReduce splits the work across hundreds of machines with a deceptively simple API.
Trie Data Structures: Prefix Search in Milliseconds
User types three characters and expects instant suggestions. A hash map can’t do prefix lookups. A trie can, in O(k) time where k is the query length.
Inverted Indexes: How Search Actually Works
A normal index maps documents to words. An inverted index maps words to documents. That reversal is why search is fast.
Checkpointing: Resuming Long-Running Jobs Without Starting Over
A batch job runs for three hours and crashes at hour two. Without checkpointing, you restart from zero. With it, you lose ten minutes of work.
Content Fingerprinting: Detecting Near-Duplicates at Scale
Exact duplicates are easy. Near-duplicates are hard. SimHash turns documents into compact fingerprints where similar content produces similar hashes.
Priority Queues in Distributed Systems
FIFO queues treat every message equally. But urgent config updates shouldn’t wait behind a thousand bulk sync jobs. Priority queues fix this, if you handle starvation.
Reconciliation: When Your Systems Disagree
Your database says one thing. The external system says another. Reconciliation is how you find the drift before your users do.
State Machines: Making Distributed Workflows Predictable
Boolean flags and status strings create impossible states. An explicit state machine tells you exactly where a workflow is, what transitions are valid, and how to recover.
Optimistic vs Pessimistic Concurrency: Locks vs Versions
Two users update the same row. Pessimistic locking blocks one until the other finishes. Optimistic locking lets both try and fails the loser. Choosing wrong kills either throughput or correctness.
Two-Phase Commit: The Original Distributed Transaction
Two-phase commit guarantees atomicity across multiple databases. It also blocks everything if the coordinator dies. Here’s why microservices moved on.
Input Validation and Abuse Prevention in Distributed Systems
Every public write endpoint is an abuse vector. Layered defense with validation, rate limiting, and async scanning keeps your system safe without killing performance.
Approximate Counting: HyperLogLog and Count-Min Sketch
Counting unique items across billions of events. A HashSet needs gigabytes. HyperLogLog does it in 12KB. The trick is accepting a little error.
SLOs and Error Budgets: When Good Enough is a Number
100% availability is impossible and pursuing it wastes engineering time. SLOs turn reliability into a number you can reason about.
Base62 Encoding: Turning Numbers into Short Strings
A 64-bit integer is 19 digits. Encode it in base62 and it’s 7 characters. The math behind compact, URL-safe identifiers.
Distributed ID Generation: Snowflake and Friends
Auto-increment IDs break the moment you have more than one database. Snowflake IDs, UUIDs, and database sequences each solve this differently.
Event Aggregation: When 47 Notifications Become One
Showing every individual event overwhelms users. Grouping related events into summaries is a distributed systems problem hiding as a UX problem.
Social Graphs at Scale: Storing Relationships in MySQL
A follows table with two columns seems trivial. Until you need to query it from both directions, across shards, for millions of users.
Relevance Scoring: Why Chronological Order Breaks Down
Showing content in time order is simple until your users follow thousands of sources. Scoring and ranking turns a firehose into a useful stream.
Pre-Signed URLs: Uploading Files Without Touching Your Servers
Routing file uploads through your API server is a scaling bottleneck. Pre-signed URLs let clients upload directly to object storage.
Presence Systems: Who’s Online and How You Know
Green dot means online. Simple, right? Behind that dot is a distributed system making heartbeat-based guesses about user liveness.
Cursor-Based Pagination: Why Offset Breaks at Scale
OFFSET 50000 makes MySQL scan 50,000 rows just to skip them. Cursor pagination stays fast no matter how deep you go.
Fan-Out Strategies: Write-Time vs Read-Time
User posts an update. Do you push it to all followers immediately, or let them pull it when they check? The trade-off shapes your entire architecture.
WebSockets vs Long Polling: Choosing a Real-Time Transport
Your client needs real-time updates from the server. HTTP wasn’t built for this. Here’s how long polling, SSE, and WebSockets solve it differently.
Read Replicas: Hidden Consistency Traps
You added read replicas to scale reads. Now users update their profile and see the old version. Welcome to replica lag.