Boolean flags and status strings create impossible states. An explicit state machine tells you exactly where a workflow is, what transitions are valid, and how to recover.
Optimistic vs Pessimistic Concurrency: Locks vs Versions
Two users update the same row. Pessimistic locking blocks one until the other finishes. Optimistic locking lets both try and fails the loser. Choosing wrong kills either throughput or correctness.
Two-Phase Commit: The Original Distributed Transaction
Two-phase commit guarantees atomicity across multiple databases. It also blocks everything if the coordinator dies. Here’s why microservices moved on.
Input Validation and Abuse Prevention in Distributed Systems
Every public write endpoint is an abuse vector. Layered defense with validation, rate limiting, and async scanning keeps your system safe without killing performance.
Approximate Counting: HyperLogLog and Count-Min Sketch
Counting unique items across billions of events. A HashSet needs gigabytes. HyperLogLog does it in 12KB. The trick is accepting a little error.
SLOs and Error Budgets: When Good Enough is a Number
100% availability is impossible and pursuing it wastes engineering time. SLOs turn reliability into a number you can reason about.
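To make "a number you can reason about" concrete, here is a minimal sketch of the error-budget arithmetic. The function name and the 30-day window are assumptions chosen for illustration, not something the post prescribes.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO.

    slo is a fraction, e.g. 0.999 for "three nines".
    window_days is the (assumed) rolling window the SLO is measured over.
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9999):
        print(f"{target:.2%} over 30 days: {error_budget_minutes(target):.1f} min")
```

Under these assumptions a 99.9% target leaves roughly 43 minutes of downtime per 30 days, and each extra nine cuts the budget by a factor of ten.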
Base62 Encoding: Turning Numbers into Short Strings
A signed 64-bit integer can run to 19 decimal digits. Encode it in base62 and it fits in at most 11 characters. The math behind compact, URL-safe identifiers.
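As a rough illustration of that math, here is a minimal Python sketch of base62 encoding and decoding. The alphabet order (digits, then uppercase, then lowercase) and the function names are assumptions for this example; real systems pick their own ordering.

```python
import string

# Assumed alphabet: 0-9, A-Z, a-z (62 URL-safe symbols).
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase
BASE = len(ALPHABET)  # 62


def base62_encode(n: int) -> str:
    """Encode a non-negative integer as a base62 string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, rem = divmod(n, BASE)
        chars.append(ALPHABET[rem])
    # Remainders come out least-significant first, so reverse them.
    return "".join(reversed(chars))


def base62_decode(s: str) -> int:
    """Decode a base62 string back to an integer."""
    n = 0
    for ch in s:
        n = n * BASE + ALPHABET.index(ch)  # raises ValueError on bad input
    return n
```

Calling `len(base62_encode(n))` shows how many characters a given ID needs, which is a quick way to sanity-check a length budget before committing to one.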
Distributed ID Generation: Snowflake and Friends
Auto-increment IDs break the moment you have more than one database. Snowflake IDs, UUIDs, and database sequences each solve this differently.
Event Aggregation: When 47 Notifications Become One
Showing every individual event overwhelms users. Grouping related events into summaries is a distributed systems problem disguised as a UX problem.
Social Graphs at Scale: Storing Relationships in MySQL
A follows table with two columns seems trivial. Until you need to query it from both directions, across shards, for millions of users.
Relevance Scoring: Why Chronological Order Breaks Down
Showing content in time order is simple until your users follow thousands of sources. Scoring and ranking turn a firehose into a useful stream.
Pre-Signed URLs: Uploading Files Without Touching Your Servers
Routing file uploads through your API server is a scaling bottleneck. Pre-signed URLs let clients upload directly to object storage.
Presence Systems: Who’s Online and How You Know
Green dot means online. Simple, right? Behind that dot is a distributed system making heartbeat-based guesses about user liveness.
Cursor-Based Pagination: Why Offset Breaks at Scale
OFFSET 50000 makes MySQL scan 50,000 rows just to skip them. Cursor pagination stays fast no matter how deep you go.
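A minimal sketch of the cursor idea, using Python's built-in `sqlite3` as a stand-in for MySQL so it runs anywhere. The `posts` table, its columns, and the `fetch_page` helper are hypothetical; the point is only the `WHERE id > ?` seek in place of `OFFSET`.

```python
import sqlite3


def fetch_page(conn, after_id=0, page_size=3):
    """Return (rows, next_cursor) for the page following after_id.

    The query seeks directly to the cursor via the primary key
    instead of scanning and discarding skipped rows.
    """
    rows = conn.execute(
        "SELECT id, title FROM posts WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, page_size),
    ).fetchall()
    # The last id on this page becomes the cursor for the next one.
    next_cursor = rows[-1][0] if rows else None
    return rows, next_cursor


# Hypothetical demo data: seven posts in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO posts (id, title) VALUES (?, ?)",
    [(i, f"post {i}") for i in range(1, 8)],
)

if __name__ == "__main__":
    cursor = 0
    while cursor is not None:
        page, cursor = fetch_page(conn, after_id=cursor)
        print([row[0] for row in page])
```

The client passes back the last `id` it saw rather than a page number, so the cost of each query depends on the page size, not on how deep the page is.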
Fan-Out Strategies: Write-Time vs Read-Time
User posts an update. Do you push it to all followers immediately, or let them pull it when they check? The trade-off shapes your entire architecture.
WebSockets vs Long Polling: Choosing a Real-Time Transport
Your client needs real-time updates from the server. HTTP wasn’t built for this. Here’s how long polling, SSE, and WebSockets solve it differently.
Read Replicas: Hidden Consistency Traps
You added read replicas to scale reads. Now users update their profile and see the old version. Welcome to replica lag.
Thundering Herd
Cache expires. 10,000 requests hit the database simultaneously. Your DB collapses. How request coalescing and probabilistic expiration prevent the stampede.
Structured Logging in Distributed Systems
Grep through 50 log files to find one request. Or use structured logging with correlation IDs and find it in seconds.
Database Migrations Without Downtime
ALTER TABLE on a 2M row table locks it for minutes. Your users see errors. Here’s how expand-contract and shadow writes let you migrate without downtime.
Tail Latency: The P99 Problem
Your average latency looks great. Your P99 is a disaster. Why tail latency matters more than averages, and what you can actually do about it.
Ordering Guarantees in Event-Driven Systems
Events arrive out of order. User updates overwrite each other. Here’s how partition keys, sequence numbers, and causal ordering keep things straight.
Dead Letter Queues
Your consumer retried a bad message 10,000 times. It will never succeed. Dead letter queues catch the messages that can’t be processed so the rest of your system keeps moving.
Making Consumers Idempotent
Exactly-once delivery is impossible across system boundaries. Here’s the pattern that actually works: at-least-once delivery with idempotent consumers.
Exactly-Once Delivery is a Lie
Kafka says exactly-once. Your consumer processed the message twice anyway. Here’s why exactly-once is impossible across system boundaries.
Graceful Shutdown: Dying Without Dropping Requests
What happens to in-flight requests when you deploy? How to shut down cleanly without dropping connections or corrupting state.
Timeouts: The Hardest Easy Problem
Why setting timeouts is harder than it looks. Cascading failures, timeout budgets, and the art of picking the right number.
Distributed Locks: When One Process Must Win
Why distributed locking is harder than it looks. Naive Redis locks, Redlock, fencing tokens, and when to avoid locks entirely.
Connection Pooling: Why Opening Connections Is Expensive
The hidden cost of database connections. How connection pools work, why they matter, and how to size them without guessing.
Multi-Level Caching: L1, L2, and Beyond
Why one cache isn’t enough. How to layer local, distributed, and CDN caches for maximum performance without losing your mind on consistency.