Posts for: #System-Design
Structured Logging in Distributed Systems
Grep through 50 log files to find one request. Or use structured logging with correlation IDs and find it in seconds.
Database Migrations Without Downtime
ALTER TABLE on a 2M-row table locks it for minutes. Your users see errors. Here’s how expand-contract and shadow writes let you migrate without downtime.
Tail Latency: The P99 Problem
Your average latency looks great. Your P99 is a disaster. Why tail latency matters more than averages, and what you can actually do about it.
Ordering Guarantees in Event-Driven Systems
Events arrive out of order. User updates overwrite each other. Here’s how partition keys, sequence numbers, and causal ordering keep things straight.
Dead Letter Queues
Your consumer retried a bad message 10,000 times. It will never succeed. Dead letter queues catch the messages that can’t be processed so the rest of your system keeps moving.
Making Consumers Idempotent
Exactly-once delivery is impossible across boundaries. Here’s the pattern that actually works: at-least-once delivery with idempotent consumers.
Exactly-Once Delivery is a Lie
Kafka says exactly-once. Your consumer processed the message twice anyway. Here’s why exactly-once is impossible across system boundaries.
Graceful Shutdown: Dying Without Dropping Requests
What happens to in-flight requests when you deploy? How to shut down cleanly without dropping connections or corrupting state.
Timeouts: The Hardest Easy Problem
Why setting timeouts is harder than it looks. Cascading failures, timeout budgets, and the art of picking the right number.
Distributed Locks: When One Process Must Win
Why distributed locking is harder than it looks. Naive Redis locks, Redlock, fencing tokens, and when to avoid locks entirely.
Connection Pooling: Why Opening Connections Is Expensive
The hidden cost of database connections. How connection pools work, why they matter, and how to size them without guessing.
Multi-Level Caching: L1, L2, and Beyond
Why one cache isn’t enough. How to layer local, distributed, and CDN caches for maximum performance without losing your mind on consistency.
Cache Stampede: When Expiry Causes Chaos
What happens when a popular cache key expires and thousands of requests hit your database at once. Three patterns to prevent the thundering herd.
Cache Invalidation: The Hard Problem
There are only two hard things in computer science: cache invalidation and naming things. Here’s why invalidation is so tricky, and what actually works.
Caching Patterns: Cache-Aside, Write-Through, and Friends
The four fundamental caching patterns every engineer should know. When to use cache-aside vs write-through vs write-behind vs read-through.
CRDTs: Data Structures That Never Conflict
How CRDTs let distributed systems merge updates without coordination. The math that makes ‘conflict-free’ possible.
Gossip Protocols: How Rumors Keep Systems Alive
How distributed systems spread information without a central coordinator. The surprisingly effective technique of random peer-to-peer chatter.
Vector Clocks and Lamport Timestamps
How distributed systems track ‘what happened before what’ without trusting wall clocks. Lamport timestamps for ordering, vector clocks for detecting conflicts.
The In-Memory Trap: Why Objects Are Slow
In-memory doesn’t always mean fast. How shifting from object-based to vector-based storage (Apache Arrow) delivered a 13x performance boost.
Raft: The Understandable Consensus Algorithm
How distributed systems agree on state. A practical look at Raft’s Leader Election and Log Replication, finally making sense of consensus.
The CAP Theorem: The Cliché I Tried to Avoid
Why the CAP Theorem is the most misunderstood rule in system design. Addressing the ‘Pick 2’ lie and how it sets the stage for consensus algorithms.
Distributed Tracing: Finding the Needle in the Haystack
When a request vanishes into a maze of 10 microservices. How Distributed Tracing and OpenTelemetry keep you from going insane during an outage.
Transactional Outbox: Solving the Dual Write Problem
Why your event-driven system is lying to you. Solving the ‘Dual Write’ problem using the Transactional Outbox pattern.
Materialized Views: The Read Optimization Pattern
Why standard views are just aliases and how materialized views act as an ‘in-database cache’ to solve the cross-shard query problem.
Saga Pattern: Managing Distributed Transactions
Why distributed ACID is a trap. Understanding choreography and orchestration sagas for long-running business processes.
Event Sourcing: Events as Source of Truth
Storing events instead of current state. How event sourcing works, rebuilding state from events, and when the complexity is worth it.
CQRS: Separating Reads from Writes
Command Query Responsibility Segregation: why you might want separate models for reading and writing data. When it helps, when it’s overkill, and implementation patterns.
Change Data Capture: Streaming Database Changes
How to capture and stream database changes in real-time. CDC patterns, implementation approaches, and when to use it instead of application-level events.
Two Generals Problem: Why Consensus is Impossible
The thought experiment that proves distributed consensus can’t be guaranteed over unreliable networks. Why acknowledgments create infinite regress and what it means for real systems.