Posts for: #Distributed-Systems
Hot Key Detection and Mitigation
Cache Eviction Policies
Testing Eventually Consistent Systems: When Assertions Need Patience
You write a record, immediately read it back, and assert equality. The test fails. Not because of a bug, but because the read hit a replica that hasn’t caught up yet. Your test is correct. Your assertion timing isn’t.
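One common way to give an assertion patience is to retry it until a deadline. The sketch below is a minimal illustration, not a specific test library's API: `eventually` and `LaggingReplica` are hypothetical names, and the replica is a stand-in that returns stale data for its first few reads.

```python
import time


def eventually(check, timeout=5.0, interval=0.1):
    """Retry `check` until it stops raising AssertionError or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return check()
        except AssertionError:
            if time.monotonic() >= deadline:
                raise  # out of patience: surface the last failure
            time.sleep(interval)


class LaggingReplica:
    """Hypothetical replica that serves stale (None) data for the first few reads."""

    def __init__(self, value, lag_reads):
        self._value = value
        self._remaining = lag_reads

    def read(self):
        if self._remaining > 0:
            self._remaining -= 1
            return None
        return self._value


replica = LaggingReplica("v1", lag_reads=2)


def replica_has_caught_up():
    assert replica.read() == "v1"


# Fails twice against the stale replica, then passes on the third read.
eventually(replica_has_caught_up, timeout=1.0, interval=0.01)
```

The timeout still bounds the test: a replica that never converges fails the assertion instead of hanging.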
Contract Testing: Verifying Service Interactions Without E2E Tests
Team A changes their API response. Team B’s service breaks in production. The integration test suite passed because it was running against a mock from 3 months ago.
Chaos Engineering: Breaking Things on Purpose
Your system passed all tests. Every health check is green. You’re confident it handles failures. Then a network partition happens in production and everything falls apart. You never actually tested failure.
Consumer Group Rebalancing: The Partition Shuffle
You have 3 consumers reading from 6 Kafka partitions. One consumer crashes. The remaining 2 need to pick up its partitions. That handoff isn’t as smooth as you’d hope.
Log Compaction: Keeping the Latest Without Keeping Everything
Your event log has 100 million records. Key ‘user-42’ has been updated 500 times. You only care about the latest value. But deleting old entries would break consumers who haven’t caught up yet.
Merkle Trees: Detecting Differences Without Comparing Everything
Two database replicas should have identical data. One has 50 million rows. Comparing row by row would take hours. Merkle trees detect a mismatch by comparing a single root hash, then locate it by descending only into the subtrees whose hashes differ.
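A minimal sketch of the idea, assuming both replicas hold the same number of rows and that the count is a power of two (real systems pad or bucket rows; that part is left out). The function names `build` and `diff` are illustrative.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build(rows):
    """Return the tree as a list of levels: leaf hashes first, root level last.

    Assumes len(rows) is a power of two.
    """
    levels = [[h(row) for row in rows]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels


def diff(a, b, level=None, index=0):
    """Indices of rows that differ, visiting only subtrees whose hashes mismatch."""
    if level is None:
        level = len(a) - 1  # start at the root
    if a[level][index] == b[level][index]:
        return []  # identical subtree: skip everything below it
    if level == 0:
        return [index]
    return (diff(a, b, level - 1, 2 * index)
            + diff(a, b, level - 1, 2 * index + 1))


rows_a = [f"row-{i}".encode() for i in range(8)]
rows_b = list(rows_a)
rows_b[5] = b"row-5-changed"

print(diff(build(rows_a), build(rows_b)))  # [5]
```

When the root hashes match, `diff` returns after one comparison. When they don't, the work is proportional to the number of differing rows times the tree height, not to the table size.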
Quorum Reads and Writes: Tuning Consistency with Math
Three replicas, one write. How many replicas need to acknowledge before the write is ‘done’? One? All three? The answer determines your consistency guarantees.
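The usual rule of thumb is that a read is guaranteed to touch at least one replica holding the latest acknowledged write when R + W > N. The sketch below checks that rule by brute force for three replicas; the function names are illustrative.

```python
from itertools import combinations


def quorums_overlap(n: int, w: int, r: int) -> bool:
    """The arithmetic rule: read and write quorums must share a replica."""
    return r + w > n


def always_intersect(n: int, w: int, r: int) -> bool:
    """Brute force: does every possible write set share a replica with every read set?"""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


N = 3
for w in range(1, N + 1):
    for r in range(1, N + 1):
        assert quorums_overlap(N, w, r) == always_intersect(N, w, r)
        print(f"W={w} R={r} overlap={quorums_overlap(N, w, r)}")
```

With N = 3, the settings W = 2, R = 2 overlap while W = 1, R = 1 do not, which is the trade between latency and consistency the post title refers to.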
Push vs Pull Metrics Collection: Two Ways to Get the Numbers
Should your services push metrics to a collector, or should the collector pull metrics from your services? Sounds like a minor detail. It changes your entire monitoring architecture.
Downsampling: Keeping Trends, Not Every Data Point
You’re storing metrics at 1-second granularity. After a year, that’s 31 million data points per metric. Nobody looks at second-level data from 6 months ago. But you still need the trends.
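The arithmetic, plus one possible rollup, can be sketched as follows. Keeping min, max, and mean per bucket is an assumption about which aggregates matter; the `downsample` function is illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 one-second points per metric


def downsample(points, bucket_seconds):
    """Roll (unix_ts, value) points up into fixed-width buckets.

    Returns a sorted list of (bucket_start, min, max, mean).
    """
    buckets = {}
    for ts, value in points:
        start = ts - ts % bucket_seconds  # align to the bucket boundary
        buckets.setdefault(start, []).append(value)
    return [
        (start, min(values), max(values), sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]


# Two minutes of synthetic one-second data rolled up into one-minute buckets.
raw = [(t, float(t % 60)) for t in range(120)]
print(downsample(raw, 60))

# Hourly rollups shrink a year from ~31.5 million points to 8,760.
print(SECONDS_PER_YEAR, SECONDS_PER_YEAR // 3600)
```

Storing min and max alongside the mean keeps spikes visible after the raw points are gone.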
Time-Series Databases: Storage Built for Timestamps
Your monitoring system ingests 100,000 metrics per second. Each is a timestamp, a name, and a value. A regular database buckles. Time-series databases are designed for exactly this shape of data.
Transcoding Pipelines: Processing Video at Scale
User uploads one video file. Your system needs to produce 240p, 480p, 720p, and 1080p versions, each with multiple audio tracks. That’s a distributed workflow problem.
Adaptive Bitrate Streaming: Adjusting Quality on the Fly
User starts watching in 1080p. They walk into an elevator. Bandwidth drops. The video freezes and buffers. Adaptive bitrate streaming would have dropped to 480p and kept playing.
CDN and Edge Caching: Serving Content from Next Door
Your origin server is in us-east-1. Your user is in Mumbai. That’s 200ms of latency before a single byte transfers. CDNs put your content on a server down the street.
Proximity Search: Finding What’s Nearby at Scale
User opens the app. Show the nearest 10 coffee shops. Sounds simple until you realize ‘nearest’ means computing distance against millions of locations in under 100ms.
Quadtrees: When Fixed Grids Aren’t Enough
Manhattan has 50,000 restaurants. Rural Wyoming has 3 per county. A fixed-size grid wastes cells on empty space and overloads dense areas. Quadtrees adapt.
Geohashing: Turning Coordinates into Searchable Strings
Your user is at latitude 37.7749, longitude -122.4194. Your database has 10 million locations. A full table scan comparing every coordinate pair isn’t going to work.
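A minimal sketch of standard geohash encoding: interleave longitude and latitude bits by repeatedly halving the ranges, then emit one base-32 character per five bits. Nearby points usually share a prefix, so a prefix lookup on an indexed string column can replace the full scan (cell boundaries are the known exception). The function name `geohash_encode` is illustrative.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat: float, lon: float, precision: int = 6) -> str:
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # geohash starts with a longitude bit
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            bits = bits << 1
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


print(geohash_encode(37.7749, -122.4194, precision=5))  # 9q8yy
```

Each extra character narrows the cell, so the prefix length used in the query controls the search radius.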
Work Stealing: Dynamic Load Balancing Without a Coordinator
You split work evenly across 4 threads. Two finish in 10ms, two take 10 seconds. Half your CPU sits idle while the other half grinds. Work stealing fixes this.
Delayed Message Delivery: Execute This in 30 Minutes
Send a reminder in 24 hours. Retry this job in 5 minutes. Expire this hold at midnight. Delayed execution is everywhere, and Thread.sleep isn’t the answer.
Leader Election: Picking One Node to Rule
Three nodes, one job. Without leader election, all three run it simultaneously. With leader election, exactly one does the work while the others stand by.
MapReduce: Processing Data That Won’t Fit on One Machine
Your dataset is 10TB. One machine can’t hold it, let alone process it. MapReduce splits the work across hundreds of machines with a deceptively simple API.
Inverted Indexes: How Search Actually Works
A forward index maps documents to words. An inverted index maps words to documents. That reversal is why search is fast.
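A minimal sketch of that reversal, assuming whitespace tokenization and lowercasing only (real engines add stemming, stop words, and ranking). The function names and sample documents are illustrative.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())  # intersect posting sets
    return result


docs = {
    1: "leader election picks one node",
    2: "quorum reads and quorum writes",
    3: "one leader many followers",
}
index = build_inverted_index(docs)
print(search(index, "leader one"))  # {1, 3}
```

A query becomes a few dictionary lookups and a set intersection, with no scan over document text.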
Checkpointing: Resuming Long-Running Jobs Without Starting Over
A batch job runs for three hours and crashes at hour two. Without checkpointing, you restart from zero. With a checkpoint every ten minutes, you lose at most ten minutes of work.
Content Fingerprinting: Detecting Near-Duplicates at Scale
Exact duplicates are easy. Near-duplicates are hard. SimHash turns documents into compact fingerprints where similar content produces similar hashes.
Priority Queues in Distributed Systems
FIFO queues treat every message equally. But urgent config updates shouldn’t wait behind a thousand bulk sync jobs. Priority queues fix this, if you handle starvation.
Reconciliation: When Your Systems Disagree
Your database says one thing. The external system says another. Reconciliation is how you find the drift before your users do.
State Machines: Making Distributed Workflows Predictable
Boolean flags and status strings create impossible states. An explicit state machine tells you exactly where a workflow is, what transitions are valid, and how to recover.
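A minimal sketch of an explicit transition table. The state names and the allowed transitions are hypothetical, chosen only to illustrate the shape.

```python
# Hypothetical workflow: each state lists the states it may move to.
TRANSITIONS = {
    "pending": {"running", "cancelled"},
    "running": {"succeeded", "failed"},
    "failed": {"running"},  # a retry
    "succeeded": set(),     # terminal
    "cancelled": set(),     # terminal
}


def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the move is not in the table."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"invalid transition: {current} -> {target}")
    return target


state = "pending"
state = transition(state, "running")
state = transition(state, "failed")
state = transition(state, "running")  # retry is an explicit, allowed move
state = transition(state, "succeeded")
print(state)  # succeeded
```

Because every legal move is listed in one place, a combination like "cancelled and succeeded" cannot be represented, and recovery code can look up what is allowed from the current state.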
Optimistic vs Pessimistic Concurrency: Locks vs Versions
Two users update the same row. Pessimistic locking blocks one until the other finishes. Optimistic locking lets both try and fails the loser. Choosing wrong kills either throughput or correctness.
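The optimistic side can be sketched with a version number checked at write time. This is a single-threaded, in-memory illustration with hypothetical names; a real database performs the check and the write atomically, for example with a conditional update on the version column.

```python
class VersionConflict(Exception):
    pass


class VersionedStore:
    """In-memory stand-in for a table with a version column."""

    def __init__(self):
        self._rows = {}  # key -> (version, value)

    def read(self, key):
        return self._rows.get(key, (0, None))

    def write(self, key, expected_version, value):
        current_version, _ = self.read(key)
        if current_version != expected_version:
            raise VersionConflict(
                f"{key}: expected v{expected_version}, found v{current_version}"
            )
        self._rows[key] = (current_version + 1, value)
        return current_version + 1


store = VersionedStore()
version_a, _ = store.read("row-1")  # user A reads version 0
version_b, _ = store.read("row-1")  # user B reads version 0

store.write("row-1", version_a, "from A")  # A wins, row is now version 1
try:
    store.write("row-1", version_b, "from B")  # B's version is stale
except VersionConflict as exc:
    print("retry needed:", exc)
```

Nobody waits on a lock, but the loser must re-read and retry, which is why this approach suits workloads where conflicts are rare.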