Two-phase commit guarantees atomicity across multiple databases. It also blocks everything if the coordinator dies. Here’s why microservices moved on.
Posts for: #Distributed-Systems
Input Validation and Abuse Prevention in Distributed Systems
Every public write endpoint is an abuse vector. Layered defense with validation, rate limiting, and async scanning keeps your system safe without killing performance.
Approximate Counting: HyperLogLog and Count-Min Sketch
Counting unique items across billions of events. A HashSet needs gigabytes. HyperLogLog does it in 12KB. The trick is accepting a little error.
SLOs and Error Budgets: When Good Enough is a Number
100% availability is impossible and pursuing it wastes engineering time. SLOs turn reliability into a number you can reason about.
Distributed ID Generation: Snowflake and Friends
Auto-increment IDs break the moment you have more than one database. Snowflake IDs, UUIDs, and database sequences each solve this differently.
Event Aggregation: When 47 Notifications Become One
Showing every individual event overwhelms users. Grouping related events into summaries is a distributed systems problem disguised as a UX problem.
Presence Systems: Who’s Online and How You Know
Green dot means online. Simple, right? Behind that dot is a distributed system making heartbeat-based guesses about user liveness.
Fan-Out Strategies: Write-Time vs Read-Time
User posts an update. Do you push it to all followers immediately, or let them pull it when they check? The trade-off shapes your entire architecture.
WebSockets vs Long Polling: Choosing a Real-Time Transport
Your client needs real-time updates from the server. HTTP wasn’t built for this. Here’s how long polling, SSE, and WebSockets solve it differently.
Thundering Herd
Cache expires. 10,000 requests hit the database simultaneously. Your DB collapses. How request coalescing and probabilistic expiration prevent the stampede.
Structured Logging in Distributed Systems
Grep through 50 log files to find one request. Or use structured logging with correlation IDs and find it in seconds.
Tail Latency: The P99 Problem
Your average latency looks great. Your P99 is a disaster. Why tail latency matters more than averages, and what you can actually do about it.
Ordering Guarantees in Event-Driven Systems
Events arrive out of order. User updates overwrite each other. Here’s how partition keys, sequence numbers, and causal ordering keep things straight.
Dead Letter Queues
Your consumer retried a bad message 10,000 times. It will never succeed. Dead letter queues catch the messages that can’t be processed so the rest of your system keeps moving.
Making Consumers Idempotent
Exactly-once delivery is impossible across system boundaries. Here’s the pattern that actually works: at-least-once delivery with idempotent consumers.
Exactly-Once Delivery is a Lie
Kafka says exactly-once. Your consumer processed the message twice anyway. Here’s why exactly-once is impossible across system boundaries.
Timeouts: The Hardest Easy Problem
Why setting timeouts is harder than it looks. Cascading failures, timeout budgets, and the art of picking the right number.
Distributed Locks: When One Process Must Win
Why distributed locking is harder than it looks. Naive Redis locks, Redlock, fencing tokens, and when to avoid locks entirely.
CRDTs: Data Structures That Never Conflict
How CRDTs let distributed systems merge updates without coordination. The math that makes ‘conflict-free’ possible.
Gossip Protocols: How Rumors Keep Systems Alive
How distributed systems spread information without a central coordinator. The surprisingly effective technique of random peer-to-peer chatter.
Vector Clocks and Lamport Timestamps
How distributed systems track ‘what happened before what’ without trusting wall clocks. Lamport timestamps for ordering, vector clocks for detecting conflicts.
Raft: The Understandable Consensus Algorithm
How distributed systems agree on state. A practical look at Raft’s leader election and log replication that finally makes sense of consensus.
The CAP Theorem: The Cliché I Tried to Avoid
Why the CAP Theorem is the most misunderstood rule in system design. Addressing the ‘Pick 2’ lie and how it sets the stage for consensus algorithms.
Distributed Tracing: Finding the Needle in the Haystack
When a request vanishes into a maze of 10 microservices. How distributed tracing and OpenTelemetry keep you from going insane during an outage.
Materialized Views: The Read Optimization Pattern
Why standard views are just saved queries, and how materialized views act as an ‘in-database cache’ to solve the cross-shard query problem.
Two Generals Problem: Why Consensus is Impossible
The thought experiment that proves distributed consensus can’t be guaranteed over unreliable networks. Why acknowledgments create infinite regress and what it means for real systems.
Database Sharding: Splitting Data Across Machines
How to partition a database across multiple servers. Hash-based vs range-based sharding, rebalancing strategies, and the complexity that comes with it.
Rate Limiting: Token Bucket vs Leaky Bucket
Protecting services from overload with rate limiting. Token bucket and leaky bucket algorithms explained with Java implementations and real-world trade-offs.
Backpressure: When Consumers Can’t Keep Up
Handling slow consumers in distributed systems. Queue growth, memory exhaustion, and strategies for applying backpressure: rejection, rate limiting, and flow control.
Retry Strategies: Exponential Backoff and Jitter
How to retry failed requests without overwhelming servers. Exponential backoff, jitter, and when to give up. Java implementations and real-world patterns.