Posts for: #Architecture

Blue-Green Deployments

2026-04-28sohilladhani

#distributed-systems #system-design #architecture #patterns

Deploy the new version. Test it. Switch traffic. If something breaks, switch back. Instant rollback. Sounds ideal. The database migrations are where it gets complicated. The Pattern Blue-green runs two identical production environments. Blue is live. Green is idle. You deploy your new version to green. You test it against real infrastructure but with no live traffic. When you’re confident, you flip the load balancer to point to green. Green is now live.

Canary Releases

2026-04-27sohilladhani

#distributed-systems #system-design #architecture #patterns

CI passed. Staging tests passed. You’ve reviewed the code three times. Then you ship to production and something you never predicted breaks at scale. What Canary Means A canary release sends a small fraction of real traffic to the new version before switching everyone over. 1% of users hit v2, 99% hit v1. You watch your metrics. If v2 behaves well, you expand: 5%, then 20%, then 100%. If metrics degrade, you route that 1% back to v1 and investigate without anyone else affected.

Feature Flags

2026-04-26sohilladhani

#distributed-systems #system-design #architecture #patterns

You ship a feature. Three minutes later, on-call pings you: error rate spiked. You need to roll back. A full redeploy takes 20 minutes. With a feature flag, rollback takes 30 seconds. What a Flag Is A feature flag is a conditional in your code. If the flag is on, the new code path runs. If it’s off, the old behavior runs. The flag is a config value read at runtime, not at deploy time.

The Sidecar Pattern and Service Mesh

2026-04-25sohilladhani

#distributed-systems #system-design #architecture #microservices #kubernetes

Every team writes the same retry logic. The same circuit breaker boilerplate. The same mTLS handshake setup. The platform team changes the retry policy and now has to update 30 services. There’s a better way. The Sidecar Pattern A sidecar is a separate process running in the same pod as your service. It intercepts all network traffic in and out. Your service code is unchanged. The sidecar handles retries, timeouts, circuit breaking, load balancing, and observability.

Service Discovery

2026-04-24sohilladhani

#distributed-systems #system-design #architecture #microservices

Your service starts. It gets an IP. Three days later it restarts and gets a different IP. Every service that had the old IP hardcoded is now broken. This is why you need service discovery. The Problem With Static Config In a small system, hardcoding IPs in config files works. Then you move to containers. Containers restart, scale up, scale down. IPs change constantly. You need a way for services to find each other without knowing addresses in advance.

API Gateway Patterns

2026-04-23sohilladhani

#distributed-systems #system-design #architecture #microservices

You have 12 microservices. Every mobile client talks to all 12. Each service handles its own auth, its own rate limiting, its own CORS. Adding a 13th service means updating every client app. The gateway pattern fixes that. What a Gateway Does An API gateway sits between clients and your services. Clients make one call. The gateway routes it, authenticates the caller, applies rate limits, then proxies to the right service.

Feature Stores

2026-04-22sohilladhani

#distributed-systems #machine-learning #system-design #architecture

You train a model using yesterday’s data. You serve it using today’s data. The feature computation logic is slightly different between the two. The model degrades silently and you spend a week figuring out why. The Training-Serving Skew Problem ML models are trained on offline batches: historical data, features computed via Spark jobs, labels aggregated over time. At serving time, features are computed online: live data, lower latency budget, different code path.

Sequenced Writes

2026-04-16sohilladhani

#distributed-systems #system-design #ordering #architecture

Two events arrive out of order. You don’t know they’re out of order. You process them anyway. The system ends up in a state that never should have existed. Sequence Numbers as the Foundation A global sequence number assigned to every write event is the most direct solution to ordering problems. Event 1, event 2, event 3. If event 4 arrives after event 6, you know something is missing. You wait, or request a replay, rather than blindly processing forward.

Market Data Distribution

2026-04-15sohilladhani

#distributed-systems #system-design #architecture #streaming

Every trade generates a tick: a price, a volume, a timestamp. An active stock might generate thousands of ticks per second. Distributing that data to thousands of subscribers simultaneously is its own problem. What Tick Data Looks Like A tick is small: instrument ID, price, quantity, timestamp. The volume is the problem. During market open or a news event, tick rates spike dramatically. Subscribers range from high-frequency algorithms (latency-sensitive, need every tick) to dashboards (showing “current price,” don’t care about ticks they missed).

Order Matching Engine

2026-04-14sohilladhani

#distributed-systems #system-design #architecture #trading

A stock exchange doesn’t just record trades. It runs an algorithm that decides which buyer gets matched with which seller. That algorithm is the matching engine, and its design choices are unusually interesting. The Limit Order Book The core data structure is the limit order book (LOB): two sorted collections of orders, bids (buy orders) and asks (sell orders). Bids are sorted by price descending (highest buyer first), asks by price ascending (lowest seller first).

Storage Tiering

2026-04-10sohilladhani

#distributed-systems #storage #architecture #system-design

Most of your data is accessed once and then never again. Storing it on fast, expensive storage forever is just burning money. Hot, Warm, Cold The canonical model is three tiers based on access frequency. Hot storage (SSD-backed, high IOPS) handles recent data that’s accessed constantly. Warm storage (standard HDD or S3 Standard-IA) holds data accessed occasionally. Cold storage (archival, like Glacier) holds data that might never be touched again but legally must be retained.

Content-Addressable Storage

2026-04-08sohilladhani

#distributed-systems #storage #architecture #system-design

Two users upload the same 50 MB file. Naive storage keeps two copies. Content-addressable storage keeps one. What “Content-Addressable” Means Instead of locating data by where it lives (a path, a filename), you locate it by what it is. Hash the content, use the hash as the key. Same content, same hash, same storage location. SHA-256 a file and store the result as its address. The practical consequence: deduplication becomes automatic.

Offline-First Sync

2026-04-07sohilladhani

#distributed-systems #consistency #mobile #architecture #system-design

The field rep drove into a dead zone. The mobile app kept working: they filled out three forms, updated two account records, closed a deal. Forty minutes later, connectivity returned and the sync ran. Two of those records had been updated by a desktop user in the meantime. The mobile changes were silently dropped. No error. No prompt. Just gone. The Core Problem The client operates against a local snapshot while offline.

Revision History and Snapshotting

2026-04-06sohilladhani

#distributed-systems #storage #architecture #system-design

A user hits Ctrl+Z forty times and expects to land exactly where they were yesterday. That is not just undo. That is a complete audit trail of every edit, stored efficiently, queryable at any point in time. The naive approach: store a full copy of the document after every change. Works for ten users. Collapses at ten thousand. Deltas, Not Copies Instead of storing full document state after every edit, store only what changed: the operation (insert 3 chars at position 12, delete 5 chars at position 20).

Operational Transformation

2026-04-05sohilladhani

#distributed-systems #consistency #concurrency #architecture #system-design

Two users edit the same document simultaneously. User A inserts “X” at position 5. User B deletes the character at position 3. Apply both naively and the result is corrupted. The positions shifted when B’s deletion ran first, and A’s insertion lands in the wrong place. The Position Problem Operations encode positions at generation time, not application time. When document state changes between generation and application, positions are stale. Operational Transformation (OT) transforms an incoming op relative to already-applied ops before executing it.

Lambda and Kappa Architecture

2026-04-04sohilladhani

#distributed-systems #architecture #kafka #stream-processing #system-design

Real-time results are fast and approximate. Historical results are slow and accurate. The tension between them is where Lambda and Kappa architecture come from. Lambda: Two Pipelines Lambda runs two parallel systems. The batch layer processes all historical data on a schedule (Spark on HDFS, every few hours) and produces ground truth. The speed layer processes the live stream (Kafka Streams or Flink) for low-latency results. The serving layer merges both: “latest batch result plus stream delta since the last batch.

ZooKeeper Ephemeral Nodes

2026-04-01sohilladhani

#distributed-systems #zookeeper #concurrency #architecture

Redis locks expire after a TTL. If your process crashes, you wait up to 30 seconds for the lock to become available. ZooKeeper takes a different approach: lock it to the session, not a timer. Ephemeral Nodes ZooKeeper has two kinds of nodes: persistent (survive until explicitly deleted) and ephemeral (automatically deleted when the client session expires). A session is kept alive by a heartbeat. If the client crashes, heartbeats stop, the session expires after a configurable timeout, and the ephemeral node vanishes.

The Redlock Algorithm

2026-03-31sohilladhani

#distributed-systems #redis #concurrency #architecture

A single Redis instance holds your lock. Redis crashes. The lock entry is gone. But your client already received “acquired” before the crash and is happily running. Another client acquires the same lock on the recovered instance. Two lock holders. The single-instance Redis lock has a fundamental flaw. Quorum Locking Redlock is Redis creator Antirez’s answer. Instead of one Redis, use N independent instances (typically 5). To acquire the lock:

Testing Eventually Consistent Systems: When Assertions Need Patience

2026-03-26sohilladhani

#distributed-systems #system-design #testing #consistency #architecture

You write a record, immediately read it back, and assert equality. The test fails. Not because of a bug, but because the read hit a replica that hasn’t caught up yet. Your test is correct. Your assertion timing isn’t.

Contract Testing: Verifying Service Interactions Without E2E Tests

2026-03-25sohilladhani

#distributed-systems #system-design #testing #microservices #architecture

Team A changes their API response. Team B’s service breaks in production. The integration test suite passed because it was running against a mock from 3 months ago.

Chaos Engineering: Breaking Things on Purpose

2026-03-24sohilladhani

#distributed-systems #system-design #testing #resilience #architecture

Your system passed all tests. Every health check is green. You’re confident it handles failures. Then a network partition happens in production and everything falls apart. You never actually tested failure.

Consumer Group Rebalancing: The Partition Shuffle

2026-03-23sohilladhani

#distributed-systems #system-design #architecture #kafka #event-driven

You have 3 consumers reading from 6 Kafka partitions. One consumer crashes. The remaining 2 need to pick up its partitions. That handoff isn’t as smooth as you’d hope.

Log Compaction: Keeping the Latest Without Keeping Everything

2026-03-22sohilladhani

#distributed-systems #system-design #architecture #kafka #event-driven

Your event log has 100 million records. Key ‘user-42’ has been updated 500 times. You only care about the latest value. But deleting old entries would break consumers who haven’t caught up yet.

Merkle Trees: Detecting Differences Without Comparing Everything

2026-03-21sohilladhani

#distributed-systems #system-design #architecture #data-structures #replication

Two database replicas should have identical data. One has 50 million rows. Comparing row by row would take hours. Merkle trees find the differences by comparing a single hash.

Quorum Reads and Writes: Tuning Consistency with Math

2026-03-20sohilladhani

#distributed-systems #system-design #consistency #replication #architecture

Three replicas, one write. How many replicas need to acknowledge before the write is ‘done’? One? All three? The answer determines your consistency guarantees.

Push vs Pull Metrics Collection: Two Ways to Get the Numbers

2026-03-19sohilladhani

#distributed-systems #system-design #architecture #monitoring #microservices

Should your services push metrics to a collector, or should the collector pull metrics from your services? Sounds like a minor detail. It changes your entire monitoring architecture.

Downsampling: Keeping Trends, Not Every Data Point

2026-03-18sohilladhani

#distributed-systems #system-design #architecture #databases #monitoring

You’re storing metrics at 1-second granularity. After a year, that’s 31 million data points per metric. Nobody looks at second-level data from 6 months ago. But you still need the trends.

Time-Series Databases: Storage Built for Timestamps

2026-03-17sohilladhani

#distributed-systems #system-design #architecture #databases #monitoring

Your monitoring system ingests 100,000 metrics per second. Each is a timestamp, a name, and a value. A regular database buckles. Time-series databases are designed for exactly this shape of data.

Transcoding Pipelines: Processing Video at Scale

2026-03-16sohilladhani

#distributed-systems #system-design #architecture #java #distributed-processing

User uploads one video file. Your system needs to produce 240p, 480p, 720p, and 1080p versions, each with multiple audio tracks. That’s a distributed workflow problem.

Adaptive Bitrate Streaming: Adjusting Quality on the Fly

2026-03-15sohilladhani

#distributed-systems #system-design #architecture #performance #streaming

User starts watching in 1080p. They walk into an elevator. Bandwidth drops. The video freezes and buffers. Adaptive bitrate streaming would have dropped to 480p and kept playing.

[Older posts] >