You shard your database to scale. You pick a shard key. If you pick something unrelated to tenant, queries for one tenant’s data scatter across all shards. If you pick tenant ID, all of one tenant’s data lands on one shard, and a large tenant can overwhelm it.
Why Tenant ID Makes Sense as a Shard Key
Tenant isolation is the priority in a multi-tenant system. If all of Tenant A’s data is on Shard 2, a query for Tenant A’s records goes to Shard 2 only.
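A minimal sketch of tenant-based routing, assuming a fixed shard count and a stable hash of the tenant ID. The names `NUM_SHARDS` and `shard_for_tenant` are hypothetical and only illustrate the idea that every query for one tenant resolves to a single shard.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count, chosen for illustration


def shard_for_tenant(tenant_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a tenant ID to a shard index using a stable hash.

    Python's built-in hash() is randomized per process for strings,
    so a cryptographic digest is used to keep routing stable across restarts.
    """
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Every lookup for the same tenant lands on the same shard.
shard = shard_for_tenant("tenant-a")
```

Modulo hashing like this reshuffles tenants when the shard count changes; a lookup table from tenant to shard is a common alternative when large tenants need to be moved individually.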
Tenant A generates 10x the normal query load for 20 minutes. Your database CPU spikes. Tenant B, doing nothing unusual, sees 5-second query times. Tenant B’s SLA is breached. Tenant A didn’t do anything wrong. This is the noisy neighbor problem.
Why It Happens
Shared infrastructure means shared resources. CPU, memory, I/O, and network bandwidth are finite and drawn from a common pool. When one tenant consumes more than their share, others get less. In a single-tenant system, this is your own problem.
You’re building a SaaS product. Do you give each customer their own database? Put everyone in one? Somewhere in between? The answer affects cost, isolation, compliance, and how much operational pain you take on for the life of the product.
The Three Models
Shared database, shared schema: all tenants in the same tables, with a tenant_id column. One database to manage. Lowest cost. The risk: a bug that forgets the tenant_id filter leaks one customer’s data to another.
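One way to reduce the forgotten-filter risk in the shared-schema model is to route every read through a helper that always binds tenant_id. The sketch below uses an in-memory SQLite database; the `invoices` table and the function names are assumptions made for illustration.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create a tiny shared-schema table with rows from two tenants."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE invoices ("
        "id INTEGER PRIMARY KEY, tenant_id TEXT NOT NULL, amount INTEGER)"
    )
    conn.executemany(
        "INSERT INTO invoices (tenant_id, amount) VALUES (?, ?)",
        [("tenant-a", 100), ("tenant-a", 250), ("tenant-b", 75)],
    )
    return conn


def invoices_for(conn: sqlite3.Connection, tenant_id: str) -> list:
    """Return invoice amounts for exactly one tenant.

    The tenant filter lives here, so callers cannot forget it.
    """
    rows = conn.execute(
        "SELECT amount FROM invoices WHERE tenant_id = ? ORDER BY id",
        (tenant_id,),
    )
    return [row[0] for row in rows]
```

Application code that only ever calls `invoices_for` cannot issue an unfiltered query by accident, although this is a convention rather than an enforcement mechanism.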
Sometimes the change you need to make breaks compatibility. You can’t add a default. The field type genuinely needs to change. You can’t keep the old schema. Here’s what you do instead.
The New Topic Strategy
The cleanest approach: create a new topic with the new schema. Producers write to both old and new topics in parallel. Consumers migrate to the new topic one by one. When all consumers are on the new topic, stop writing to the old one.
Kafka retains messages for days or weeks. Your consumer code will be updated independently of your producer. That means old messages need to be readable by new consumer code, and new messages need to be readable by old consumer code. You can’t just change a field.
What Backward and Forward Mean
Backward compatibility: a new consumer can read messages written with the old schema. If you add an optional field with a default, old messages (which don’t have that field) are still valid.
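A small sketch of the backward-compatible case, using plain JSON rather than a schema registry. The `locale` field and its default are hypothetical; the point is only that a new consumer supplies a default when an old message lacks the field.

```python
import json


def parse_user_event(raw: str) -> dict:
    """New consumer code: reads both old and new message shapes."""
    msg = json.loads(raw)
    return {
        "user_id": msg["user_id"],
        # "locale" was added later as an optional field with a default,
        # so old messages that lack it are still valid.
        "locale": msg.get("locale", "en"),
    }


old_message = '{"user_id": 42}'
new_message = '{"user_id": 42, "locale": "fr"}'
```

Both messages parse with the same consumer code, which is what backward compatibility asks for.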
Service A writes a Kafka message with field user_id. Service B reads it. Service A’s team renames it to userId next sprint. Service B starts throwing deserialization errors at runtime. Neither team knew about the other.
The Problem
In a microservices system passing messages through Kafka, producers and consumers evolve independently. There’s no enforced contract. A producer can change a field name, add a required field, or change a data type, and the consumer finds out when deserialization fails in production.
100 nodes in a cluster. Every node needs to know when another node fails. If every node heartbeats every other node, that’s 9,900 heartbeat streams. At scale, this becomes the majority of your network traffic.
How SWIM Works
SWIM (Scalable Weakly-consistent Infection-style Membership) uses gossip-based dissemination with indirect probing. Instead of every node monitoring every other node, each node monitors a small random subset. When a node suspects another has failed, it asks a few other nodes to probe the suspect on its behalf.
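The direct-then-indirect probe can be sketched as follows. This is a simplified illustration, not the full protocol: `ping` and `ping_via` are hypothetical callables standing in for network calls, and timeouts, gossip dissemination, and suspicion timers are left out.

```python
import random
from typing import Callable, Iterable, List


def pick_helpers(members: List[str], me: str, target: str, k: int,
                 rng=random) -> List[str]:
    """Choose up to k random members, excluding ourselves and the target."""
    candidates = [m for m in members if m not in (me, target)]
    return rng.sample(candidates, min(k, len(candidates)))


def probe(target: str,
          ping: Callable[[str], bool],
          ping_via: Callable[[str, str], bool],
          helpers: Iterable[str]) -> str:
    """Return "alive" or "suspect" for the target node."""
    if ping(target):
        return "alive"  # the direct probe succeeded
    for helper in helpers:
        # Ask another node to probe the target on our behalf.
        if ping_via(helper, target):
            return "alive"
    return "suspect"  # nobody could reach it
```

The indirect step is what keeps one flaky link between two nodes from marking a healthy node as failed.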
Node 3 is down. A write comes in that belongs there. You could reject it. Or you could accept it, hold it somewhere safe, and deliver it when Node 3 comes back.
What Hinted Handoff Does
In a distributed database with replication, each write goes to a coordinator node, which forwards it to the nodes that own the data. If an owner is unreachable, the coordinator stores the write temporarily with a hint: “this write is intended for Node 3.”
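A toy model of that behavior, assuming an in-memory coordinator that tracks which owners are down. The `Coordinator` class and its method names are hypothetical, and real systems also bound how long hints are kept.

```python
from collections import defaultdict


class Coordinator:
    """Forwards writes to owner nodes and keeps hints for unreachable ones."""

    def __init__(self, nodes: dict):
        self.nodes = nodes              # node name -> dict acting as its storage
        self.down = set()               # owners currently unreachable
        self.hints = defaultdict(list)  # node name -> [(key, value), ...]

    def write(self, owner: str, key: str, value: str) -> None:
        if owner in self.down:
            # Hold the write with a hint naming its intended owner.
            self.hints[owner].append((key, value))
        else:
            self.nodes[owner][key] = value

    def node_recovered(self, owner: str) -> None:
        """Replay stored hints, in arrival order, once the owner is back."""
        self.down.discard(owner)
        for key, value in self.hints.pop(owner, []):
            self.nodes[owner][key] = value
```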
Your query is always “give me all events for user X, sorted by time.” A row-oriented database gives you rows where you pay for every column you didn’t ask for. Wide-column stores flip the model: you design the schema around your query, not the other way around.
How It Works
In a wide-column store like Cassandra or HBase, the primary key has two parts: the partition key and the clustering key.
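A rough in-memory model of the two-part key, assuming the usual Cassandra semantics: the partition key selects which partition holds the row, and the clustering key orders rows inside it. The `EventTable` class is hypothetical, with user ID as the partition key and timestamp as the clustering key.

```python
import bisect
from collections import defaultdict


class EventTable:
    """Rows grouped by partition key and kept sorted by clustering key."""

    def __init__(self):
        # partition key (user_id) -> list of (timestamp, payload), sorted
        self._partitions = defaultdict(list)

    def insert(self, user_id: str, ts: int, payload: str) -> None:
        # insort keeps the partition ordered by timestamp as rows arrive.
        bisect.insort(self._partitions[user_id], (ts, payload))

    def events_for(self, user_id: str) -> list:
        """All events for one user, already sorted by time."""
        return list(self._partitions.get(user_id, []))
```

The read path does no sorting and touches no other user's data, which is the payoff of designing the key around the query.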
Deploy the new version. Test it. Switch traffic. If something breaks, switch back. Instant rollback. Sounds ideal. The database migrations are where it gets complicated.
The Pattern
Blue-green runs two identical production environments. Blue is live. Green is idle. You deploy your new version to green. You test it against real infrastructure but with no live traffic. When you’re confident, you flip the load balancer to point to green. Green is now live.
CI passed. Staging tests passed. You’ve reviewed the code three times. Then you ship to production and something you never predicted breaks at scale.
What Canary Means
A canary release sends a small fraction of real traffic to the new version before switching everyone over. 1% of users hit v2, 99% hit v1. You watch your metrics. If v2 behaves well, you expand: 5%, then 20%, then 100%. If metrics degrade, you route that 1% back to v1 and investigate without anyone else affected.
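One possible way to split traffic by percentage is to hash each user into a stable bucket from 0 to 99. This is a sketch under that assumption; in practice the split often lives in the load balancer or service mesh instead, and the function names here are hypothetical.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Assign a user to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def version_for(user_id: str, canary_percent: int) -> str:
    """Route the lowest buckets to v2; everyone else stays on v1."""
    return "v2" if bucket(user_id) < canary_percent else "v1"
```

Because the bucket is derived from the user ID, a user who sees v2 at 5% keeps seeing v2 at 20%, and setting the percentage to 0 routes everyone back to v1.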
You ship a feature. Three minutes later, on-call pings you: error rate spiked. You need to roll back. A full redeploy takes 20 minutes. With a feature flag, rollback takes 30 seconds.
What a Flag Is
A feature flag is a conditional in your code. If the flag is on, the new code path runs. If it’s off, the old behavior runs. The flag is a config value read at runtime, not at deploy time.
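In code, that looks roughly like the following. The `flags` dictionary stands in for whatever runtime config store is in use, and the flag name and checkout functions are hypothetical.

```python
# Stand-in for a runtime config store; in practice this would be
# refreshed from a flag service or config system while the app runs.
flags = {"new_checkout": False}


def checkout(cart_total: int) -> str:
    # The flag is read on every call, so flipping it takes effect
    # without a redeploy.
    if flags.get("new_checkout", False):
        return f"new:{cart_total}"
    return f"old:{cart_total}"
```

Defaulting a missing flag to off means a config outage falls back to the old behavior.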
Every team writes the same retry logic. The same circuit breaker boilerplate. The same mTLS handshake setup. The platform team changes the retry policy and now has to update 30 services. There’s a better way.
The Sidecar Pattern
A sidecar is a separate process running in the same pod as your service. It intercepts all network traffic in and out. Your service code is unchanged. The sidecar handles retries, timeouts, circuit breaking, load balancing, and observability.
Your service starts. It gets an IP. Three days later it restarts and gets a different IP. Every service that had the old IP hardcoded is now broken. This is why you need service discovery.
The Problem With Static Config
In a small system, hardcoding IPs in config files works. Then you move to containers. Containers restart, scale up, scale down. IPs change constantly. You need a way for services to find each other without knowing addresses in advance.
You have 12 microservices. Every mobile client talks to all 12. Each service handles its own auth, its own rate limiting, its own CORS. Adding a 13th service means updating every client app. The gateway pattern fixes that.
What a Gateway Does
An API gateway sits between clients and your services. Clients make one call. The gateway routes it, authenticates the caller, applies rate limits, then proxies to the right service.
You train a model using yesterday’s data. You serve it using today’s data. The feature computation logic is slightly different between the two. The model degrades silently and you spend a week figuring out why.
The Training-Serving Skew Problem
ML models are trained on offline batches: historical data, features computed via Spark jobs, labels aggregated over time. At serving time, features are computed online: live data, lower latency budget, different code path.
“Find the 10 most similar items to this one” sounds simple. With millions of items represented as 256-dimensional vectors, exact search is too slow to be useful in production.
What Embeddings Are
An ML model maps an item (a product, a document, a user’s history) to a dense numeric vector. The geometry of that vector space encodes semantic similarity: similar items land close together. You train the model on interaction data and the embeddings learn to represent “things that users treat similarly.”
You don’t know what a user wants. But you know what people like them have wanted. That’s the intuition behind collaborative filtering.
The Two Approaches
User-based CF finds users similar to you, then recommends what they liked. Item-based CF finds items similar to what you’ve already liked. Item-based is generally more stable because user behavior shifts rapidly (you might buy a couch once), while item similarity changes slowly (a couch is similar to other furniture regardless of who buys it).
You log out. Your JWT is still valid. The server has no record it was ever issued. This is the stateless token revocation problem.
Why Revocation Is Hard
JWTs are stateless by design. The server validates a token by checking the signature and expiry. It doesn’t consult a database. This is what makes them fast and scalable. But it means there’s no central list of “valid tokens” to update when a token should no longer be accepted.
OAuth 2.0 is not an authentication protocol. It’s an authorization protocol. That confusion is the root of most OAuth misuse.
What OAuth Actually Does
OAuth lets a user grant a third-party application limited access to their account without sharing their password. The user sees a consent screen listing what the app wants to access. They approve. The app gets a token with exactly those permissions. The password is only ever entered at the authorization server, so the app never sees it.
The server doesn’t remember you. Every request carries proof of who you are. That’s the point of a token.
The Structure
A JWT is three base64url-encoded segments joined by dots: header, payload, signature. The header says which algorithm signed it. The payload carries claims: user ID, roles, expiry time. The signature is a cryptographic proof that the header and payload haven’t been tampered with.
The server doesn’t need a database lookup to verify a JWT.
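To make the three-segment structure concrete, here is a standard-library sketch of signing and verifying an HS256 token. It assumes a shared secret and checks only the signature, the algorithm, and the `exp` claim; production code would normally use a maintained JWT library rather than this hand-rolled version.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWT segments are."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    """Decode a base64url segment, restoring the stripped padding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _mac(signing_input: str, secret: bytes) -> bytes:
    return hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()


def sign(claims: dict, secret: bytes) -> str:
    """Build header.payload.signature for the given claims."""
    header = {"alg": "HS256", "typ": "JWT"}
    parts = [b64url(json.dumps(p, separators=(",", ":")).encode("utf-8"))
             for p in (header, claims)]
    signing_input = ".".join(parts)
    return signing_input + "." + b64url(_mac(signing_input, secret))


def verify(token: str, secret: bytes, now: Optional[float] = None) -> Optional[dict]:
    """Return the claims if signature and expiry check out, else None."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        expected = _mac(header_b64 + "." + payload_b64, secret)
        # Constant-time comparison of the presented and expected signatures.
        if not hmac.compare_digest(b64url_decode(sig_b64), expected):
            return None
        header = json.loads(b64url_decode(header_b64))
        claims = json.loads(b64url_decode(payload_b64))
    except ValueError:
        return None  # malformed token
    if header.get("alg") != "HS256":
        return None
    current = time.time() if now is None else now
    if claims.get("exp", 0) <= current:
        return None  # expired, or no expiry claim at all
    return claims
```

Nothing in `verify` touches storage: the secret and the token are enough, which is exactly why verification needs no database lookup.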
Two events arrive out of order. You don’t know they’re out of order. You process them anyway. The system ends up in a state that never should have existed.
Sequence Numbers as the Foundation
A global sequence number assigned to every write event is the most direct solution to ordering problems. Event 1, event 2, event 3. If event 4 arrives after event 6, you know something is missing. You wait, or request a replay, rather than blindly processing forward.
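A small sketch of the "wait rather than process forward" behavior: events that arrive ahead of a gap are buffered until the missing sequence number shows up. The `Sequencer` class is hypothetical, and the replay request is left out.

```python
class Sequencer:
    """Releases events strictly in sequence order, buffering across gaps."""

    def __init__(self, first: int = 1):
        self.next_seq = first  # the sequence number we are waiting for
        self.pending = {}      # seq -> event, for events that arrived early

    def receive(self, seq: int, event: str) -> list:
        """Accept one event; return the events now safe to process, in order."""
        if seq < self.next_seq:
            return []  # already processed: a duplicate or a late replay
        self.pending[seq] = event
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready
```

A long-lived entry in `pending` is the signal that something is missing and a replay should be requested.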
Every trade generates a tick: a price, a volume, a timestamp. An active stock might generate thousands of ticks per second. Distributing that data to thousands of subscribers simultaneously is its own problem.
What Tick Data Looks Like
A tick is small: instrument ID, price, quantity, timestamp. The volume is the problem. During market open or a news event, tick rates spike dramatically. Subscribers range from high-frequency algorithms (latency-sensitive, need every tick) to dashboards (showing “current price,” don’t care about ticks they missed).
A stock exchange doesn’t just record trades. It runs an algorithm that decides which buyer gets matched with which seller. That algorithm is the matching engine, and its design choices are unusually interesting.
The Limit Order Book
The core data structure is the limit order book (LOB): two sorted collections of orders, bids (buy orders) and asks (sell orders). Bids are sorted by price descending (highest buyer first), asks by price ascending (lowest seller first).
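The two sorted sides can be sketched with a pair of heaps. This is an illustration only: it covers insertion and best-price lookup, not matching, and it assumes integer prices plus arrival order as the tie-break between orders at the same price, which the text above does not specify.

```python
import heapq
import itertools


class OrderBook:
    """Bids ordered highest price first, asks ordered lowest price first."""

    def __init__(self):
        self._bids = []  # heap of (-price, arrival, qty): max price on top
        self._asks = []  # heap of (price, arrival, qty): min price on top
        self._arrival = itertools.count()

    def add(self, side: str, price: int, qty: int) -> None:
        n = next(self._arrival)  # assumed tie-break: earlier order first
        if side == "buy":
            heapq.heappush(self._bids, (-price, n, qty))
        elif side == "sell":
            heapq.heappush(self._asks, (price, n, qty))
        else:
            raise ValueError(f"unknown side: {side!r}")

    def best_bid(self):
        """Highest buy price in the book, or None if there are no bids."""
        return -self._bids[0][0] if self._bids else None

    def best_ask(self):
        """Lowest sell price in the book, or None if there are no asks."""
        return self._asks[0][0] if self._asks else None
```

Negating the bid price turns Python's min-heap into a max-heap, so both best prices are available in constant time.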
“Sent” is not “delivered.” “Delivered” is not “opened.” These are three different states and conflating them causes subtle bugs in badge counts and notification UIs.
The Delivery Gap
APNs and FCM give you delivery confirmation at the gateway level, not the device level. You know the gateway accepted your payload. You don’t know if the device received it, displayed it, or was offline when it arrived.
For most notifications this is fine.
Your retry logic fires. The user gets the same notification twice. They think your app is broken. They’re not wrong.
The Problem with Retries
Push delivery is at-least-once by design. Your server sends to APNs/FCM, the network hiccups, you don’t get a response, so you retry. APNs might have delivered the first one. The user now sees two identical alerts.
The fix lives at two levels: your server and the gateway.
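The text does not spell out the fix at this point, so the following is only one plausible server-side piece, stated as an assumption: derive a stable ID for each logical notification and reuse it on every retry. That same ID could then be passed to the gateway as a collapse identifier (APNs accepts an `apns-collapse-id` header and FCM a `collapse_key`), so a duplicate can replace the original instead of stacking. The `Outbox` class and field names are hypothetical.

```python
import hashlib


def notification_id(user_id: str, event_id: str) -> str:
    """Stable ID for one logical notification; identical on every retry."""
    raw = f"{user_id}:{event_id}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:32]


class Outbox:
    """Tracks which notifications the gateway has acknowledged."""

    def __init__(self):
        self._acked = set()

    def build(self, user_id: str, event_id: str, body: str):
        """Return a payload to send, or None if it was already acknowledged."""
        nid = notification_id(user_id, event_id)
        if nid in self._acked:
            return None
        # The same collapse_id goes out on the first attempt and on retries.
        return {"collapse_id": nid, "body": body}

    def mark_acked(self, nid: str) -> None:
        self._acked.add(nid)
```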
You don’t send a push notification directly to a phone. You send it to Apple or Google, and they deliver it for you. That indirection has consequences most backend engineers don’t think about until something breaks.
APNs and FCM
Apple Push Notification Service (APNs) handles iOS. Firebase Cloud Messaging (FCM) handles Android (and can handle iOS too). Your server maintains a persistent HTTP/2 connection to these gateways and submits payloads. The gateway handles the actual delivery to the device, retries if the device is offline, and tells you when a token is no longer valid.
Most of your data is accessed once and then never again. Storing it on fast, expensive storage forever is just burning money.
Hot, Warm, Cold
The canonical model is three tiers based on access frequency. Hot storage (SSD-backed, high IOPS) handles recent data that’s accessed constantly. Warm storage (standard HDD or S3 Standard-IA) holds data accessed occasionally. Cold storage (archival, like Glacier) holds data that might never be touched again but legally must be retained.
You save a 200 MB file. One word changed. Re-uploading 200 MB to sync that change is absurd. Delta sync is how you avoid it.
The Core Idea
Split the file into blocks. On an update, compare the new version’s blocks against the stored version’s blocks. Transfer only the blocks that changed.
Rsync pioneered this. The remote side computes a fast rolling checksum for each block of its copy and sends those checksums to the client. The client works out which of its local blocks match and which don’t, then transmits only the mismatches.
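The block comparison can be sketched as below. This is a deliberate simplification: it compares fixed-offset blocks with SHA-256 and does not implement rsync's rolling checksum, which can also find matching blocks that have shifted position. The tiny `BLOCK_SIZE` is an assumption chosen to keep the example readable.

```python
import hashlib

BLOCK_SIZE = 4  # unrealistically small, for demonstration only


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Hash each fixed-size block of the data."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(old: bytes, new: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Indexes of blocks in `new` that differ from, or don't exist in, `old`."""
    old_h = block_hashes(old, block_size)
    new_h = block_hashes(new, block_size)
    return [
        i for i, h in enumerate(new_h)
        if i >= len(old_h) or old_h[i] != h
    ]
```

With fixed offsets, inserting a single byte near the start shifts every later block and marks them all as changed, which is the weakness the rolling checksum addresses.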
Two users upload the same 50 MB file. Naive storage keeps two copies. Content-addressable storage keeps one.
What “Content-Addressable” Means
Instead of locating data by where it lives (a path, a filename), you locate it by what it is. Hash the content, use the hash as the key. Same content, same hash, same storage location. SHA-256 a file and store the result as its address.
The practical consequence: deduplication becomes automatic.
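A minimal in-memory sketch of that idea, with a dictionary standing in for the storage backend. The `ContentStore` class is hypothetical.

```python
import hashlib


class ContentStore:
    """Stores blobs under the SHA-256 of their content."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        """Store the data and return its address (the hex digest)."""
        key = hashlib.sha256(data).hexdigest()
        # Identical content hashes to the same key, so a second upload
        # of the same bytes adds nothing.
        self._blobs.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def __len__(self) -> int:
        return len(self._blobs)
```

Two uploads of the same file return the same address and occupy one slot, with no explicit deduplication step.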