The Noisy Neighbor Problem
Tenant A generates 10x the normal query load for 20 minutes. Your database CPU spikes. Tenant B, doing nothing unusual, sees 5-second query times. Tenant B’s SLA is breached. Tenant A didn’t do anything wrong. This is the noisy neighbor problem.
Why It Happens
Shared infrastructure means shared resources. CPU, memory, I/O, and network bandwidth are finite and drawn from a common pool. When one tenant consumes more than their share, others get less. In a single-tenant system, this is your own problem. In a multi-tenant system, it’s one customer’s problem affecting every other customer.
The problem compounds in databases. Heavy queries hold locks. Lock contention slows every query waiting on those locks, not just the ones from the noisy tenant. When tenants share tables, a single long-running transaction on Tenant A’s data can delay Tenant B’s completely unrelated writes.
Per-Tenant Rate Limiting
The first defense is application-level rate limiting per tenant. Before a query hits the database, check whether this tenant has consumed their query budget in the current window. If they have, return a 429 or queue the request rather than letting it through.
This protects the database, though queuing moves the backlog into the queue rather than removing it. Either way, the noisy tenant degrades their own experience, not others'.
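The budget check described above can be sketched as a fixed-window counter per tenant. This is a minimal illustration, not a production limiter: the class name `TenantRateLimiter`, the helper `handle_query`, and the budget values are all hypothetical, the state lives in one process's memory, and the code is not thread-safe.

```python
import time


class TenantRateLimiter:
    """Fixed-window query budget per tenant (illustrative sketch, hypothetical names)."""

    def __init__(self, max_queries, window_seconds, clock=time.monotonic):
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the window can be tested without sleeping
        # tenant_id -> [window_start, queries_used_in_window]
        self._windows = {}

    def allow(self, tenant_id):
        """Return True and spend one unit of budget, or False if the budget is used up."""
        now = self.clock()
        window = self._windows.setdefault(tenant_id, [now, 0])
        if now - window[0] >= self.window_seconds:
            # The previous window has expired: start a fresh one.
            window[0] = now
            window[1] = 0
        if window[1] >= self.max_queries:
            return False
        window[1] += 1
        return True


def handle_query(limiter, tenant_id, run_query):
    """Return (status, result). 429 means the tenant exhausted its budget."""
    if not limiter.allow(tenant_id):
        return 429, None
    return 200, run_query()
```

A fixed window is the simplest option but allows a burst at each window boundary. A sliding window or token bucket smooths that out at the cost of a little more state.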
Infrastructure-Level Isolation
For compute, Kubernetes resource quotas cap the total CPU and memory that pods in a namespace can request. Map namespaces to tenants and the container limits behind those quotas are enforced by the kernel through cgroups, which application code can’t bypass.
For databases, per-tenant connection pool limits prevent one tenant from holding all the connections. Query timeouts ensure a runaway query doesn’t hold resources indefinitely. These two controls together handle most noisy neighbor cases without needing physical isolation.
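A per-tenant connection cap can be sketched with one bounded semaphore per tenant sitting in front of the shared pool. The names `TenantConnectionGate`, `TenantLimitExceeded`, and `connection_slot` are hypothetical, and the sketch assumes a single application process. A real deployment would more likely enforce this in the connection pooler.

```python
import threading
from contextlib import contextmanager


class TenantLimitExceeded(Exception):
    """Raised when a tenant already holds its maximum number of connections."""


class TenantConnectionGate:
    """Caps concurrent connections per tenant (illustrative sketch, hypothetical names)."""

    def __init__(self, max_per_tenant, wait_seconds=0.0):
        self.max_per_tenant = max_per_tenant
        self.wait_seconds = wait_seconds
        self._lock = threading.Lock()
        self._slots = {}  # tenant_id -> BoundedSemaphore

    def _semaphore(self, tenant_id):
        # Create each tenant's semaphore lazily, under a lock to avoid races.
        with self._lock:
            if tenant_id not in self._slots:
                self._slots[tenant_id] = threading.BoundedSemaphore(self.max_per_tenant)
            return self._slots[tenant_id]

    @contextmanager
    def connection_slot(self, tenant_id):
        """Hold one of the tenant's slots for the duration of the with-block."""
        sem = self._semaphore(tenant_id)
        if self.wait_seconds > 0:
            acquired = sem.acquire(timeout=self.wait_seconds)
        else:
            acquired = sem.acquire(blocking=False)
        if not acquired:
            raise TenantLimitExceeded(tenant_id)
        try:
            yield
        finally:
            sem.release()
```

Callers would wrap their checkout from the shared pool in `with gate.connection_slot(tenant_id):`. The query timeout itself is usually better left to the database, for example a per-session statement timeout such as PostgreSQL's `statement_timeout`, assuming the database in use offers one.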
At Salesforce
I saw this firsthand on the CRM platform. Certain enterprise customers ran large batch exports, sometimes millions of records, against the same read replicas as interactive users. These exports would saturate replica I/O and push query times across the board above acceptable thresholds. The fix was routing exports to a dedicated replica set. If those replicas were saturated, exports slowed down, but interactive queries from all tenants were unaffected. The separation of batch and interactive workloads was the key, not just rate limiting.
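The general idea of separating batch from interactive reads can be sketched as a router that picks a replica pool by workload type. This is not the Salesforce implementation, only an illustration of the pattern: `Query`, `ReplicaRouter`, the workload labels, and the replica names are all hypothetical.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass(frozen=True)
class Query:
    tenant_id: str
    sql: str
    workload: str  # "interactive" or "batch" (hypothetical labels)


class ReplicaRouter:
    """Sends batch and interactive reads to separate replica pools (illustrative sketch)."""

    def __init__(self, interactive_replicas, batch_replicas):
        if not interactive_replicas or not batch_replicas:
            raise ValueError("each pool needs at least one replica")
        # Round-robin within each pool; the pools never overlap.
        self._pools = {
            "interactive": cycle(interactive_replicas),
            "batch": cycle(batch_replicas),
        }

    def pick_replica(self, query):
        try:
            pool = self._pools[query.workload]
        except KeyError:
            raise ValueError(f"unknown workload: {query.workload}") from None
        return next(pool)
```

Because a batch query can only ever land on the batch pool, a saturated export slows other exports and nothing else.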
What I’m Learning
Detecting the noisy neighbor is easy in hindsight. Detecting it in real time requires per-tenant metrics: queries per second, P99 latency, CPU time attributed to each tenant. Without per-tenant visibility, you know there’s a problem but not whose load is causing it.
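A rough sketch of that per-tenant visibility: record each query's latency under its tenant, then ask for the query count, the P99, and the tenant consuming the most total query time. `TenantMetrics` and its methods are hypothetical names. The sketch keeps every sample in memory and uses total query wall time as a stand-in for CPU time, which a real system would measure directly and aggregate over time windows.

```python
from collections import defaultdict


class TenantMetrics:
    """In-memory per-tenant latency tracking (illustrative sketch, hypothetical names)."""

    def __init__(self):
        self._latencies = defaultdict(list)  # tenant_id -> latencies in seconds

    def record(self, tenant_id, latency_seconds):
        self._latencies[tenant_id].append(latency_seconds)

    def query_count(self, tenant_id):
        return len(self._latencies.get(tenant_id, []))

    def p99(self, tenant_id):
        """Nearest-rank 99th percentile latency, or None if there are no samples."""
        samples = sorted(self._latencies.get(tenant_id, []))
        if not samples:
            return None
        rank = -(-99 * len(samples) // 100)  # ceil(0.99 * n) in integer arithmetic
        return samples[rank - 1]

    def busiest_tenant(self):
        """Tenant with the largest total query time, a rough proxy for resource use."""
        if not self._latencies:
            return None
        return max(self._latencies, key=lambda t: sum(self._latencies[t]))
```

When latency spikes, `busiest_tenant()` gives a first candidate for whose load is responsible, and comparing each tenant's `p99()` shows who is being hurt by it.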
Do you track resource consumption per tenant, and can you identify the noisy neighbor when something spikes?