Hot Key Detection and Mitigation
Redis executes commands on a single thread per instance. One key receiving 50,000 reads per second will pin that core, and everything else on the shard queues behind it.
This is the hot key problem. Unlike a database where you might add replicas or indexes, a single Redis key is owned by a single shard. Traffic concentration on that key concentrates CPU on that node.
Detection is straightforward: redis-cli --hotkeys scans the keyspace and reports the most frequently accessed keys, and OBJECT FREQ returns the access counter for an individual key. Both read the server's LFU counters, so both require an LFU maxmemory policy (allkeys-lfu or volatile-lfu). The harder part is doing something about a hot key once you find one.
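To make the idea concrete, here is a client-side sketch of the same frequency counting over a sampled access stream (keys pulled from a MONITOR session or from application instrumentation, say). The real --hotkeys scan does this server-side via LFU counters; this is an illustration of the ranking, not the tool itself.

```python
from collections import Counter

def top_hot_keys(access_log, n=3):
    """Rank keys by observed access count.

    access_log: an iterable of key names sampled from traffic.
    This approximates, client-side, what `redis-cli --hotkeys`
    reports from the server's LFU counters.
    """
    return Counter(access_log).most_common(n)

# Simulated sample: one key dominates the traffic.
sample = ["feature_flags"] * 500 + ["user:42"] * 20 + ["session:9"] * 5
print(top_hot_keys(sample, n=2))  # feature_flags leads by a wide margin
```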
Three mitigations that actually work.

First: an in-process local cache. Put a short-TTL (2-5 second) cache in front of Redis for that specific key, inside the application process itself (a JVM-local map, for example). Most reads never hit the network at all. The tradeoff is stale data, but for config-style keys that rarely change, 2 seconds of staleness is fine.

Second: read replicas with READONLY. Redis Cluster supports replica reads; your client can fan out reads for the hot key across the primary and its replicas.

Third: key sharding. Instead of one feature_flags key, write identical data to feature_flags_1 through feature_flags_N and route reads randomly across the set.
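The key-sharding option can be sketched as a pair of helpers. This is an assumption-laden illustration: client stands in for any redis-py-style object exposing set/get, and the in-memory stand-in exists only so the snippet runs without a server.

```python
import random

N_SHARDS = 8  # assumption: copy count, tuned to the key's traffic

def write_sharded(client, base_key, value):
    # Write identical data to base_key_1 .. base_key_N; the suffixed
    # names hash to different cluster slots, spreading the load.
    for i in range(1, N_SHARDS + 1):
        client.set(f"{base_key}_{i}", value)

def read_sharded(client, base_key):
    # Route each read to a random copy.
    i = random.randrange(1, N_SHARDS + 1)
    return client.get(f"{base_key}_{i}")

class FakeClient:
    """Dict-backed stand-in for a Redis client, for demonstration only."""
    def __init__(self):
        self.store = {}
    def set(self, key, value):
        self.store[key] = value
    def get(self, key):
        return self.store.get(key)

fake = FakeClient()
write_sharded(fake, "feature_flags", '{"dark_mode": true}')
print(read_sharded(fake, "feature_flags"))
```

The write side is N times more expensive and must touch every copy, which is why this fits read-heavy, rarely-written keys like feature flags.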
At Salesforce, a shared Redis cluster had a single tenant-config key taking 50,000 reads per second during a product launch. CPU on that shard pinned at 100%, and read latency jumped from 0.2ms to 15ms for every tenant on that node, not just the one with the hot key. We caught it with a redis-cli --hotkeys scan. The fix was a 3-second local cache in the app tier: shard CPU dropped immediately, and latency recovered within a minute.
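A minimal sketch of that app-tier fix, assuming fetch is a callable wrapping the actual Redis GET (the name and TTL here are illustrative, not the production code):

```python
import time

class LocalCache:
    """Short-TTL in-process cache in front of a hot Redis read."""

    def __init__(self, fetch, ttl=3.0):
        self.fetch = fetch          # callable that performs the real GET
        self.ttl = ttl              # seconds of acceptable staleness
        self._value = None
        self._expires = 0.0

    def get(self):
        now = time.monotonic()
        if now >= self._expires:
            self._value = self.fetch()       # one network round trip
            self._expires = now + self.ttl   # serve from memory until expiry
        return self._value

# Demo: count how often the backend is actually hit.
hits = {"count": 0}

def fetch_from_redis():
    # assumption: stands in for something like client.get("tenant:cfg")
    hits["count"] += 1
    return "config-v1"

cfg = LocalCache(fetch_from_redis, ttl=3.0)
cfg.get()
cfg.get()  # second read is served from process memory, no network hop
print(hits["count"])
```

At 50,000 reads per second with a 3-second TTL, roughly one read in 150,000 reaches Redis; the rest are absorbed in-process, which is why the shard recovered so quickly.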
The frustrating thing about hot keys: they’re often someone else’s key causing your latency.
What I’m Learning
Key sharding adds write complexity since you have to keep N copies in sync. Local caching adds stale-data risk. How have you decided which mitigation to reach for first?