Count the unique users who visited a page today. Simple: put every user ID in a HashSet, check the size.

100 million unique users. Each ID is 8 bytes. That’s 800MB just for the raw IDs, before HashSet overhead. For one counter. Now multiply by thousands of pages.

You can’t afford exact counting at this scale. But you can get 99.2% accuracy in 12KB.

HyperLogLog#

The idea is counterintuitive. Hash each item and look at the binary representation. Count the leading zeros. The more leading zeros you’ve ever observed, the more unique items you’ve probably processed.

Why? If you flip a coin, getting 5 heads in a row (like 5 leading zeros) happens once every 32 trials on average. If you’ve seen a hash with 10 leading zeros, you’ve probably processed around 1,024 unique items.
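A tiny sanity check of that arithmetic (the class name is mine, purely illustrative):

```java
// A hash prefix of k leading zeros occurs with probability 1/2^k,
// so observing k leading zeros suggests roughly 2^k distinct items.
public class LeadingZeroOdds {
    public static double probability(int k) { return 1.0 / (1L << k); }
    public static long impliedItems(int k) { return 1L << k; }
}
```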

HyperLogLog maintains multiple “registers” (buckets), each tracking the maximum leading zeros seen. The harmonic mean of all registers gives an accurate cardinality estimate.
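A minimal sketch of that register update, assuming a 64-bit hash and 2^14 registers (names and parameters here are illustrative, not Redis’s internals):

```java
// Sketch of the core HyperLogLog register update.
public class HllSketch {
    private static final int P = 14;                // 2^14 = 16,384 registers
    private final byte[] registers = new byte[1 << P];

    public void add(long hash) {
        int bucket = (int) (hash >>> (64 - P));     // first P bits pick the register
        long rest = hash << P;                      // remaining bits drive the zero count
        // rank = position of the first 1-bit, counted from 1; the OR with 1
        // caps the rank when the remaining bits are all zero
        byte rank = (byte) (Long.numberOfLeadingZeros(rest | 1L) + 1);
        if (rank > registers[bucket]) {
            registers[bucket] = rank;               // keep the maximum seen per register
        }
    }

    public byte register(int i) { return registers[i]; }
}
```

The estimate itself is then a bias-corrected harmonic mean over all registers, which Redis computes for you.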

// Redis does the heavy lifting (PFADD + PFCOUNT under the hood)
public long countUnique(String pageId, String userId) {
    redisTemplate.opsForHyperLogLog().add("visitors:" + pageId, userId);
    return redisTemplate.opsForHyperLogLog().size("visitors:" + pageId);
}

12KB of memory. 0.81% standard error. Billions of unique items.

graph TD
    HS[HashSet: Exact Count] --> M1[800MB for 100M items]
    HLL[HyperLogLog: ~99.2% Accurate] --> M2[12KB for 100M items]
    M1 --> E1[Exact but expensive]
    M2 --> E2[Approximate but tiny]
    style HS fill:#000000,stroke:#ff0000,stroke-width:2px,color:#fff
    style M1 fill:#000000,stroke:#ff0000,stroke-width:2px,color:#fff
    style E1 fill:#000000,stroke:#ff0000,stroke-width:2px,color:#fff
    style HLL fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style M2 fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style E2 fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff

Merging Counts#

The best part: HyperLogLog registers are mergeable. Count unique visitors on page A and page B separately. Merge them to get unique visitors across both pages combined. No double-counting.

// Merge daily counts into weekly
redisTemplate.opsForHyperLogLog().union(
    "visitors:page1:week",
    "visitors:page1:mon",
    "visitors:page1:tue",
    "visitors:page1:wed"
);

This is why HyperLogLog beats just sampling. Samples can’t be merged without losing accuracy. HLL registers merge cleanly. Perfect for multi-level aggregation.
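Under the hood, merging two HyperLogLog states is just an element-wise maximum over their register arrays. A hypothetical sketch (not Redis’s PFMERGE implementation):

```java
// Merging HLL states: take the max of each register pair.
// Because each register already holds a maximum, merging is lossless:
// merge(A, B) gives exactly the registers you'd get from counting A and B together.
public class HllMerge {
    public static byte[] merge(byte[] a, byte[] b) {
        byte[] out = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = (byte) Math.max(a[i], b[i]);
        }
        return out;
    }
}
```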

Count-Min Sketch#

Different problem: how many times did a specific item appear? Not unique count, but frequency.

Count-Min Sketch uses multiple hash functions, each with its own row of counters. To increment, hash the item with each function and increment the corresponding counter in each row. To query, take the minimum across all rows: collisions can only inflate a counter, so the minimum is the least-contaminated estimate.

// How many times was this item accessed?
public void increment(String item) {
    for (int i = 0; i < hashCount; i++) {
        // floorMod keeps the index non-negative even if the hash is negative
        int index = Math.floorMod(hash(item, i), width);
        counters[i][index]++;
    }
}

public long estimate(String item) {
    long min = Long.MAX_VALUE;
    for (int i = 0; i < hashCount; i++) {
        int index = Math.floorMod(hash(item, i), width);
        min = Math.min(min, counters[i][index]);
    }
    return min; // never underestimates, may overestimate
}

Use case: find the top-K most accessed items without storing every item’s count individually.
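One way to wire that up, sketched with hypothetical names: after each increment, feed the sketch’s current estimate into a small candidate set that keeps only the K largest estimates.

```java
import java.util.*;

// Hypothetical top-K tracker paired with a Count-Min Sketch:
// the sketch supplies estimated counts; this keeps the K best candidates.
public class TopK {
    private final int k;
    private final Map<String, Long> candidates = new HashMap<>();

    public TopK(int k) { this.k = k; }

    // Call after each sketch increment with the item's current estimate.
    public void offer(String item, long estimatedCount) {
        candidates.put(item, estimatedCount);
        if (candidates.size() > k) {
            // Evict the candidate with the smallest estimated count
            String min = Collections.min(candidates.entrySet(),
                    Map.Entry.comparingByValue()).getKey();
            candidates.remove(min);
        }
    }

    public Set<String> top() { return candidates.keySet(); }
}
```

Memory stays bounded by the sketch plus K candidates, no matter how many distinct items stream past.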

When Exact Counts Matter#

Not every count can be approximate. Financial transactions? Exact. Billing? Exact. User-facing “47 people liked this”? Nobody notices if it’s actually 46 or 48.

At Oracle, we tracked unique sessions hitting our NSSF endpoints for capacity planning. Exact counting meant a growing HashSet that ate memory during traffic spikes. Switching to HyperLogLog in Redis gave us stable 12KB per endpoint counter. Accuracy was within 1% of exact counts we spot-checked. Good enough for capacity decisions.

This is the same family of probabilistic data structures as bloom filters: trade a small amount of accuracy for massive memory savings.

What I’m Learning#

Approximate counting is a mindset shift. Engineers default to exact answers. But at scale, exact is expensive and often unnecessary. The question isn’t “how many exactly?” It’s “do I need to know it’s 1,847,293 or is knowing it’s roughly 1.85 million sufficient?”

For analytics, capacity planning, and display counts, approximate is almost always good enough.

What counting problems have you solved with probabilistic data structures?