Downsampling: Keeping Trends, Not Every Data Point
Your monitoring system stores CPU usage every second. That’s 86,400 data points per day per metric. For 1,000 metrics across 200 services, you’re generating 17 billion data points per day. Storage isn’t free, and nobody will ever look at per-second data from three months ago.
But you can’t just delete it. “What was our error rate trend last quarter?” is a legitimate question. You need the trend without the granularity.
Tiered Retention
Keep full resolution for recent data. Aggregate older data into coarser buckets.
- Last 24 hours: 1-second resolution (full detail)
- Last 7 days: 1-minute averages
- Last 30 days: 5-minute averages
- Last year: 1-hour averages
- Beyond a year: 1-day averages
Each tier shrinks storage by the ratio of its bucket sizes. 1-second to 1-minute is a 60x reduction; 1-minute to 1-hour is another 60x. Chained together, that's 3,600x less data for year-old metrics.
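The tier math above is worth making concrete. A quick back-of-envelope sketch using the bucket sizes from the list (the 17 billion figure from the intro falls out of the same arithmetic):

```python
# Points per stream per day at each retention tier's resolution.
SECONDS_PER_DAY = 86_400

tiers = {
    "1-second": 1,
    "1-minute": 60,
    "5-minute": 300,
    "1-hour": 3_600,
    "1-day": 86_400,
}

for name, bucket_seconds in tiers.items():
    points = SECONDS_PER_DAY // bucket_seconds
    reduction = SECONDS_PER_DAY // points  # vs. 1-second resolution
    print(f"{name:>8}: {points:>6} points/day/stream ({reduction}x reduction)")

# 200 services x 1,000 metrics at 1-second resolution:
raw_points_per_day = 200 * 1_000 * SECONDS_PER_DAY
print(f"raw total: {raw_points_per_day:,} points/day")  # 17,280,000,000
```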
Aggregation Isn’t Just Averages
This tripped me up. If you only store averages, you lose critical information. Average latency was 50ms. Great. But was P99 200ms or 2000ms? You’ll never know.
Store multiple aggregates per bucket: min, max, average, count, and ideally P50/P95/P99. It costs more storage than a single average, but still orders of magnitude less than raw data.
Something like this, in PostgreSQL flavor (the original snippet mixed MySQL and PostgreSQL functions; the count column is named `sample_count` because COUNT is a reserved word):

```sql
INSERT INTO metrics_1min
    (metric_name, bucket_start, avg_val, min_val, max_val, sample_count, p99_val)
SELECT
    metric_name,
    date_trunc('minute', "timestamp") AS bucket_start,
    AVG(value), MIN(value), MAX(value), COUNT(*),
    -- exact P99 over the raw samples in each bucket
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY value)
FROM metrics_raw
WHERE "timestamp" >= ? AND "timestamp" < ?
GROUP BY metric_name, bucket_start;
```

Note the half-open time range (`>= ? AND < ?`): `BETWEEN` is inclusive on both ends, so consecutive runs would double-count the boundary second.
When Downsampling Lies
A 5-minute outage is clearly visible in 1-second data. In 1-hour averages, it’s a barely noticeable dip. The average smooths out the spike. This is why you keep full resolution for recent data and only downsample data old enough that trends are all you need.
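A toy example of the smoothing effect (made-up numbers: 50ms baseline latency, a 5-minute spike to 5000ms inside one hour of per-second samples):

```python
# One hour of 1-second latency samples with a 5-minute outage in the middle.
baseline_ms, spike_ms = 50.0, 5000.0
samples = [baseline_ms] * 3600
for i in range(1200, 1500):  # a 300-second spike window
    samples[i] = spike_ms

hourly_avg = sum(samples) / len(samples)
print(f"peak raw sample: {max(samples):.0f} ms")   # 5000 ms -- obvious in raw data
print(f"hourly average:  {hourly_avg:.1f} ms")     # 462.5 ms -- spike mostly gone
```

Note that the 5000ms peak survives if you store `max_val` per bucket, which is exactly why the previous section says to keep min/max alongside the average.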
SLO calculations are especially sensitive. Your monthly error budget math needs accurate error counts. If you’ve downsampled the error rate to hourly averages, you can’t reconstruct the exact number of failed requests. Keep raw error event data longer than other metrics, or pre-compute the SLO numbers before downsampling.
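Here's why averaged rates break error-budget math, with made-up traffic numbers: averaging per-hour error rates weights a quiet hour the same as a busy one, so you can't recover the true failed-request count. Storing counts (sums) instead of rates avoids the problem, since sums survive re-aggregation.

```python
# Two hours: a busy hour at a 5% error rate, a quiet hour at 1%.
hours = [
    {"requests": 1_000_000, "errors": 50_000},  # busy hour, 5%
    {"requests": 10_000, "errors": 100},        # quiet hour, 1%
]

# Downsampled view where only the per-hour *rate* survives:
rates = [h["errors"] / h["requests"] for h in hours]  # [0.05, 0.01]
naive_errors = sum(rates) / len(rates) * sum(h["requests"] for h in hours)

# Ground truth, recoverable only if counts were stored:
true_errors = sum(h["errors"] for h in hours)

print(f"reconstructed from avg rate: {naive_errors:,.0f}")  # 30,300 -- wrong
print(f"actual failed requests:      {true_errors:,}")      # 50,100
```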
Running the Downsampler
A background job runs periodically: scan raw data older than 24 hours, aggregate into 1-minute buckets, delete the raw data. Same pattern as compaction in LSM-trees: merge small units into larger, denser ones.
The job itself should be idempotent. If it crashes mid-aggregation and reruns, it should produce the same results. Aggregating already-aggregated data would corrupt your numbers.
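One way to get that idempotency (a sketch, not the job described above; schema is hypothetical, SQLite used so it runs anywhere): key the aggregate table by `(metric_name, bucket_start)`, recompute buckets only from raw data, and wrap the upsert plus the raw-data delete in a single transaction. A crash rolls both back; a rerun overwrites the same buckets with the same values.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE metrics_raw (metric_name TEXT, ts INTEGER, value REAL);
    CREATE TABLE metrics_1min (
        metric_name TEXT, bucket_start INTEGER,
        avg_val REAL, min_val REAL, max_val REAL, sample_count INTEGER,
        PRIMARY KEY (metric_name, bucket_start)
    );
""")
# Two minutes of fake per-second data.
db.executemany("INSERT INTO metrics_raw VALUES ('cpu', ?, ?)",
               [(t, float(t % 7)) for t in range(0, 120)])

def downsample(conn, start, end):
    with conn:  # one transaction: a partial run rolls back entirely
        conn.execute("""
            INSERT OR REPLACE INTO metrics_1min
            SELECT metric_name, (ts / 60) * 60,
                   AVG(value), MIN(value), MAX(value), COUNT(*)
            FROM metrics_raw
            WHERE ts >= ? AND ts < ?
            GROUP BY metric_name, (ts / 60) * 60
        """, (start, end))
        conn.execute("DELETE FROM metrics_raw WHERE ts >= ? AND ts < ?",
                     (start, end))

downsample(db, 0, 120)
downsample(db, 0, 120)  # rerun: raw rows already gone, buckets unchanged
rows = db.execute(
    "SELECT bucket_start, sample_count FROM metrics_1min ORDER BY 1").fetchall()
print(rows)  # [(0, 60), (60, 60)] -- no double-counting on rerun
```

The key point is that the aggregate is always computed from raw rows only, never from existing aggregates, so replaying the job can't compound numbers.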
At Oracle, we had exactly this problem with operational logs. The NSSF system logged every config operation with full request/response payloads. Useful for debugging, but the log tables grew by gigabytes per week. We set up tiered retention: full logs for 7 days, summary stats (operation type, duration, result) for 90 days, daily counts beyond that. The tricky part was keeping full detail for error cases longer than success cases. Failed operations stayed at full resolution for 30 days because that’s what you actually need when investigating incidents.
What I’m Learning
Downsampling is a storage cost vs information fidelity trade-off. The mistake is thinking it’s just about averages. Keep percentiles, keep extremes, keep counts. And always keep error data at higher resolution than success data, because errors are what you investigate. The tiered approach connects to compaction in databases: both are about rewriting data into a more efficient form.
What retention policies do you use for your metrics, and have you ever regretted downsampling too aggressively?