Content Fingerprinting: Detecting Near-Duplicates at Scale
Two documents differ by one paragraph. They’re not identical, so their SHA-256 hashes are completely different. But they’re 95% the same content. How do you detect that without comparing every pair?
At scale, pairwise comparison is infeasible. A million documents means roughly 500 billion pairs (n(n-1)/2). You need a shortcut.
SimHash#
The trick: build a hash where similar inputs produce similar outputs. Regular hashes do the opposite (tiny change, completely different hash). SimHash preserves similarity.
The algorithm: break content into features (words, n-grams). Hash each feature. For each bit position, if the feature’s hash bit is 1, add the feature’s weight; if 0, subtract it. The sign at each position becomes one bit of the final fingerprint: positive means 1, otherwise 0. (With unit weights, as below, you just add or subtract 1 per token.)
```java
// Assumes tokenize(...) splits content into features and murmurHash(...)
// is a 64-bit hash such as MurmurHash3; every token gets unit weight.
public long simhash(String content) {
    int[] bitCounts = new int[64];
    for (String token : tokenize(content)) {
        long hash = murmurHash(token);
        for (int i = 0; i < 64; i++) {
            // vote +1 if this token's hash has bit i set, else -1
            bitCounts[i] += ((hash >> i) & 1) == 1 ? 1 : -1;
        }
    }
    long fingerprint = 0;
    for (int i = 0; i < 64; i++) {
        if (bitCounts[i] > 0) fingerprint |= (1L << i);  // keep the sign of each position
    }
    return fingerprint;
}
```
Two similar documents produce fingerprints that differ in only a few bits. Measure similarity with Hamming distance (count differing bits). Hamming distance of 3 out of 64 bits? Very similar. Distance of 30? Completely different.
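The `simhash` above leans on `tokenize` and `murmurHash` helpers it doesn’t define. Here’s a self-contained sketch of the same idea, using whitespace tokenization and FNV-1a as a stand-in 64-bit hash (both are my substitutions, not part of the original): a one-word change should yield a much smaller Hamming distance than an unrelated document.

```java
class SimHashDemo {
    // FNV-1a 64-bit: a simple stand-in for MurmurHash in this sketch.
    static long fnv1a64(String s) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }

    static long simhash(String content) {
        int[] bitCounts = new int[64];
        for (String token : content.toLowerCase().split("\\s+")) {
            long hash = fnv1a64(token);
            for (int i = 0; i < 64; i++) {
                bitCounts[i] += ((hash >> i) & 1) == 1 ? 1 : -1;
            }
        }
        long fingerprint = 0;
        for (int i = 0; i < 64; i++) {
            if (bitCounts[i] > 0) fingerprint |= (1L << i);
        }
        return fingerprint;
    }

    // Hamming distance: XOR the fingerprints, count the set bits.
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    public static void main(String[] args) {
        String doc1 = "the quick brown fox jumps over the lazy dog near the river bank";
        String doc2 = "the quick brown fox jumps over the lazy cat near the river bank";
        String doc3 = "completely unrelated text about database index compaction strategies";

        long f1 = simhash(doc1), f2 = simhash(doc2), f3 = simhash(doc3);
        System.out.println("one word changed: " + hammingDistance(f1, f2));
        System.out.println("unrelated:        " + hammingDistance(f1, f3));
    }
}
```

`Long.bitCount(a ^ b)` is the whole comparison: one XOR plus a popcount instruction, regardless of document size.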
The Probabilistic Family#
SimHash sits alongside Bloom filters and HyperLogLog. All three trade exact answers for massive space savings. Bloom filters answer “have I seen this before?” HyperLogLog answers “how many unique items?” SimHash answers “how similar are these two items?”
A 64-bit SimHash fingerprint replaces comparing entire documents. Store the fingerprint, not the content. Compare fingerprints, not documents.
At Oracle, we had a problem with near-identical NSSF configuration payloads from different network functions. Slightly different metadata, same actual config. These duplicates were clogging our processing pipeline. We added SimHash fingerprinting at ingestion: hash the config payload, compare against recent fingerprints. Hamming distance under 5? Flag as likely duplicate and skip redundant processing. Cut processing volume by roughly 30% during bulk registration events.
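A minimal sketch of that ingestion-time check: keep a bounded window of recent fingerprints and flag anything within a small Hamming distance. The class name, window size, and eviction policy here are illustrative assumptions, not the actual pipeline.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical ingestion gate: flags payloads whose fingerprint is within
// maxDistance bits of any fingerprint seen in the recent window.
class DuplicateGate {
    private final Deque<Long> recent = new ArrayDeque<>();
    private final int windowSize;
    private final int maxDistance;

    DuplicateGate(int windowSize, int maxDistance) {
        this.windowSize = windowSize;
        this.maxDistance = maxDistance;
    }

    /** Returns true if the fingerprint is a likely duplicate of a recent one. */
    boolean isLikelyDuplicate(long fingerprint) {
        for (long seen : recent) {
            if (Long.bitCount(seen ^ fingerprint) <= maxDistance) {
                return true;  // within threshold: skip redundant processing
            }
        }
        recent.addLast(fingerprint);
        if (recent.size() > windowSize) recent.removeFirst();  // evict oldest
        return false;
    }
}
```

A linear scan over the window is fine for a few thousand recent fingerprints; at larger scale you’d bucket fingerprints (e.g. by bit-range permutations) so candidates can be found without scanning everything.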
What I’m Learning#
Content fingerprinting changes the question from “are these identical?” to “are these similar enough?” That shift matters at scale. You can’t afford to compare everything against everything. But a 64-bit fingerprint per document lets you find near-duplicates in constant time per comparison.
The threshold (how many differing bits still counts as “similar enough”) is the key design decision. Too strict and you miss duplicates. Too loose and you get false matches.
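To make the tradeoff concrete, here are hand-picked fingerprints at known distances (XOR with a mask flips exactly the mask’s set bits, so the distances are by construction):

```java
class ThresholdDemo {
    static int distance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    public static void main(String[] args) {
        long base    = 0x0123456789ABCDEFL;
        long nearDup = base ^ 0b111L;            // flips 3 bits
        long related = base ^ 0x0F0F0F000000L;   // flips 12 bits

        System.out.println(distance(base, nearDup));  // 3
        System.out.println(distance(base, related));  // 12
        // A threshold of 5 flags only nearDup; raising it to 15 also
        // flags 'related' -- and with it, anything merely similar.
    }
}
```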
What near-duplicate problems have you encountered in your systems?