Content-Addressable Storage
Two users upload the same 50 MB file. Naive storage keeps two copies. Content-addressable storage keeps one.
What “Content-Addressable” Means
Instead of locating data by where it lives (a path, a filename), you locate it by what it is. Hash the content, use the hash as the key. Same content, same hash, same storage location. SHA-256 a file and store the result as its address.
The practical consequence: deduplication becomes automatic. If block X already exists in your blob store, you just store a reference. No copy. No extra bytes.
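A minimal sketch of the idea in Python. The in-memory dict stands in for a real blob store, and the class and method names are illustrative, not from any particular system:

```python
import hashlib

class BlobStore:
    """Content-addressed store: the SHA-256 of the bytes is the key."""

    def __init__(self):
        self._blobs = {}  # hex digest -> content bytes

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        # Identical content hashes to an identical key, so a duplicate
        # upload stores nothing new -- only a reference (the key) matters.
        self._blobs.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

store = BlobStore()
a = store.put(b"the same 50 MB file")
b = store.put(b"the same 50 MB file")  # second upload: no extra bytes
assert a == b
assert len(store._blobs) == 1          # one copy, two references
```

The caller keeps the returned key as the file's address; anyone holding the same bytes computes the same address.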
Chunking First, Then Hashing
Hashing whole files is a start, but the bigger win comes from splitting files into fixed-size blocks (say, 4 MB) before hashing. Two files sharing a 4 MB chunk in the middle deduplicate that chunk even if the files are otherwise different. Upload a revised document with one block changed and only the changed blocks get transferred. (One caveat: with fixed-size chunks, an insertion early in the file shifts every later chunk boundary and defeats the deduplication, which is why tools like rsync and restic use content-defined chunking instead.)
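A sketch of chunk-then-hash, continuing the dict-as-blob-store convention from before. The returned hash list acts as the file's "recipe": enough to reassemble it from the chunk store. The tiny 4-byte chunk size in the demo is just to keep the example visible:

```python
import hashlib

def store_file(store: dict[str, bytes], data: bytes,
               chunk_size: int = 4 * 1024 * 1024) -> list[str]:
    """Split data into fixed-size chunks, store only unseen ones,
    and return the ordered list of chunk hashes."""
    recipe = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        h = hashlib.sha256(chunk).hexdigest()
        store.setdefault(h, chunk)  # duplicate chunk: reference only, no copy
        recipe.append(h)
    return recipe

store: dict[str, bytes] = {}
v1 = store_file(store, b"AAAA" + b"BBBB" + b"CCCC", chunk_size=4)
v2 = store_file(store, b"AAAA" + b"XXXX" + b"CCCC", chunk_size=4)  # one block edited
assert len(store) == 4                  # only the changed middle chunk is new
assert v1[0] == v2[0] and v1[2] == v2[2] and v1[1] != v2[1]
```

Reassembly is the reverse: walk the recipe and concatenate `store[h]` for each hash.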
At Salesforce
Attachment storage was path-based when I worked on it. Files lived at org_id/record_id/filename. Same presentation deck uploaded by 400 sales reps across the same org: 400 copies. After an internal audit found about 38% of stored bytes were exact duplicates, the team moved to content-addressed blob storage. Storage costs dropped noticeably, though I can’t share the exact figures.
The metadata layer still maps org + record + filename to a hash, so the user experience doesn’t change. The deduplication is invisible.
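A sketch of that indirection, with illustrative names only (the real schema was certainly more involved than two dicts):

```python
import hashlib

blobs: dict[str, bytes] = {}                 # hash -> content, shared
index: dict[tuple[str, str, str], str] = {}  # (org, record, filename) -> hash

def upload(org: str, record: str, filename: str, data: bytes) -> None:
    h = hashlib.sha256(data).hexdigest()
    blobs.setdefault(h, data)            # deduped across records
    index[(org, record, filename)] = h   # user-visible path still works

def download(org: str, record: str, filename: str) -> bytes:
    return blobs[index[(org, record, filename)]]

upload("org1", "rec1", "deck.pptx", b"slides")
upload("org1", "rec2", "deck.pptx", b"slides")  # another rep, same bytes
assert len(blobs) == 1                          # one physical copy
assert download("org1", "rec2", "deck.pptx") == b"slides"
```

Users still address files by path; only the layer underneath changed.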
What I’m Learning
The interesting edge case: garbage collection. If you delete a file, you can’t immediately delete the chunks because other files might reference them. You need reference counting or a periodic sweep. It’s more complex than path-based deletion.
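The reference-counting variant can be sketched in a few lines. This is a toy under the assumptions above (chunks keyed by hash, files described by hash lists); a production system would persist the counts and guard against races between upload and delete:

```python
from collections import Counter

chunks: dict[str, bytes] = {}   # hash -> chunk bytes
refcount: Counter = Counter()   # hash -> number of files referencing it

def add_file(recipe: list[str]) -> None:
    for h in recipe:
        refcount[h] += 1

def delete_file(recipe: list[str]) -> None:
    for h in recipe:
        refcount[h] -= 1
        if refcount[h] == 0:      # no surviving file references this chunk
            del refcount[h]
            chunks.pop(h, None)   # only now is it safe to reclaim the bytes

chunks.update({"a": b"1", "b": b"2"})
add_file(["a", "b"])     # file 1
add_file(["a"])          # file 2 shares chunk "a"
delete_file(["a", "b"])  # delete file 1
assert "a" in chunks     # still referenced by file 2
assert "b" not in chunks # orphaned, reclaimed
```

The periodic-sweep alternative trades this bookkeeping for a mark phase: walk every live file's recipe, then delete any chunk not marked.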
Have you worked with content-addressed storage, or built something similar from scratch?