Two users upload the same 50 MB file. Naive storage keeps two copies. Content-addressable storage keeps one.

## What “Content-Addressable” Means

Instead of locating data by where it lives (a path, a filename), you locate it by what it is. Hash the content, use the hash as the key. Same content, same hash, same storage location. SHA-256 a file and store the result as its address.

The practical consequence: deduplication becomes automatic. If block X already exists in your blob store, you just store a reference. No copy. No extra bytes.
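Here's the core idea as a minimal sketch. An in-memory dict stands in for the blob store; a real system would put S3, a disk-backed KV store, or similar behind the same interface.

```python
import hashlib

class ContentAddressedStore:
    """Minimal sketch: a dict stands in for the real blob store."""

    def __init__(self):
        self.blobs: dict[str, bytes] = {}  # hex digest -> content

    def put(self, content: bytes) -> str:
        """Store content under its SHA-256 digest. Dedup falls out
        for free: same bytes, same digest, same key."""
        address = hashlib.sha256(content).hexdigest()
        if address not in self.blobs:  # already stored? skip the write
            self.blobs[address] = content
        return address

    def get(self, address: str) -> bytes:
        return self.blobs[address]

store = ContentAddressedStore()
first = store.put(b"the same 50 MB deck")   # writes the bytes
second = store.put(b"the same 50 MB deck")  # no-op, same address back
assert first == second and len(store.blobs) == 1
```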

## Chunking First, Then Hashing

Smarter than hashing whole files: split them into fixed-size blocks (say, 4 MB) before hashing. Two files that share an aligned 4 MB chunk deduplicate that chunk even if they differ everywhere else. Upload a revised document with one section changed in place and only the changed blocks get transferred. (The caveat: an insertion shifts every byte after it, which moves every downstream chunk boundary and breaks the dedup. Content-defined chunking exists to fix that; fixed-size is the simpler place to start.)

```mermaid
graph TD
    A[File Upload] --> B[Split into 4MB Chunks]
    B --> C[Hash Each Chunk]
    C --> D{Chunk Exists in Store?}
    D -->|Yes| E[Store Reference Only]
    D -->|No| F[Write Chunk to Blob Store]
    E --> G[Write Metadata Record]
    F --> G
    style A fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style B fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style C fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style D fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style E fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style F fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style G fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
```
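That flow fits in a few lines. This builds on the `ContentAddressedStore` sketch above; the manifest, an ordered list of chunk addresses, plays the role of the metadata record.

```python
CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB, matching the diagram

def write_file(store: ContentAddressedStore, data: bytes) -> list[str]:
    """Split into fixed-size chunks and store each one. The returned
    manifest reconstructs the file in order; put() already skips
    chunks the store has seen before."""
    return [
        store.put(data[offset:offset + CHUNK_SIZE])
        for offset in range(0, len(data), CHUNK_SIZE)
    ]

def read_file(store: ContentAddressedStore, manifest: list[str]) -> bytes:
    """Reassemble a file by fetching its chunks in manifest order."""
    return b"".join(store.get(address) for address in manifest)
```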

## At Salesforce

Attachment storage was path-based when I worked on it. Files lived at org_id/record_id/filename. Same presentation deck uploaded by 400 sales reps in the same org: 400 copies. After an internal audit found about 38% of stored bytes were exact duplicates, the team moved to content-addressed blob storage. Storage costs dropped noticeably, though I can’t share the exact figures.

The metadata layer still maps org + record + filename to the underlying chunk hashes, so the user experience doesn’t change. The deduplication is invisible.
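In sketch form, that mapping might look like this. The schema is hypothetical (I’m not reproducing Salesforce’s actual one), and it reuses `write_file`, `read_file`, and `store` from above.

```python
# Hypothetical metadata layer. Users keep addressing files by
# org/record/filename; only this table knows chunk hashes exist.
manifests: dict[tuple[str, str, str], list[str]] = {}

def save(org_id: str, record_id: str, filename: str, data: bytes) -> None:
    manifests[(org_id, record_id, filename)] = write_file(store, data)

def load(org_id: str, record_id: str, filename: str) -> bytes:
    return read_file(store, manifests[(org_id, record_id, filename)])
```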

## What I’m Learning

The interesting edge case: garbage collection. If you delete a file, you can’t immediately delete the chunks because other files might reference them. You need reference counting or a periodic sweep. It’s more complex than path-based deletion.
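The sweep variant is the easier one to sketch, assuming the `manifests` table above: a chunk is live if and only if some manifest still points at it.

```python
def sweep(store: ContentAddressedStore,
          manifests: dict[tuple[str, str, str], list[str]]) -> int:
    """Mark-and-sweep GC: collect every address reachable from a
    manifest, delete the rest. A real system must also fence against
    uploads racing the sweep: a chunk written but not yet referenced
    would look dead here."""
    live = {addr for manifest in manifests.values() for addr in manifest}
    dead = set(store.blobs) - live
    for addr in dead:
        del store.blobs[addr]
    return len(dead)
```

Reference counting trades the full scan for per-delete bookkeeping, but then a crash between the decrement and the delete becomes your new edge case.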

Have you worked with content-addressed storage, or built something similar from scratch?