You save a 200 MB file. One word changed. Re-uploading 200 MB to sync that change is absurd. Delta sync is how you avoid it.

## The Core Idea

Split the file into blocks. On an update, compare the new version’s blocks against the stored version’s blocks. Transfer only the blocks that changed.
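A minimal sketch of that compare step, using fixed-size blocks and SHA-256 (the block size and helper names here are illustrative, not any particular tool's API):

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative; real systems tune this

def block_hashes(data: bytes) -> list[str]:
    """Hash each fixed-size block of the file."""
    return [
        hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
        for i in range(0, len(data), BLOCK_SIZE)
    ]

def changed_blocks(old: bytes, new: bytes) -> list[int]:
    """Indices of blocks in `new` that differ from the stored version."""
    old_h, new_h = block_hashes(old), block_hashes(new)
    return [
        i for i, h in enumerate(new_h)
        if i >= len(old_h) or old_h[i] != h
    ]

old = b"a" * 10_000
new = bytearray(old)
new[5000] = ord("b")  # one byte changed, inside block 1
print(changed_blocks(old, bytes(new)))  # → [1]
```

Only block 1 needs to go over the wire; blocks 0 and 2 are skipped.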

Rsync pioneered this. The receiving side splits its copy of the file into fixed-size blocks and computes two checksums per block: a cheap rolling checksum and a strong hash. It sends those to the sender, which slides the rolling checksum over its version of the file byte by byte, confirms candidate matches with the strong hash, and transmits only the data that matches no block.
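The rolling checksum is what makes the byte-by-byte slide affordable: it can be updated in O(1) as the window moves, instead of rehashing the whole block. A sketch of an rsync-style weak checksum (simplified; real rsync packs `a` and `b` into one 32-bit value and pairs it with a strong hash):

```python
M = 1 << 16  # checksum components are taken mod 2^16

def weak_checksum(block: bytes) -> tuple[int, int]:
    """rsync-style weak checksum: (a, b) components over one block."""
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b

def roll(a: int, b: int, out_byte: int, in_byte: int, size: int) -> tuple[int, int]:
    """Slide the window one byte forward in O(1)."""
    a = (a - out_byte + in_byte) % M          # drop outgoing, add incoming
    b = (b - size * out_byte + a) % M          # adjust the weighted sum
    return a, b

data = bytes(range(100))
a, b = weak_checksum(data[0:16])
a, b = roll(a, b, data[0], data[16], 16)
print((a, b) == weak_checksum(data[1:17]))  # → True
```

Rolling one byte gives the same result as recomputing the checksum from scratch, which is the whole trick.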

## The Block Boundary Problem

Fixed-size blocks have a nasty failure mode. Insert text at the start of a document and every block shifts. The block boundaries misalign, nothing matches, you’re back to transferring the whole file.
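The failure is easy to demonstrate: prepend a single byte and every fixed-size block hash changes (block size and data here are illustrative):

```python
import hashlib
import random

BLOCK = 4096
random.seed(1)
original = bytes(random.randrange(256) for _ in range(4 * BLOCK))
inserted = b"!" + original  # one byte prepended shifts every boundary

def hashes(data: bytes) -> set[str]:
    return {hashlib.sha256(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)}

# Not a single block hash survives the one-byte insertion.
print(len(hashes(original) & hashes(inserted)))  # → 0
```

One byte of change, and a naive fixed-block comparison sees a completely different file.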

Variable-length chunking with Rabin fingerprinting solves this. Instead of cutting every N bytes, you cut at content-defined boundaries based on a rolling hash. Insertions only affect the surrounding chunks. The rest of the file’s chunk boundaries stay stable.
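A sketch of content-defined chunking, with a simple polynomial rolling hash standing in for a true Rabin fingerprint (the window size, mask, and constants are illustrative; real implementations also enforce minimum and maximum chunk sizes):

```python
import random

WINDOW = 48           # rolling-hash window (illustrative)
MASK = (1 << 11) - 1  # cut when low 11 bits are zero → ~2 KiB average chunks
PRIME = 31
MOD = 1 << 32

def chunk_boundaries(data: bytes) -> list[int]:
    """Cut wherever the rolling hash of the last WINDOW bytes hits the mask."""
    bounds, h = [], 0
    pow_out = pow(PRIME, WINDOW - 1, MOD)  # factor to remove the outgoing byte
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * pow_out) % MOD
        h = (h * PRIME + byte) % MOD
        if i >= WINDOW and (h & MASK) == 0:
            bounds.append(i + 1)
    bounds.append(len(data))
    return bounds

random.seed(0)
data = bytes(random.randrange(256) for _ in range(100_000))
orig = chunk_boundaries(data)
shifted = set(chunk_boundaries(b"INSERTED" + data))
# Every original boundary reappears, offset by the 8 inserted bytes:
print(all(b + 8 in shifted for b in orig))  # → True
```

Because each cut depends only on the last `WINDOW` bytes, the boundaries resynchronize right after the insertion, so only the first chunk or two change.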

```mermaid
graph TD
    A[File Modified] --> B[Compute Block Checksums]
    B --> C[Compare Against Stored Checksums]
    C --> D{Which Blocks Changed?}
    D --> E[Transfer Changed Blocks Only]
    D --> F[Skip Unchanged Blocks]
    E --> G[Reconstruct File on Server]
    F --> G
    style A fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style B fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style C fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style D fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style E fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style F fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style G fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
```

## At Oracle

We had a scheduled job that synced large configuration exports to a downstream system nightly. Each export file was around 80 MB. For months it re-uploaded the whole file every run, even when only a handful of rows changed. I added checksum comparison at the block level, transferring only the changed segments. Network time dropped from about 4 minutes to under 15 seconds on a typical run. Not revolutionary, but the difference was obvious and required zero changes to the downstream system.

## What I’m Learning

Delta sync pairs well with content-addressable storage: changed blocks get new hashes and are uploaded, while unchanged blocks hash to entries the store already holds and are simply referenced. Sync decides what to transfer; content addressing decides what to keep, and it deduplicates across files for free.
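A toy sketch of that pairing, with a dict standing in for the content-addressable store (all names here are hypothetical):

```python
import hashlib

store: dict[str, bytes] = {}  # content-addressable store: hash → block

def put_blocks(blocks: list[bytes]) -> tuple[list[str], int]:
    """Store blocks by content hash; return the file's block refs and upload count."""
    refs, uploaded = [], 0
    for block in blocks:
        h = hashlib.sha256(block).hexdigest()
        if h not in store:      # only never-seen content is uploaded
            store[h] = block
            uploaded += 1
        refs.append(h)
    return refs, uploaded

v1 = [b"config-a", b"config-b", b"config-c"]
refs1, up1 = put_blocks(v1)  # first version: all 3 blocks upload
v2 = [b"config-a", b"config-B", b"config-c"]
refs2, up2 = put_blocks(v2)  # only the changed middle block uploads
print(up1, up2)  # → 3 1
```

The second version of the file is just a new list of references, two of which point at blocks already in the store.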

Have you hit the block boundary problem in practice, or used variable chunking?