You need to count word frequencies across 10TB of text. One machine with 16GB RAM can’t even load the data. But 100 machines can each handle 100GB. The problem isn’t the computation. It’s the coordination.

MapReduce gives you a framework: you write two functions, the framework handles the rest.

The Three Phases

Map: Each worker processes its chunk independently. Input: a slice of data. Output: key-value pairs.

Shuffle: The framework groups all values by key and routes them to the right reducer. All pairs with key “distributed” go to the same reducer, regardless of which mapper emitted them.

Reduce: Each reducer gets one key and all its values. Aggregate them into a final result.

import java.util.ArrayList;
import java.util.List;
import org.apache.commons.lang3.tuple.Pair;

// Map: emit (word, 1) for each word in the document
public List<Pair<String, Integer>> map(String document) {
    List<Pair<String, Integer>> output = new ArrayList<>();
    for (String word : document.toLowerCase().split("\\s+")) {
        if (!word.isEmpty()) {  // split can yield empty strings on leading whitespace
            output.add(Pair.of(word, 1));
        }
    }
    return output;
}

// Reduce: sum all counts for one word
public Pair<String, Integer> reduce(String key, List<Integer> values) {
    return Pair.of(key, values.stream().mapToInt(Integer::intValue).sum());
}
graph TD
    I1[Input Split 1] --> M1[Map Worker 1]
    I2[Input Split 2] --> M2[Map Worker 2]
    I3[Input Split 3] --> M3[Map Worker 3]
    M1 --> S[Shuffle + Sort by Key]
    M2 --> S
    M3 --> S
    S --> R1["Reduce: 'cache' → 847"]
    S --> R2["Reduce: 'index' → 1203"]
    S --> R3["Reduce: 'shard' → 429"]
    style I1 fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style I2 fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style I3 fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style M1 fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style M2 fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style M3 fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style S fill:#000000,stroke:#ff0000,stroke-width:2px,color:#fff
    style R1 fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style R2 fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff
    style R3 fill:#000000,stroke:#00ff00,stroke-width:2px,color:#fff

The Shuffle Is the Bottleneck

Map is embarrassingly parallel: each chunk is independent. Reduce is embarrassingly parallel: each key is independent. But shuffle moves data between machines. If your mappers produce a billion key-value pairs, all of those need to be sorted, grouped, and sent to the right reducer over the network.

This is why the sharding strategy for intermediate keys matters. Typically a hash of the key assigns it to a reducer; consistent hashing is another option. Either way, hot keys (very common words) concentrate load on a single reducer, the same problem as hot partitions in databases.
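In practice the routing is often just a hash of the key modulo the reducer count (Hadoop's default HashPartitioner works this way). A minimal sketch; the class and method names here are illustrative, not a real framework API:

```java
// Sketch of hash partitioning for intermediate keys.
public class Partitioner {
    // Route a key to one of numReducers reducers.
    // Masking with Integer.MAX_VALUE keeps the result non-negative
    // even when hashCode() is negative.
    public static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}
```

Because every mapper computes the same function, all pairs for "distributed" land on the same reducer without any coordination. It also shows why a hot key hurts: it always maps to exactly one reducer, no matter how many pairs carry it.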

Beyond Word Count

Building an inverted index is the canonical MapReduce application. Map emits (term, docId) for each word in each document. Reduce collects all docIds for each term into a posting list. The output is a searchable index built from data distributed across hundreds of machines.
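A minimal sketch of that job, assuming a tiny local Pair record in place of whatever pair type the framework provides:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Sketch of the inverted-index job described above.
public class InvertedIndex {
    record Pair(String term, String docId) {}

    // Map: emit (term, docId) for each word in one document.
    public static List<Pair> map(String docId, String document) {
        List<Pair> output = new ArrayList<>();
        for (String word : document.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) output.add(new Pair(word, docId));
        }
        return output;
    }

    // Reduce: collect all docIds for one term into a sorted,
    // de-duplicated posting list.
    public static List<String> reduce(String term, List<String> docIds) {
        return new ArrayList<>(new TreeSet<>(docIds));
    }
}
```

It is the same shape as word count; only the emitted value changes, from a count of 1 to a document id.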

Other uses: log analysis (map: extract error types, reduce: count per type), ETL pipelines, generating aggregate reports, training ML models on distributed data.

At Oracle, generating compliance reports across all NSSF configurations was a single-threaded Java job. It loaded every config, extracted metrics, and aggregated by region. Took over two hours as the config count grew. We restructured it as a map step (extract metrics per config, parallelized across a thread pool) and a reduce step (aggregate by region using ConcurrentHashMap merge). Same logic, same output, but the map phase ran across 8 threads. Report generation dropped from 2+ hours to about 20 minutes. Not Hadoop, just the pattern applied locally.
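A sketch of the shape of that restructuring; Config, the region field, and the metric extraction are hypothetical stand-ins for the real NSSF types:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Local map/reduce: map across a thread pool, reduce via
// ConcurrentHashMap.merge. Names are illustrative.
public class ReportJob {
    record Config(String region, long metric) {}

    public static Map<String, Long> aggregate(List<Config> configs, int threads) {
        ConcurrentHashMap<String, Long> totals = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (Config c : configs) {
            pool.submit(() -> {
                // Map: extract the metric from one config (independent work).
                long metric = c.metric();
                // Reduce: merge into the per-region total atomically.
                totals.merge(c.region(), metric, Long::sum);
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return totals;
    }
}
```

The key property is the same one MapReduce relies on: each config can be processed without seeing any other config, so the only shared state is the merge at the end.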

What I’m Learning

MapReduce as a specific framework (Hadoop) is somewhat dated. But the pattern (split work, process in parallel, combine results) is everywhere. Every time you partition data across workers and aggregate results, you’re doing MapReduce. Understanding the pattern helps you spot parallelizable problems even when you’re not using a big data framework.
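The pattern even fits in a few lines of plain JDK, for example. A sketch using parallel streams: parallelStream() splits the work (map), groupingByConcurrent routes by key (shuffle), and counting() aggregates (reduce):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import static java.util.stream.Collectors.counting;
import static java.util.stream.Collectors.groupingByConcurrent;

// Word count as a local, in-process MapReduce.
public class LocalWordCount {
    public static Map<String, Long> count(List<String> documents) {
        return documents.parallelStream()
                .flatMap(doc -> Arrays.stream(doc.toLowerCase().split("\\s+")))
                .filter(w -> !w.isEmpty())          // drop empty splits
                .collect(groupingByConcurrent(w -> w, counting()));
    }
}
```

Same three phases as the 10TB version, just with threads instead of machines and a concurrent map instead of a network shuffle.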

The insight that clicked for me: if your map function doesn’t need to see other records, the problem is parallelizable. If it does, you need a different approach.

Have you applied MapReduce-style patterns outside of traditional big data tools?