TL;DR Processing large-scale data sequentially is slow. MapReduce is a framework for parallel batch processing: you write map and reduce, and the system handles splitting the work, grouping intermediate results, retries, and execution across machines.
The Problem
Suppose we have a very large set of web server logs and want to count how many times each URL was accessed. On one machine, the logic is simple:
- Read every log line.
- Extract the URL.
- Increment a counter for that URL.
This breaks down when the input is huge. We can process chunks in parallel, but then the same URL may appear in many chunks across many workers. The hard part is making sure all partial counts for the same URL meet in one place and get aggregated correctly.
The Model
MapReduce uses key-value pairs as the interface between steps. map receives input records and emits intermediate pairs:
map(k1, v1) -> list(k2, v2)
reduce receives one intermediate key and all values emitted for that key:
reduce(k2, list(v2)) -> list(k3, v3)
The intermediate key is the contract between map and reduce: all values with this key should be aggregated together. The shuffle phase moves map outputs around the cluster so all values for the same key reach the same reducer partition.
Example: URL Access Frequency
Input:
GET /home
GET /pricing
GET /home
GET /docs
GET /home
GET /pricing
Map output:
(/home, 1)
(/pricing, 1)
(/home, 1)
(/docs, 1)
(/home, 1)
(/pricing, 1)
Shuffle/group:
/home -> [1, 1, 1]
/pricing -> [1, 1]
/docs -> [1]
Reduce output:
/home -> 3
/pricing -> 2
/docs -> 1
Many workers can process different log chunks in parallel. Each worker only emits (URL, 1) pairs. The framework handles routing and grouping so reducers can produce the final counts.
What the Framework Handles
The value of MapReduce is that it hides distributed-systems work behind a simple interface:
- splitting input into chunks and scheduling map tasks near the data when possible
- partitioning intermediate keys across reducers, often with
hash(key) mod R - shuffling, sorting, and grouping values so each reducer receives
key -> list(values) - retrying failed tasks and sometimes duplicating slow straggler tasks
The important caveat: reducer partitioning does not make data magically local. Shuffle can move a lot of data over the network, so it is often one of the expensive parts of the job.
When It Does Not Fit
MapReduce fits large offline batch jobs with independent records and aggregation by key. It is a poor fit for:
- low-latency serving: the job startup and shuffle overhead are too high.
- interactive queries: users expect fast responses, not batch-job execution.
- small jobs: framework overhead can dominate the actual work.
- iterative algorithms: rereading and reshuffling the same data is inefficient.
- complex multi-stage workflows: chaining many MapReduce jobs becomes awkward.
Takeaway
Map does local work. Shuffle makes the same keys meet. Reduce aggregates.
The abstraction is powerful because it constrains the program enough for the framework to safely handle parallel execution, failures, and data movement.