This project is a simplified MapReduce framework implemented in Go, inspired by the classic MapReduce: Simplified Data Processing on Large Clusters (Google) paper. It showcases how large-scale distributed computation can be achieved by dividing tasks into Map and Reduce phases, coordinated by a central Coordinator and executed by multiple parallel Worker processes.
- ✅ Coordinator & Worker architecture with RPC communication
- ✅ Concurrent job scheduling and monitoring
- ✅ Fault-tolerant re-assignment of tasks (if worker fails or times out)
- ✅ Word count example using custom `mapF` and `reduceF` functions
- ✅ Threaded design: Coordinator and Worker run concurrently
- ✅ Modular code — easily plug in new Map/Reduce logic
```
mapreduce/
│
├── main.go            # Entry point for Coordinator or Worker
│
├── mr/
│   ├── coordinator.go # Coordinator: assigns jobs & tracks progress
│   ├── worker.go      # Worker: executes map/reduce tasks
│   ├── rpc.go         # Common RPC definitions & utilities
│
├── input/             # Input files for word count
└── output/            # Output files generated by reduce tasks
```
- Takes input files and splits them into Map tasks.
- Assigns tasks to available Workers through RPC calls.
- Tracks completion and transitions to Reduce phase.
- Calls `Done()` periodically to check whether all jobs are finished.
- Connects to the Coordinator via RPC.
- Requests work (`RequestTask`) and executes map or reduce tasks using user-defined functions.
- Writes intermediate files (`mr-X-Y`) for map outputs and merged files for reduce results.
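The `mr-X-Y` naming ties each intermediate file to map task `X` and reduce bucket `Y`, where `Y` is chosen by hashing the key. A sketch of that partitioning, using the FNV-1a hash common in MapReduce skeletons (the actual hash in this project's `worker.go` may differ):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// ihash maps a key to a non-negative integer so that every occurrence of
// the same word lands in the same reduce bucket, and therefore in the
// same mr-X-Y intermediate file.
func ihash(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() & 0x7fffffff)
}

// intermediateName builds the mr-X-Y filename for map task X, reduce bucket Y.
func intermediateName(mapTask, reduceBucket int) string {
	return fmt.Sprintf("mr-%d-%d", mapTask, reduceBucket)
}

func main() {
	nReduce := 3
	bucket := ihash("hello") % nReduce // same key, same bucket, every time
	fmt.Println(intermediateName(2, bucket))
}
```

Deterministic hashing is what lets each reduce task read exactly the `mr-*-Y` files for its own bucket.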
Defined in `main.go`:

```go
func mapF(filename string, contents string) []mr.KeyValue {
	// Split text by whitespace and emit (word, "1") pairs.
	kva := []mr.KeyValue{}
	for _, word := range strings.Fields(contents) {
		kva = append(kva, mr.KeyValue{Key: word, Value: "1"})
	}
	return kva
}

func reduceF(key string, values []string) string {
	// Return the count of occurrences for each key.
	return strconv.Itoa(len(values))
}
```

Build the binary:

```sh
go build -o mapreduce .
```

Run the coordinator with input files:
```sh
go run main.go coordinator -nreduce=3 input/pg-being_ernest.txt input/pg-grimm.txt input/pg-huckleberry.txt
```

This starts the Coordinator in a goroutine and monitors job completion via `Done()`.
In separate terminals (or as background processes):
```sh
go run main.go worker
```

Each worker connects to the coordinator and starts processing map/reduce tasks.
Once all tasks are finished:
```
Coordinator: all tasks finished
Coordinator exited cleanly
```
Output files will appear in your working directory (e.g., mr-out-0, mr-out-1, ...).
```sh
cat mr-out-0
```

```
the 120
and 95
to 80
...
```

- Coordinator runs in the main process with a background goroutine checking `Done()`.
- Worker runs in its own goroutine to handle tasks concurrently.
- Communication happens via Go's RPC system.
- The main thread sleeps periodically and exits only when all jobs are marked done.
| Flag | Description | Default |
|---|---|---|
| `-nreduce` | Number of reduce tasks | 3 |
| Input files | List of files for the map phase | Required |
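The flags in the table can be handled with the standard library's `flag` package. This is a sketch of one way to do it (the `parseArgs` helper is illustrative, not necessarily how `main.go` is written):

```go
package main

import (
	"flag"
	"fmt"
)

// parseArgs reads -nreduce (default 3) and treats the remaining
// positional arguments as the input files for the map phase.
func parseArgs(args []string) (nReduce int, files []string, err error) {
	fs := flag.NewFlagSet("coordinator", flag.ContinueOnError)
	n := fs.Int("nreduce", 3, "number of reduce tasks")
	if err := fs.Parse(args); err != nil {
		return 0, nil, err
	}
	return *n, fs.Args(), nil
}

func main() {
	n, files, err := parseArgs([]string{"-nreduce=5", "input/a.txt", "input/b.txt"})
	if err != nil {
		panic(err)
	}
	fmt.Println(n, files)
}
```

Note that Go's `flag` package stops parsing at the first positional argument, so the flag must come before the file list, as in the coordinator command shown earlier.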
You can implement your own computation by modifying `mapF` and `reduceF`.
For example:
- Word count → counting occurrences of each word
- Inverted index → mapping words to the documents that contain them
- Log analysis → grouping entries by IP address or time range
Each new computation only requires changing the Map and Reduce logic — the coordinator and worker logic remains untouched.
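As an illustration of swapping only the user functions, here is one hypothetical way the inverted-index variant could look. The local `KeyValue` type mirrors `mr.KeyValue` so the sketch is self-contained; the real functions would use the `mr` package directly.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// KeyValue mirrors mr.KeyValue for a self-contained example.
type KeyValue struct{ Key, Value string }

// invertedMapF emits (word, document) pairs instead of (word, "1").
func invertedMapF(doc string, contents string) []KeyValue {
	kva := []KeyValue{}
	for _, w := range strings.Fields(contents) {
		kva = append(kva, KeyValue{Key: w, Value: doc})
	}
	return kva
}

// invertedReduceF de-duplicates and sorts the document names for a word,
// producing one comma-separated posting list per key.
func invertedReduceF(key string, values []string) string {
	seen := map[string]bool{}
	docs := []string{}
	for _, v := range values {
		if !seen[v] {
			seen[v] = true
			docs = append(docs, v)
		}
	}
	sort.Strings(docs)
	return strings.Join(docs, ",")
}

func main() {
	fmt.Println(invertedReduceF("go", []string{"b.txt", "a.txt", "b.txt"})) // a.txt,b.txt
}
```

The coordinator, worker, and RPC plumbing are untouched; only the two functions change.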
- MapReduce: Simplified Data Processing on Large Clusters (Google)
- MIT 6.824: Distributed Systems Labs
- Go Official Documentation: https://golang.org/doc/
Uttam Raj, B.Tech in Computational Mathematics @ NIT Agartala. Passionate about systems programming, distributed systems, and Go.