This project is a functional implementation of the MapReduce programming model, written entirely in Go. It provides a core library for distributed data processing and includes a classic word count example to demonstrate its capabilities. The system processes large datasets in parallel across multiple worker nodes, coordinated by a central master.
MapReduce is a programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster. A MapReduce job is composed of two main phases:
- Map Phase: The master node takes the input, partitions it into smaller sub-problems, and distributes them to worker nodes. Each worker applies a
mapfunction to its sub-problem and writes the intermediate results. - Reduce Phase: The master node signals the workers to start the reduce phase. Each worker collects the intermediate results relevant to its partition, applies a
reducefunction to aggregate the data, and produces the final output.
The framework handles the complexities of parallelization, fault tolerance, data distribution, and load balancing.
This implementation is structured as a library (pkg/) and a standalone example (examples/wordcount):
-
Master (
pkg/master): The core logic for the central coordinator. It is responsible for:- Accepting job submissions via RPC.
- Scheduling Map and Reduce tasks to available Worker nodes.
- Tracking the state of each task (Idle, InProgress, Completed).
- Transitioning from the Map phase to the Reduce phase.
- Signaling when the entire job is finished.
-
Worker (
pkg/worker): The core logic for a worker process. A single worker can perform both Map and Reduce tasks as assigned by the Master. The worker is generic and receives themapFandreduceFfunctions from the specific application (e.g., word count) that uses it. -
Word Count Example (
examples/wordcount): A self-contained application that demonstrates the library's usage. It can be run in one of three roles:- Master: Starts the master server.
- Worker: Starts a worker process, providing it with the word count
mapandreducefunctions. - Submit: Connects to the master and submits a job with a directory of input files.
- Library-Based Design: The core MapReduce logic is decoupled from the specific application, making the framework reusable for other problems.
- RPC-Based Communication: Robust and type-safe communication between the master and workers using Go's built-in
net/rpcpackage. - Centralized Task Scheduling: A thread-safe
TaskQueueimplemented with Go channels ensures that workers receive tasks efficiently and concurrently. - Automated Testing: The project includes unit tests for core library logic (
make test) and a full end-to-end execution script (make script).
/
├── examples/
│ └── wordcount/ # Self-contained word count application
├── inputs/ # Directory for input text files
├── outputs/ # Directory for final reduce output (gitignored)
├── tmp/ # Directory for intermediate files (gitignored)
├── pkg/
│ ├── master/ # Core logic for the master
│ ├── scheduler/ # Task queue implementation
│ ├── types/ # Shared types for tasks and RPC
│ └── worker/ # Core logic for the worker
├── .gitignore
├── go.mod
├── Makefile # For easy execution of common tasks
└── run-wordcount.sh # End-to-end execution script
- Go (version 1.21 or later)
make
This is the easiest way to see the entire system in action. The script automates starting the master and workers, submitting the job, and waiting for completion.
make scriptThis command will:
- Clean up artifacts from any previous runs.
- Start one master process in the background (with logs visible in your terminal).
- Start three worker processes in the background (with logs discarded).
- Submit a job using the files in the
inputs/directory. - Wait for the master to signal that the job is complete.
After execution, you can inspect the final results in the outputs/ directory.
To run the unit tests for the core library packages:
make testFor debugging or a more hands-on demonstration, you can run the components in separate terminals:
-
Terminal 1: Start the Master
make master
-
Terminal 2 (and 3, 4...): Start one or more Workers
make wordcount-worker
-
Terminal 5: Submit the Job
make wordcount-submit