A simplified implementation of Google's MapReduce framework optimized for deployment on a single VM with Docker containers.
- Coordinator: Manages job submission and task distribution
- Workers: Execute map and reduce tasks (4 workers, each limited to 1 CPU)
- Shared Storage: Common volume mounted to all containers
- Client: Command-line tool for job submission and monitoring
- Docker and Docker Compose
- Python 3.9+
```
docker-compose up --build
```

This will start:
- 1 Coordinator container (port 50051)
- 4 Worker containers
From the host machine:
```
python src/client/client.py submit \
  --input /shared/input/wordcount_input.txt \
  --output /shared/output/wordcount_result \
  --job-file /shared/jobs/wordcount.py \
  --num-map 8 \
  --num-reduce 4
```

Check a job's status:

```
python src/client/client.py status --job-id <job_id>
```

List all jobs:

```
python src/client/client.py list
```

Fetch a completed job's results:

```
python src/client/client.py results --job-id <job_id>
```

To run without Docker, install the dependencies and generate the gRPC stubs:

```
pip install -r requirements.txt
python -m grpc_tools.protoc -I./proto --python_out=./src --grpc_python_out=./src proto/coordinator.proto proto/worker.proto
```

Start the coordinator, then one or more workers:

```
python src/coordinator/server.py
python src/worker/server.py --worker-id worker-1 --coordinator localhost:50051 --port 50052
```

Point the client at the local coordinator:

```
python src/client/client.py --coordinator-host localhost:50051 <command>
```
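The job file passed via `--job-file` supplies the map and reduce logic. The framework's exact job-file interface is not documented here; a plausible `wordcount.py`, assuming the framework expects generator functions named `map_fn(key, value)` and `reduce_fn(key, values)`, might look like:

```python
# Hypothetical job file: the names map_fn/reduce_fn and their signatures
# are assumptions for illustration, not the framework's confirmed API.

def map_fn(key, value):
    """key: line offset in the input file, value: one line of text.

    Emits (word, 1) for every whitespace-separated token.
    """
    for word in value.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """key: a word, values: iterable of counts from all map tasks.

    Emits the total count for the word.
    """
    yield key, sum(values)
```

Whatever the real interface is, keeping the functions pure (no global state, output depends only on the inputs) is what makes map tasks safe to retry on worker failure.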
```
├── docker-compose.yml        # Docker Compose configuration
├── Dockerfile.coordinator    # Coordinator container image
├── Dockerfile.worker         # Worker container image
├── requirements.txt          # Python dependencies
├── proto/                    # gRPC protocol definitions
│   ├── coordinator.proto
│   └── worker.proto
├── src/
│   ├── coordinator/          # Coordinator implementation
│   │   └── server.py
│   ├── worker/               # Worker implementation
│   │   └── server.py
│   └── client/               # Client CLI
│       └── client.py
├── shared/                   # Shared storage volume
│   ├── input/                # Input data files
│   ├── output/               # Final output files
│   ├── intermediate/         # Intermediate map outputs
│   └── jobs/                 # User job files
├── tests/                    # Test suite
└── examples/                 # Example applications
```
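The `shared/intermediate/` directory holds the partitioned map outputs that reduce tasks later read. A common scheme, sketched below (the file-naming convention and choice of hash are assumptions, not taken from this project), routes each key to one of `num_reduce` partitions with a deterministic hash. Python's built-in `hash()` is salted per process, so map tasks running in separate containers would disagree on the routing; a stable hash such as CRC32 avoids that:

```python
import zlib


def partition(key: str, num_reduce: int) -> int:
    # CRC32 is deterministic across processes and containers, unlike
    # Python's salted built-in hash(), so every map task routes a given
    # key to the same reduce partition.
    return zlib.crc32(key.encode("utf-8")) % num_reduce


def intermediate_path(map_task: int, reduce_task: int) -> str:
    # Hypothetical naming convention (mr-<map_id>-<reduce_id>), chosen
    # so reduce task R can glob all files ending in "-R".
    return f"/shared/intermediate/mr-{map_task}-{reduce_task}"
```

With this layout, reduce task R simply reads `mr-*-R` from the shared volume, which is why a common volume mounted to all containers stands in for a distributed filesystem here.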
- Docker Compose setup
- gRPC service definitions
- Basic coordinator scaffolding
- Basic worker scaffolding
- Shared storage setup
- Task scheduling and distribution
- Map task execution
- Reduce task execution
- Data partitioning and shuffling
- Worker heartbeat and monitoring
- Task retry and failure handling
- Job state management and phase transitions
- Progress tracking and reporting
- Combiner integration in map tasks
- Example applications
- Performance benchmarking
- Documentation and report
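Worker heartbeat and task retry, listed above, typically follow a lease pattern: the coordinator records the last heartbeat per worker and reschedules any in-flight tasks from a worker that goes silent past a timeout. A minimal sketch (the timeout value and data structures are assumptions, not this project's implementation):

```python
import time

# Assumed timeout; real systems tune this against heartbeat frequency.
HEARTBEAT_TIMEOUT = 10.0  # seconds


class WorkerTracker:
    def __init__(self):
        self.last_seen = {}  # worker_id -> monotonic time of last heartbeat
        self.assigned = {}   # worker_id -> set of in-flight task ids

    def heartbeat(self, worker_id):
        # Called by the coordinator's gRPC handler on each heartbeat.
        self.last_seen[worker_id] = time.monotonic()

    def reap_dead_workers(self):
        """Return task ids to reschedule from workers that went silent."""
        now = time.monotonic()
        to_retry = []
        for wid, seen in list(self.last_seen.items()):
            if now - seen > HEARTBEAT_TIMEOUT:
                to_retry.extend(self.assigned.pop(wid, set()))
                del self.last_seen[wid]
        return to_retry
```

Because map and reduce functions are deterministic and write to the shared volume, re-executing a reaped task on another worker is safe: the rerun simply overwrites (or re-creates) the same intermediate files.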
This is an educational project for learning MapReduce concepts.