An implementation of Matrix Multiplication using the MapReduce paradigm in Python, providing a foundation for distributed data processing and large-scale matrix arithmetic.
Source Code · Technical Specification · Google Colaboratory · Live Demo
Authors · Overview · Features · Structure · Results · Quick Start · Usage Guidelines · License · About · Acknowledgments
HADOOP | Matrix Multiplication represents a pivotal milestone in the study of the MapReduce Paradigm. Developed during the academic study of Big Data Analytics, this project focuses on the practical application of parallel processing to execute large-scale matrix multiplication across distributed environments.
The project utilizes Python 3 to simulate the partitioning and aggregation of data, leveraging a coordinate-based key-value system to ensure computational accuracy and efficiency in a distributed architecture.
Tip
Hadoop Streaming Integration
This implementation is designed for compatibility with Hadoop Streaming, utilizing standard UNIX pipes (stdin/stdout). The Mapper partitions Matrix A and B elements into intermediate cell-based keys, while the Reducer aggregates these batches to compute the final dot product for the target matrix.
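One plausible sketch of that map logic is shown below. The line format `matrix,row,col,value` and the hard-coded dimensions are illustrative assumptions for this sketch, not necessarily what the repository's actual `mapper.py` uses:

```python
import sys

# Result-matrix dimensions: rows of A and columns of B. In the real
# project these are presumably read from cache.txt; hard-coded here
# for illustration.
A_ROWS, B_COLS = 2, 2

def map_line(line):
    """Expand one matrix element into its intermediate (cell, value) pairs.

    Assumes input lines of the form `matrix,row,col,value`, e.g. `A,0,1,3`.
    """
    matrix, i, j, value = line.strip().split(",")
    i, j = int(i), int(j)
    pairs = []
    if matrix == "A":
        # A[i][k] contributes to every result cell (i, col).
        for col in range(B_COLS):
            pairs.append(((i, col), ("A", j, value)))
    else:
        # B[k][j] contributes to every result cell (row, j).
        for row in range(A_ROWS):
            pairs.append(((row, j), ("B", i, value)))
    return pairs

if __name__ == "__main__":
    # Hadoop Streaming wiring: read stdin, emit tab-separated key/value pairs.
    for line in sys.stdin:
        if line.strip():
            for (r, c), (m, k, v) in map_line(line):
                print(f"{r},{c}\t{m},{k},{v}")
```

Piped through `sort -k1,1`, these keys group every contribution to a single output cell together, which is exactly what Hadoop's shuffle phase does at scale.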
| Feature | Description |
|---|---|
| Parallel Computing | Implements distributed matrix multiplication logic via MapReduce. |
| Coordinate Mapping | Mapper dynamically expands elements based on target result cell coordinates. |
| Efficient Aggregation | Reducer performs memory-efficient dot product calculations. |
| Local Simulation | Supports local execution using standard UNIX pipe-based simulations. |
| Archival Quality | Integrated scholarly citation metadata and technical specifications. |
- Language: Python 3.x
- Framework: MapReduce (Distributed Computing)
- Tooling: Hadoop Streaming API
- Batch Processing: Distributed Matrix Arithmetic
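The "Efficient Aggregation" feature can be sketched as follows. The intermediate record format `row,col<TAB>matrix,k,value` is an assumption for this sketch, not necessarily what the actual `reducer.py` consumes:

```python
import sys
from itertools import groupby

def reduce_cell(values):
    """Compute one result cell's dot product from its grouped contributions.

    `values` holds (matrix, k, value) triples; each index k pairs one
    A element with one B element.
    """
    a, b = {}, {}
    for matrix, k, value in values:
        (a if matrix == "A" else b)[k] = float(value)
    # Sum A[i][k] * B[k][j] over the indices present in both matrices.
    return sum(a[k] * b[k] for k in a if k in b)

def parse(line):
    """Split `row,col<TAB>matrix,k,value` into (key, (matrix, k, value))."""
    key, payload = line.rstrip("\n").split("\t")
    matrix, k, value = payload.split(",")
    return key, (matrix, int(k), value)

if __name__ == "__main__":
    records = (parse(line) for line in sys.stdin if line.strip())
    # The shuffle delivers sorted keys, so groupby sees each key exactly once.
    for key, group in groupby(records, key=lambda kv: kv[0]):
        print(f"{key}\t{reduce_cell(v for _, v in group)}")
```

Because contributions arrive already grouped by key, the reducer only ever holds one output cell's operands in memory at a time, which is what makes the aggregation memory-efficient.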
```
HADOOP/
│
├── docs/                      # Technical Documentation
│   └── SPECIFICATION.md       # Architecture & Flow Specification
│
├── Source Code/               # Primary Application Layer
│   ├── HADOOP.ipynb           # Interactive Experimental Environment
│   ├── mapper.py              # Map Phase Logic (Partitioning)
│   ├── reducer.py             # Reduce Phase Logic (Aggregation)
│   ├── input.txt              # Sample Matrix Input Data
│   └── cache.txt              # Dimension Configuration Cache
│
├── .gitattributes             # Git configuration
├── .gitignore                 # Git exclusion manifest
├── CITATION.cff               # Scholarly Citation Metadata
├── codemeta.json              # Machine-Readable Project Metadata
├── LICENSE                    # MIT License Terms
├── README.md                  # Comprehensive Archival Entrance
└── SECURITY.md                # Security Policy & Protocol
```

```mermaid
graph TD
    Input[("Input Data (Matrix A & B)")] -->|Reads| Mapper["Mapper (mapper.py)"]
    Cache[("Cache (Matrix Dimensions)")] -->|Configures| Mapper
    subgraph Map_Phase ["Map Phase"]
        Mapper -->|Partitions| Intermediate["Intermediate Key-Value Pairs"]
        Intermediate -->|Emits| Shuffle["Shuffle & Sort"]
    end
    subgraph Reduce_Phase ["Reduce Phase"]
        Shuffle -->|Groups by Key| Reducer["Reducer (reducer.py)"]
        Reducer -->|Dot Product| Result["Result Aggregator"]
    end
    Result -->|Writes| Output[("Output (Matrix Product)")]
```
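To make the diagrammed flow concrete, here is a self-contained pure-Python walkthrough of Map -> Shuffle -> Reduce on a 2x2 example. The in-memory tuple encoding is illustrative; the real project streams text records through UNIX pipes instead:

```python
from itertools import groupby

# C = A x B for a 2x2 example, traced through the three MapReduce stages.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
N = 2  # square dimension for this illustration

# Map phase: expand each element into the result cells it contributes to.
intermediate = []
for i in range(N):
    for j in range(N):
        for c in range(N):
            intermediate.append(((i, c), ("A", j, A[i][j])))  # A[i][j] feeds row i
        for r in range(N):
            intermediate.append(((r, j), ("B", i, B[i][j])))  # B[i][j] feeds col j

# Shuffle & sort: order contributions by result-cell key so they group together.
intermediate.sort(key=lambda kv: kv[0])

# Reduce phase: dot product per result cell.
C = [[0] * N for _ in range(N)]
for (r, c), group in groupby(intermediate, key=lambda kv: kv[0]):
    a, b = {}, {}
    for _, (m, k, v) in group:
        (a if m == "A" else b)[k] = v
    C[r][c] = sum(a[k] * b[k] for k in a)

print(C)  # [[19, 22], [43, 50]]
```

The `sort` step stands in for Hadoop's shuffle: once records are ordered by key, each result cell's operands arrive contiguously and the reduce loop never needs more than one cell's data at a time.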
- Runtime: Python 3.x installed on your workstation.
- Environment: Any modern standards-compliant Unix-like environment for Python execution.
Warning
Local Execution
While the project is designed for Hadoop, it can be simulated locally using standard UNIX pipes. Ensure that input.txt and cache.txt are correctly configured in the Source Code directory before execution.
1. Clone the Repository: Start by cloning the project to your local workstation:

   ```bash
   git clone https://github.com/Amey-Thakur/HADOOP.git
   cd HADOOP
   ```

2. Initialize Environment: Navigate to the Source Code directory where the computational logic resides:

   ```bash
   cd "Source Code"
   ```

3. Configure Matrix Dimensions: Verify that cache.txt is configured with the correct dimensions (e.g., 2,2 for a 2x2 matrix product):

   ```bash
   # Format: [A_Rows],[B_Cols]
   echo "2,2" > cache.txt
   ```

4. Execute Distributed Logic: Simulate the MapReduce pipeline by piping the input through the Mapper and Reducer:

   ```bash
   cat input.txt | python mapper.py | sort -k1,1 | python reducer.py
   ```
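The mapper presumably consults cache.txt for the result dimensions before expanding elements. A minimal reader, assuming the `A_rows,B_cols` format written by the echo command above (the actual parsing in mapper.py may differ), might look like:

```python
# Sketch: load the result-matrix dimensions from the dimension cache.
# Assumes a single line of the form "A_rows,B_cols", e.g. "2,2".

def read_dimensions(path="cache.txt"):
    """Return (rows of A, cols of B) parsed from the dimension cache."""
    with open(path) as f:
        a_rows, b_cols = f.read().strip().split(",")
    return int(a_rows), int(b_cols)
```

In a Hadoop Streaming deployment such a side file is typically shipped to every mapper via the distributed cache, which is likely why the project names it cache.txt.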
Tip
Distributed Matrix Multiplication | MapReduce Simulation
Experience the high-fidelity computational simulation of this 7th Semester Big Data Analytics project, featuring a multi-stage MapReduce pipeline (Map -> Shuffle -> Reduce) designed for scalable distributed matrix arithmetic and parallelized data processing.
This repository is openly shared to support learning and knowledge exchange across the academic community.
For Students
Use this project as reference material for understanding distributed data representations, intermediate key mapping, and the practical implementation of the MapReduce paradigm. The source code is available for study to facilitate self-paced learning and exploration of parallel computing patterns.
For Educators
This project may serve as a practical lab example or supplementary teaching resource for Big Data Analytics (CSDLO7032 / CSL704). Attribution is appreciated when utilizing content.
For Researchers
The documentation and design approach may provide insights into academic project structuring and the implementation of distributed arithmetic architectures.
This repository and all its creative and technical assets are made available under the MIT License. See the LICENSE file for complete terms.
Note
Summary: You are free to share and adapt this content for any purpose, even commercially, as long as you provide appropriate attribution to the original author.
Copyright © 2021 Amey Thakur
Created & Maintained by: Amey Thakur
Academic Journey: Bachelor of Engineering in Computer Engineering (2018-2022)
Institution: Terna Engineering College, Navi Mumbai
University: University of Mumbai
This project features the HADOOP Matrix Multiplication, an application developed during my early explorations in Big Data Analytics. It highlights the use of the MapReduce paradigm to build scalable, parallelized computational solutions.
Connect: GitHub · LinkedIn · ORCID
Grateful acknowledgment to the faculty members of the Department of Computer Engineering at Terna Engineering College for their guidance and instruction in Big Data Analytics. Their expertise in distributed systems and parallel processing helped me develop a strong understanding of professional computational development methodologies.
Special thanks to the mentors and peers whose encouragement, discussions, and support contributed meaningfully to this learning experience.
Authors · Overview · Features · Structure · Results · Quick Start · Usage Guidelines · License · About · Acknowledgments
🔬 Big Data Analytics Laboratory · 🐘 HADOOP
Computer Engineering (B.E.) - University of Mumbai
Semester-wise curriculum, laboratories, projects, and academic notes.
