HADOOP | Matrix Multiplication

An implementation of Matrix Multiplication using the MapReduce paradigm in Python, providing a foundation for distributed data processing and large-scale computational arithmetic.

Source Code · Technical Specification · Google Colaboratory · Live Demo

Authors · Overview · Features · Structure · Results · Quick Start · Usage Guidelines · License · About · Acknowledgments

Authors

Terna Engineering College | Computer Engineering | Batch of 2022

Amey Thakur

Overview

HADOOP | Matrix Multiplication is a study project on the MapReduce paradigm, developed as part of coursework in Big Data Analytics. It applies parallel processing to matrix multiplication so that large matrices can be multiplied across a distributed environment.

The project uses Python 3 to simulate the partitioning and aggregation of data. Each intermediate value is keyed by the coordinates of the result cell it contributes to, so every value needed for one output cell arrives at the same reducer.
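The coordinate-based keying can be illustrated with a minimal sketch. It is not taken from `mapper.py`: the function name `fan_out`, the tuple layout, and the 0-based indices are assumptions made for illustration. For a product C = A x B, an element A[i][k] is needed by every cell in row i of C, and an element B[k][j] by every cell in column j of C.

```python
from typing import Iterator, Tuple

Key = Tuple[int, int]            # (row, col) of a cell in the result matrix
Value = Tuple[str, int, float]   # (source matrix, shared index k, element value)


def fan_out(matrix: str, row: int, col: int, value: float,
            a_rows: int, b_cols: int) -> Iterator[Tuple[Key, Value]]:
    """Yield one (key, value) pair per result cell that this element feeds.

    Hypothetical helper: the names and tuple layout are illustrative only.
    """
    if matrix == "A":
        # A[row][col] contributes to every cell in result row `row`.
        for j in range(b_cols):
            yield (row, j), ("A", col, value)
    elif matrix == "B":
        # B[row][col] contributes to every cell in result column `col`.
        for i in range(a_rows):
            yield (i, col), ("B", row, value)
    else:
        raise ValueError(f"unknown matrix name: {matrix!r}")


# Element A[0][1] = 5.0 in a 2x2 product is sent to result cells (0,0) and (0,1).
pairs = list(fan_out("A", 0, 1, 5.0, a_rows=2, b_cols=2))
```

Carrying the shared index `k` in the value is what lets a reducer later pair each A element with the matching B element.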

Tip

Hadoop Streaming Integration

This implementation is designed for compatibility with Hadoop Streaming, utilizing standard UNIX pipes (stdin/stdout). The Mapper partitions Matrix A and B elements into intermediate cell-based keys, while the Reducer aggregates these batches to compute the final dot product for the target matrix.
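The reduce side of this description might look like the sketch below. It assumes a hypothetical line format of `i,j<TAB>M,k,v` (result cell, then source matrix, shared index, and value); the actual format used by `reducer.py` may differ. The tab matters because Hadoop Streaming, by default, treats the text up to the first tab of each line as the key.

```python
from itertools import groupby
from typing import Iterable, Iterator


def reduce_lines(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Consume key-sorted 'i,j<TAB>M,k,v' lines and yield 'i,j<TAB>sum' lines.

    Assumed format, for illustration only.
    """
    def key_of(line: str) -> str:
        return line.split("\t", 1)[0]

    # groupby only merges consecutive lines, which is why input must be sorted by key.
    for key, group in groupby(sorted_lines, key=key_of):
        a_vals, b_vals = {}, {}
        for line in group:
            matrix, k, v = line.rstrip("\n").split("\t", 1)[1].split(",")
            (a_vals if matrix == "A" else b_vals)[int(k)] = float(v)
        # Dot product over the shared index k; an index missing on either side adds nothing.
        total = sum(a_vals[k] * b_vals[k] for k in a_vals.keys() & b_vals.keys())
        yield f"{key}\t{total:g}"


# Illustrative input for two result cells of [[1, 2], ...] x [[5, 6], [7, 8]].
demo = [
    "0,0\tA,0,1", "0,0\tA,1,2", "0,0\tB,0,5", "0,0\tB,1,7",
    "0,1\tA,0,1", "0,1\tA,1,2", "0,1\tB,0,6", "0,1\tB,1,8",
]
results = list(reduce_lines(demo))
```

Only the values for one result cell are held in memory at a time. In a streaming job the same function could be fed `sys.stdin` and its output printed line by line.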

Features

| Feature | Description |
| --- | --- |
| Parallel Computing | Implements distributed matrix multiplication logic via MapReduce. |
| Coordinate Mapping | Mapper dynamically expands elements based on target result cell coordinates. |
| Efficient Aggregation | Reducer performs memory-efficient dot product calculations. |
| Local Simulation | Supports local execution using standard UNIX pipe-based simulations. |
| Archival Quality | Integrated scholarly citation metadata and technical specifications. |

Tech Stack

Language: Python 3.x
Framework: MapReduce (Distributed Computing)
Tooling: Hadoop Streaming API
Batch Processing: Distributed Matrix Arithmetic

Project Structure

HADOOP/
│
├── docs/                            # Technical Documentation
│   └── SPECIFICATION.md             # Architecture & Flow Specification
│
├── Source Code/                     # Primary Application Layer
│   ├── HADOOP.ipynb                 # Interactive Experimental Environment
│   ├── mapper.py                    # Map Phase Logic (Partitioning)
│   ├── reducer.py                   # Reduce Phase Logic (Aggregation)
│   ├── input.txt                    # Sample Matrix Input Data
│   └── cache.txt                    # Dimension Configuration Cache
│
├── .gitattributes                   # Git configuration
├── .gitignore                       # Git exclusion manifest
├── CITATION.cff                     # Scholarly Citation Metadata
├── codemeta.json                    # Machine-Readable Project Metadata
├── LICENSE                          # MIT License Terms
├── README.md                        # Comprehensive Archival Entrance
└── SECURITY.md                      # Security Policy & Protocol

System Architecture & Process Flow

Architectural Logic (Map -> Shuffle -> Reduce)

graph TD
    Input[("Input Data (Matrix A & B)")] -->|Reads| Mapper["Mapper (mapper.py)"]
    Cache[("Cache (Matrix Dimensions)")] -->|Configures| Mapper
    
    subgraph Map_Phase ["Map Phase"]
        Mapper -->|Partitions| Intermediate["Intermediate Key-Value Pairs"]
        Intermediate -->|Emits| Shuffle["Shuffle & Sort"]
    end
    
    subgraph Reduce_Phase ["Reduce Phase"]
        Shuffle -->|Groups by Key| Reducer["Reducer (reducer.py)"]
        Reducer -->|Dot Product| Result["Result Aggregator"]
    end
    
    Result -->|Writes| Output[("Output (Matrix Product)")]
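The Shuffle & Sort stage in the diagram is handled by Hadoop on a cluster and by `sort` in a local simulation. A minimal in-memory sketch of what that stage does is shown here; the function name `shuffle` and the sample pairs are assumptions for illustration, not code from this repository.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple


def shuffle(pairs: Iterable[Tuple[Hashable, object]]) -> List[Tuple[Hashable, List[object]]]:
    """Group mapper output by key and return the groups in sorted key order."""
    groups: Dict[Hashable, List[object]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Sorting by key mirrors the ordering a reducer sees after the shuffle.
    return sorted(groups.items(), key=lambda item: item[0])


# Mapper output arrives in arbitrary order; keys are result cell coordinates.
mapped = [
    ((0, 1), ("B", 0, 6.0)),
    ((0, 0), ("A", 0, 1.0)),
    ((0, 1), ("A", 0, 1.0)),
    ((0, 0), ("B", 0, 5.0)),
]
grouped = shuffle(mapped)
```

After this step, each key appears once with all of its values collected, which is the shape the Reduce phase needs to compute one dot product per result cell.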

Quick Start

1. Prerequisites

Runtime: Python 3.x installed on your workstation.
Environment: A Unix-like shell that provides pipes and the standard cat and sort utilities.

Warning

Local Execution

While the project is designed for Hadoop, it can be simulated locally using standard UNIX pipes. Ensure that input.txt and cache.txt are correctly configured in the Source Code directory before execution.
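One way to check the dimension configuration before a run is sketched below. It assumes only the `[A_Rows],[B_Cols]` format described in the setup steps; the function name `parse_dimensions` is hypothetical and the real `mapper.py` may read the file differently.

```python
from typing import Tuple


def parse_dimensions(text: str) -> Tuple[int, int]:
    """Parse 'A_Rows,B_Cols' (for example '2,2') into a pair of positive integers."""
    parts = text.strip().split(",")
    if len(parts) != 2:
        raise ValueError(f"expected 'A_Rows,B_Cols', got {text!r}")
    # int() raises ValueError on non-numeric fields, which is the desired failure mode.
    a_rows, b_cols = (int(part) for part in parts)
    if a_rows <= 0 or b_cols <= 0:
        raise ValueError("dimensions must be positive")
    return a_rows, b_cols


dims = parse_dimensions("2,2\n")
```

Passing the contents of `cache.txt` to such a function would surface a malformed file before the pipeline produces misleading output.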

2. Setup & Deployment

Clone the Repository: Start by cloning the project to your local workstation:
```
git clone https://github.com/Amey-Thakur/HADOOP.git
cd HADOOP
```
Initialize Environment: Navigate to the Source Code directory where the computational logic resides:
```
cd "Source Code"
```
Configure Matrix Dimensions: Verify that cache.txt is configured with the correct dimensions (e.g., 2,2 for a 2x2 matrix product):
```
# Format: [A_Rows],[B_Cols]
echo "2,2" > cache.txt
```
Execute Distributed Logic: Simulate the MapReduce pipeline by piping the input through the Mapper and Reducer:
```
cat input.txt | python mapper.py | sort -k1,1 | python reducer.py
```
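To sanity-check the pipeline output, the result can be compared against a direct, single-machine multiplication. The sketch below is a plain reference implementation, not part of this repository, and the sample matrices are illustrative values rather than the contents of `input.txt`.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows, using the textbook triple loop."""
    if not a or not b or len(a[0]) != len(b):
        raise ValueError("inner dimensions must match")
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


# Illustrative 2x2 inputs; substitute the matrices encoded in your own input file.
expected = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Each entry of `expected` should match the value the reducer reports for the corresponding result cell.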

Tip

Distributed Matrix Multiplication | MapReduce Simulation

This 7th Semester Big Data Analytics project can be run as an interactive simulation of the multi-stage MapReduce pipeline (Map -> Shuffle -> Reduce) for distributed matrix arithmetic.

Launch Interactive Notebook

Usage Guidelines

This repository is openly shared to support learning and knowledge exchange across the academic community.

For Students
Use this project as reference material for understanding distributed data representations, intermediate key mapping, and the practical implementation of the MapReduce paradigm. The source code is available for study to facilitate self-paced learning and exploration of parallel computing patterns.

For Educators
This project may serve as a practical lab example or supplementary teaching resource for Big Data Analytics (CSDLO7032 / CSL704). Attribution is appreciated when utilizing content.

For Researchers
The documentation and design approach may provide insights into academic project structuring and the implementation of distributed arithmetic architectures.

License

This repository and all its creative and technical assets are made available under the MIT License. See the LICENSE file for complete terms.

Note

Summary: You are free to use, modify, and distribute this content for any purpose, including commercially, as long as the original copyright notice and license text are included with all copies or substantial portions of it.

About This Repository

Created & Maintained by: Amey Thakur
Academic Journey: Bachelor of Engineering in Computer Engineering (2018-2022)
Institution: Terna Engineering College, Navi Mumbai
University: University of Mumbai

This project, HADOOP Matrix Multiplication, was developed during my early explorations in Big Data Analytics. It uses the MapReduce paradigm to build a scalable, parallelized solution to matrix multiplication.

Connect: GitHub · LinkedIn · ORCID

Grateful acknowledgment to the faculty members of the Department of Computer Engineering at Terna Engineering College for their guidance and instruction in Big Data Analytics. Their expertise in distributed systems and parallel processing helped me build a solid understanding of distributed computation.

Special thanks to the mentors and peers whose encouragement, discussions, and support contributed meaningfully to this learning experience.


🔬 Big Data Analytics Laboratory · 🐘 HADOOP

🎓 Computer Engineering Repository

Computer Engineering (B.E.) - University of Mumbai

Semester-wise curriculum, laboratories, projects, and academic notes.
