HADOOP | Matrix Multiplication

License: MIT  ·  Developed by Amey Thakur

An implementation of Matrix Multiplication using the MapReduce paradigm in Python, providing a foundation for distributed data processing and large-scale computational arithmetic.

Source Code  ·  Technical Specification  ·  Google Colaboratory  ·  Live Demo


Authors  ·  Overview  ·  Features  ·  Structure  ·  Results  ·  Quick Start  ·  Usage Guidelines  ·  License  ·  About  ·  Acknowledgments


Authors

Terna Engineering College | Computer Engineering | Batch of 2022

Amey Thakur

ORCID

Overview

HADOOP | Matrix Multiplication is a study project on the MapReduce paradigm. Developed as part of coursework in Big Data Analytics, it applies parallel processing to large-scale matrix multiplication across distributed environments.

The project uses Python 3 to simulate the partitioning and aggregation of data. Each intermediate record is keyed by the coordinates of the result cell it contributes to, so every cell of the product can be computed independently in a distributed architecture.

Tip

Hadoop Streaming Integration

This implementation is designed for compatibility with Hadoop Streaming, utilizing standard UNIX pipes (stdin/stdout). The Mapper partitions Matrix A and B elements into intermediate cell-based keys, while the Reducer aggregates these batches to compute the final dot product for the target matrix.
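
As an illustration of the cell-based keys described above, here is a minimal sketch of the map step. It is not the repository's mapper.py: the function name `map_element`, its parameters, and the `i,k` and `matrix,j,value` string formats are assumptions made for this example. Each element of A is emitted once per column of B, and each element of B once per row of A, so every result cell receives exactly the values it needs.

```python
def map_element(matrix, row, col, value, a_rows, b_cols):
    """Yield (key, value) pairs for one matrix element (hypothetical format).

    The key "i,k" names the target cell of C = A x B. The value
    "matrix,j,number" carries the shared inner index j.
    """
    if matrix == "A":
        # A[i][j] contributes to every cell in row i of the result.
        for k in range(b_cols):
            yield f"{row},{k}", f"A,{col},{value}"
    elif matrix == "B":
        # B[j][k] contributes to every cell in column k of the result.
        for i in range(a_rows):
            yield f"{i},{col}", f"B,{row},{value}"
    else:
        raise ValueError(f"unknown matrix label: {matrix!r}")


# A[0][1] = 2.0 in a 2x2 product feeds result cells (0,0) and (0,1).
pairs_a = list(map_element("A", 0, 1, 2.0, a_rows=2, b_cols=2))

# B[1][0] = 7.0 feeds result cells (0,0) and (1,0).
pairs_b = list(map_element("B", 1, 0, 7.0, a_rows=2, b_cols=2))
```

In a Hadoop Streaming mapper, each pair would typically be printed to stdout as one line with the key and value separated by a tab.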


Features

| Feature | Description |
|---|---|
| Parallel Computing | Implements distributed matrix multiplication logic via MapReduce. |
| Coordinate Mapping | Mapper dynamically expands elements based on target result cell coordinates. |
| Efficient Aggregation | Reducer performs memory-efficient dot product calculations. |
| Local Simulation | Supports local execution using standard UNIX pipe-based simulations. |
| Archival Quality | Integrated scholarly citation metadata and technical specifications. |
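
To make the aggregation step concrete, the sketch below shows one way a reducer could compute a dot product per result cell from key-sorted input. The function name `reduce_sorted` and the `key<TAB>matrix,j,value` line format are assumptions for illustration, not the actual contents of reducer.py. Because the input arrives grouped by key, only the values for one cell are held in memory at a time.

```python
from itertools import groupby


def reduce_sorted(lines):
    """Yield (cell_key, dot_product) from key-sorted lines (hypothetical format)."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        a_vals, b_vals = {}, {}
        for _, payload in group:
            matrix, j, value = payload.split(",")
            target = a_vals if matrix == "A" else b_vals
            target[int(j)] = float(value)
        # Only inner indices present on both sides contribute to the sum.
        total = sum(a_vals[j] * b_vals[j] for j in a_vals.keys() & b_vals.keys())
        yield key, total


# First row of A = [[1, 2], [3, 4]] against both columns of B = [[5, 6], [7, 8]].
sorted_lines = [
    "0,0\tA,0,1", "0,0\tA,1,2", "0,0\tB,0,5", "0,0\tB,1,7",
    "0,1\tA,0,1", "0,1\tA,1,2", "0,1\tB,0,6", "0,1\tB,1,8",
]
cells = list(reduce_sorted(sorted_lines))
# cells == [("0,0", 19.0), ("0,1", 22.0)]
```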

Tech Stack

  • Language: Python 3.x
  • Framework: MapReduce (Distributed Computing)
  • Tooling: Hadoop Streaming API
  • Batch Processing: Distributed Matrix Arithmetic

Project Structure

HADOOP/
│
├── docs/                            # Technical Documentation
│   └── SPECIFICATION.md             # Architecture & Flow Specification
│
├── Source Code/                     # Primary Application Layer
│   ├── HADOOP.ipynb                 # Interactive Experimental Environment
│   ├── mapper.py                    # Map Phase Logic (Partitioning)
│   ├── reducer.py                   # Reduce Phase Logic (Aggregation)
│   ├── input.txt                    # Sample Matrix Input Data
│   └── cache.txt                    # Dimension Configuration Cache
│
├── .gitattributes                   # Git configuration
├── .gitignore                       # Git exclusion manifest
├── CITATION.cff                     # Scholarly Citation Metadata
├── codemeta.json                    # Machine-Readable Project Metadata
├── LICENSE                          # MIT License Terms
├── README.md                        # Comprehensive Archival Entrance
└── SECURITY.md                      # Security Policy & Protocol

System Architecture & Process Flow

Architectural Logic (Map -> Shuffle -> Reduce)

graph TD
    Input[("Input Data (Matrix A & B)")] -->|Reads| Mapper["Mapper (mapper.py)"]
    Cache[("Cache (Matrix Dimensions)")] -->|Configures| Mapper
    
    subgraph Map_Phase ["Map Phase"]
        Mapper -->|Partitions| Intermediate["Intermediate Key-Value Pairs"]
        Intermediate -->|Emits| Shuffle["Shuffle & Sort"]
    end
    
    subgraph Reduce_Phase ["Reduce Phase"]
        Shuffle -->|Groups by Key| Reducer["Reducer (reducer.py)"]
        Reducer -->|Dot Product| Result["Result Aggregator"]
    end
    
    Result -->|Writes| Output[("Output (Matrix Product)")]

Quick Start

1. Prerequisites

  • Runtime: Python 3.x installed on your workstation.
  • Environment: Any modern standards-compliant Unix-like environment for Python execution.

Warning

Local Execution

While the project is designed for Hadoop, it can be simulated locally using standard UNIX pipes. Ensure that input.txt and cache.txt are correctly configured in the Source Code directory before execution.

2. Setup & Deployment

  1. Clone the Repository: Start by cloning the project to your local workstation:

    git clone https://github.com/Amey-Thakur/HADOOP.git
    cd HADOOP
  2. Initialize Environment: Navigate to the Source Code directory where the computational logic resides:

    cd "Source Code"
  3. Configure Matrix Dimensions: Verify that cache.txt is configured with the correct dimensions (e.g., 2,2 for a 2x2 matrix product):

    # Format: [A_Rows],[B_Cols]
    echo "2,2" > cache.txt
  4. Execute Distributed Logic: Simulate the MapReduce pipeline by piping the input through the Mapper and Reducer:

    cat input.txt | python mapper.py | sort -k1,1 | python reducer.py
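
The two supporting pieces of the pipeline above, the dimension cache and the `sort -k1,1` stage, can be sketched in Python as follows. Both helper names (`parse_dimensions`, `shuffle`) and the tab-separated record layout are assumptions for this example. The point is that sorting by the first field places all records for one result cell next to each other, which is what lets a streaming reducer process one key at a time.

```python
def parse_dimensions(text):
    """Parse 'A_Rows,B_Cols' (the cache.txt format shown above) into two ints."""
    a_rows, b_cols = (int(part) for part in text.strip().split(","))
    return a_rows, b_cols


def shuffle(lines):
    """Order mapper output by its first tab-separated field.

    Similar in effect to `sort -k1,1`; Python's sort is stable, so records
    sharing a key keep their original relative order.
    """
    return sorted(lines, key=lambda line: line.split("\t", 1)[0])


dims = parse_dimensions("2,2\n")

# Hypothetical mapper output, interleaved across two result cells.
mapper_output = ["0,1\tA,0,1", "0,0\tB,0,5", "0,1\tB,0,6", "0,0\tA,0,1"]
grouped = shuffle(mapper_output)
# grouped == ["0,0\tB,0,5", "0,0\tA,0,1", "0,1\tA,0,1", "0,1\tB,0,6"]
```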

Tip

Distributed Matrix Multiplication | MapReduce Simulation

Run this 7th Semester Big Data Analytics project interactively: the notebook walks through the multi-stage MapReduce pipeline (Map -> Shuffle -> Reduce) used for distributed matrix multiplication.

Launch Interactive Notebook


Usage Guidelines

This repository is openly shared to support learning and knowledge exchange across the academic community.

For Students
Use this project as reference material for understanding distributed data representations, intermediate key mapping, and the practical implementation of the MapReduce paradigm. The source code is available for study to facilitate self-paced learning and exploration of parallel computing patterns.

For Educators
This project may serve as a practical lab example or supplementary teaching resource for Big Data Analytics (CSDLO7032 / CSL704). Attribution is appreciated when utilizing content.

For Researchers
The documentation and design approach may provide insights into academic project structuring and the implementation of distributed arithmetic architectures.


License

This repository and all its creative and technical assets are made available under the MIT License. See the LICENSE file for complete terms.

Note

Summary: You are free to use, copy, modify, and distribute this content for any purpose, including commercially, provided the copyright notice and the license text are included in all copies or substantial portions of the work.

Copyright © 2021 Amey Thakur


About This Repository

Created & Maintained by: Amey Thakur
Academic Journey: Bachelor of Engineering in Computer Engineering (2018-2022)
Institution: Terna Engineering College, Navi Mumbai
University: University of Mumbai

This project implements matrix multiplication on Hadoop, developed during my early explorations in Big Data Analytics. It highlights the use of the MapReduce paradigm to build scalable, parallelized computational solutions.

Connect: GitHub  ·  LinkedIn  ·  ORCID

Grateful acknowledgment to the faculty members of the Department of Computer Engineering at Terna Engineering College for their guidance and instruction in Big Data Analytics. Their expertise in distributed systems and parallel processing helped me develop a strong understanding of professional computational development methodologies.

Special thanks to the mentors and peers whose encouragement, discussions, and support contributed meaningfully to this learning experience.

