Real Estate Price Predictor 🏡

A comprehensive machine learning project to predict median home values using exploratory data analysis, advanced feature engineering, and a production-ready API with interactive UI.

Project Overview

This project provides an end-to-end solution for real estate price prediction:

Data Analysis: Comprehensive EDA with distribution analysis and correlation studies
ML Pipeline: Trained model with feature engineering and data preprocessing
Production API: FastAPI-based REST API for price predictions
Interactive UI: Streamlit web application for user-friendly predictions
Containerization: Docker support for easy deployment

Dataset

Source: Boston Housing Dataset

Features (13 input features):

CRIM - Per capita crime rate by town
ZN - Proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS - Proportion of non-retail business acres per town
CHAS - Charles River dummy variable (1 if bounds river, 0 otherwise)
NOX - Nitric oxides concentration (parts per 10 million)
RM - Average number of rooms per dwelling
AGE - Proportion of owner-occupied units built prior to 1940
DIS - Weighted distances to five Boston employment centers
RAD - Index of accessibility to radial highways
TAX - Full-value property-tax rate per $10,000
PTRATIO - Pupil-teacher ratio by town
B - 1000(Bk - 0.63)² where Bk is the proportion of Black residents by town
LSTAT - % lower status of the population

Target:

MEDV - Median value of owner-occupied homes in $1000's
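
Before any analysis, it can help to confirm that a loaded file actually contains the feature columns and the target listed above. The following is a minimal sketch, assuming the CSV header uses exactly the names in this README; the helper `missing_columns` is hypothetical and not part of the repository.

```python
import pandas as pd

# Column names as listed in this README; assumed to match the CSV header.
FEATURES = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
            "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]
TARGET = "MEDV"


def missing_columns(df: pd.DataFrame) -> list:
    """Return the expected columns that are absent from df."""
    expected = FEATURES + [TARGET]
    return [col for col in expected if col not in df.columns]


# Tiny stand-in frame (not real data) that lacks most columns on purpose.
demo = pd.DataFrame({"CRIM": [0.1], "RM": [6.2], "MEDV": [24.0]})
print(missing_columns(demo))
```

An empty list means every expected column is present.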

Project Structure

real-estate-predictor/
├── artifacts/
│   └── house_price_pipeline.pkl    # Trained model pipeline (scikit-learn)
├── data/
│   ├── data.csv                    # Raw dataset
│   ├── train.csv                   # Training set
│   ├── test.csv                    # Test set
│   └── processed/
│       ├── train_processed.csv     # Processed training data
│       └── test_processed.csv      # Processed test data
├── src/
│   ├── data_ingestion.py           # Data loading utilities
│   ├── preprocessing.py            # Data cleaning and transformation
│   ├── feature_engineering.py      # Feature creation and engineering
│   ├── eda.py                      # EDA analysis functions
│   ├── models.py                   # Model definitions
│   └── model_trainer.py            # Model training and evaluation
├── notebooks/
│   └── eda_comprehensive.ipynb     # Interactive EDA notebook
├── tests/                          # Unit tests
├── main.py                         # FastAPI application
├── streamlit_app.py                # Streamlit web interface
├── Dockerfile                      # Docker containerization
├── docker-compose.yml              # Docker Compose configuration
├── requirements.txt                # Python dependencies
└── README.md                       # This file

Key Findings from EDA

Data Characteristics

Missing Values: 5 missing values in RM column
Skewness Issues:
- CRIM (Crime rate): Right-skewed (5.25) - extreme outliers
- B: Right-skewed (3.43) - extreme outliers
- ZN: Right-skewed
- LSTAT, PTRATIO: Left-skewed
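
Skewness figures like those above can be computed with pandas. This is a sketch only: the helper name `skew_report` and the cutoff of 1.0 are assumptions for illustration, and the demo frame is synthetic stand-in data, not the project's dataset.

```python
import numpy as np
import pandas as pd


def skew_report(df: pd.DataFrame, threshold: float = 1.0) -> pd.DataFrame:
    """Return per-column sample skewness with a coarse direction label."""
    skew = df.select_dtypes("number").skew()
    report = pd.DataFrame({"skewness": skew})
    # Positive skew means a long right tail, negative skew a long left tail.
    report["direction"] = np.where(
        skew > threshold, "right-skewed",
        np.where(skew < -threshold, "left-skewed", "roughly symmetric"),
    )
    return report.sort_values("skewness", ascending=False)


# Synthetic stand-in data: a heavy right tail next to a symmetric column.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "CRIM": rng.lognormal(mean=0.0, sigma=1.5, size=500),
    "RM": rng.normal(loc=6.3, scale=0.7, size=500),
})
print(skew_report(demo))
```

In the notebook, the same function could be applied to the loaded `df` instead of `demo`.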

Correlations with Price (MEDV)

RM (Rooms): +0.67 correlation - Strongest positive relationship
- More rooms = Higher price
LSTAT (Lower status %): -0.74 correlation - Strong negative relationship
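
Correlations with the target can be ranked in a few lines of pandas. The sketch below assumes Pearson correlation (the pandas default) and uses synthetic stand-in data; the helper `target_correlations` is hypothetical.

```python
import numpy as np
import pandas as pd


def target_correlations(df: pd.DataFrame, target: str = "MEDV") -> pd.Series:
    """Pearson correlation of each numeric column with the target, strongest first."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    return corr.sort_values(key=lambda s: s.abs(), ascending=False)


# Synthetic stand-in data: price rises with RM and falls with LSTAT.
rng = np.random.default_rng(0)
rm = rng.normal(6.3, 0.7, size=300)
lstat = rng.uniform(2, 35, size=300)
demo = pd.DataFrame({
    "RM": rm,
    "LSTAT": lstat,
    "MEDV": 9 * rm - 0.6 * lstat + rng.normal(0, 2, size=300),
})
print(target_correlations(demo))
```

Sorting by absolute value keeps strong negative relationships, such as LSTAT, near the top of the list.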

Data Issues Identified

Right-skewed distributions in CRIM and B require log transformation
Outliers in crime rate (CRIM) and in B
Non-normal target variable distribution requiring transformation

Methodology

1. Exploratory Data Analysis (EDA)

Distribution analysis using histograms
Outlier detection with boxplots
Correlation matrix analysis
Skewness assessment
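
The outlier check that a boxplot performs visually can also be done numerically. This sketch applies the 1.5 × IQR whisker rule that matplotlib and seaborn boxplots use by default; the helper `iqr_outlier_mask` is hypothetical and the series is made-up demo data.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up values with one obvious outlier.
demo = pd.Series([1, 2, 3, 4, 5, 100], name="CRIM")
mask = iqr_outlier_mask(demo)
print(demo[mask])
```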

2. Data Preprocessing

Handle missing values
Apply log transformation to skewed features (CRIM, B, ZN)
Normalize/standardize features for model compatibility
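
The three preprocessing steps above can be combined in one scikit-learn transformer. This is a sketch under stated assumptions, not the contents of `src/preprocessing.py`: median imputation and `log1p` are choices made here for illustration (`log1p` is used because ZN contains zeros), and `build_preprocessor` is a hypothetical helper.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

SKEWED = ["CRIM", "B", "ZN"]  # features the README marks for log transformation


def build_preprocessor(columns) -> ColumnTransformer:
    """Impute, log-transform the skewed columns, then standardize everything."""
    skewed = [c for c in columns if c in SKEWED]
    other = [c for c in columns if c not in SKEWED]
    log_branch = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("log1p", FunctionTransformer(np.log1p)),  # log(1 + x) is defined at x = 0
        ("scale", StandardScaler()),
    ])
    plain_branch = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    return ColumnTransformer([
        ("skewed", log_branch, skewed),
        ("other", plain_branch, other),
    ])


# Made-up rows, including one missing RM value.
demo = pd.DataFrame({
    "CRIM": [0.02, 0.1, 3.5, 0.6, 12.0, 0.05],
    "ZN": [18.0, 0.0, 0.0, 12.5, 0.0, 80.0],
    "RM": [6.5, np.nan, 5.9, 7.1, 5.4, 6.8],
    "LSTAT": [4.9, 9.1, 17.2, 3.0, 24.5, 5.3],
})
Xt = build_preprocessor(demo.columns).fit_transform(demo)
print(Xt.shape)
```

Fitting the transformer on the training set only, then reusing it on the test set, avoids leaking test statistics into the imputer and scaler.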

3. Feature Engineering

Identify primary predictive features (RM, LSTAT)
Create derived features if needed
Handle categorical variables
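
As an illustration of derived features, the sketch below adds two columns built from the primary predictors. The column names `RM_SQ` and `LOG_LSTAT` and the choice of transformations are assumptions made for this example; they are not taken from `src/feature_engineering.py`.

```python
import numpy as np
import pandas as pd


def add_derived_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with two illustrative derived columns."""
    out = df.copy()
    out["RM_SQ"] = out["RM"] ** 2               # lets a linear model bend with room count
    out["LOG_LSTAT"] = np.log1p(out["LSTAT"])   # compresses large LSTAT values
    return out


# Made-up rows for demonstration.
demo = pd.DataFrame({"RM": [6.0, 7.5], "LSTAT": [4.0, 20.0]})
print(add_derived_features(demo))
```

CHAS is already encoded as 0/1, so it needs no further categorical handling in this sketch.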

4. Model Training

Linear Regression baseline
Evaluate with appropriate metrics
Prevent bias from outliers and skewed distributions
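
A baseline along these lines could look as follows. This is a sketch, not the code in `src/model_trainer.py`: it assumes a log-transformed target (via scikit-learn's `TransformedTargetRegressor`) as one way to handle the non-normal MEDV distribution, and it runs on synthetic stand-in data. The helper `fit_baseline` is hypothetical.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def fit_baseline(X, y, seed: int = 0):
    """Fit linear regression on log1p(y) and score it on a 20% hold-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = TransformedTargetRegressor(
        regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1
    )
    model.fit(X_train, y_train)
    pred = model.predict(X_test)  # predictions are mapped back to the original scale
    metrics = {
        "r2": r2_score(y_test, pred),
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
        "mae": float(mean_absolute_error(y_test, pred)),
    }
    return model, metrics


# Synthetic stand-in data with two features playing the roles of RM and LSTAT.
rng = np.random.default_rng(0)
rm = rng.normal(6.3, 0.7, size=400)
lstat = rng.uniform(2, 35, size=400)
X = np.column_stack([rm, lstat])
y = 9 * rm - 0.6 * lstat + rng.normal(0, 2, size=400)
model, metrics = fit_baseline(X, y)
print(metrics)
```

Because the target is transformed inside the estimator, the reported errors stay in the original units ($1000s for MEDV).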

Quick Start

Prerequisites

pip install -r requirements.txt

Run Analysis

Exploratory Analysis:

jupyter notebook notebooks/eda_comprehensive.ipynb

Full Pipeline:
```
python main.py
```

Usage Examples

Loading Data (in `notebooks/eda_comprehensive.ipynb`)

import pandas as pd
df = pd.read_csv("../data/data.csv")

Streamlit app

streamlit run streamlit_app.py

Start the FastAPI backend (main.py) before making predictions in the Streamlit app; the app sends its requests to http://localhost:8000/predict.

Docker files

When running with Docker Compose, change the request URL in streamlit_app.py from http://localhost:8000/predict to http://backend:8000/predict. Inside the Compose network, containers reach each other by service name (here, backend) rather than localhost.
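
One way to avoid editing the URL by hand is to read it from an environment variable. The sketch below is an assumption about how streamlit_app.py could be structured, not its current code; the variable name `PREDICT_URL` and the helper `get_predict_url` are hypothetical.

```python
import os

# Local default; matches the URL used when running without Docker.
DEFAULT_PREDICT_URL = "http://localhost:8000/predict"


def get_predict_url(env=None) -> str:
    """Return the prediction endpoint, preferring the PREDICT_URL variable if set."""
    env = os.environ if env is None else env
    return env.get("PREDICT_URL", DEFAULT_PREDICT_URL)


print(get_predict_url({}))
print(get_predict_url({"PREDICT_URL": "http://backend:8000/predict"}))
```

With this in place, the Streamlit service in docker-compose.yml could set `PREDICT_URL` to http://backend:8000/predict, while local runs fall back to localhost.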

Visualizing Room vs Price Relationship

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.regplot(x='RM', y='MEDV', data=df, scatter_kws={'alpha': 0.6}, line_kws={'color': 'red'})
plt.xlabel('Average Number of Rooms per Dwelling (RM)')
plt.ylabel('Median Home Value in $1000s (MEDV)')
plt.title('Room Count vs House Price with Regression Line')
plt.grid(True, alpha=0.3)
plt.show()

Model Performance

Target variable transformation recommended due to non-normal distribution
Log transformation applied to highly skewed features
Expected improved model performance after addressing skewness

Files Description

| File | Purpose |
|------|---------|
| `main.py` | FastAPI application; entry point for the full pipeline |
| `src/eda.py` | Automated EDA functions |
| `notebooks/eda_comprehensive.ipynb` | Interactive exploration and visualization |
| `src/data_ingestion.py` | Data loading utilities |
| `src/preprocessing.py` | Data cleaning and transformation |
| `src/feature_engineering.py` | Feature creation and selection |
| `src/model_trainer.py` | Model training and evaluation |

Next Steps

Apply transformations to normalize skewed distributions
Train regression models with processed features
Cross-validate model performance
Evaluate using metrics (R², RMSE, MAE)
Hyperparameter tuning for optimal performance
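
The cross-validation and metric steps could be sketched as below. It assumes 5-fold shuffled K-fold and a plain `LinearRegression`, and it runs on synthetic stand-in data; `cv_summary` is a hypothetical helper, and hyperparameter tuning is not covered here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate


def cv_summary(model, X, y, folds: int = 5, seed: int = 0) -> dict:
    """Mean R², RMSE and MAE across shuffled K-fold splits."""
    cv = KFold(n_splits=folds, shuffle=True, random_state=seed)
    scores = cross_validate(
        model, X, y, cv=cv,
        scoring={
            "r2": "r2",
            "rmse": "neg_root_mean_squared_error",
            "mae": "neg_mean_absolute_error",
        },
    )
    # scikit-learn negates error metrics so that larger is always better; undo that.
    return {
        "r2": float(scores["test_r2"].mean()),
        "rmse": float(-scores["test_rmse"].mean()),
        "mae": float(-scores["test_mae"].mean()),
    }


# Synthetic stand-in data with a linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X @ np.array([4.0, -2.0, 1.0]) + rng.normal(0, 1, size=300)
summary = cv_summary(LinearRegression(), X, y)
print(summary)
```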

Author

Saqib Iqbal

License

This project uses the Boston Housing Dataset for educational purposes.
