A comprehensive machine learning project to predict median home values using exploratory data analysis, advanced feature engineering, and a production-ready API with interactive UI.
This project provides an end-to-end solution for real estate price prediction:
- Data Analysis: Comprehensive EDA with distribution analysis and correlation studies
- ML Pipeline: Trained model with feature engineering and data preprocessing
- Production API: FastAPI-based REST API for price predictions
- Interactive UI: Streamlit web application for user-friendly predictions
- Containerization: Docker support for easy deployment
Source: Boston Housing Dataset
Features (13 input features):
- CRIM - Per capita crime rate by town
- ZN - Proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS - Proportion of non-retail business acres per town
- CHAS - Charles River dummy variable (1 if bounds river, 0 otherwise)
- NOX - Nitric oxides concentration (parts per 10 million)
- RM - Average number of rooms per dwelling
- AGE - Proportion of owner-occupied units built prior to 1940
- DIS - Weighted distances to five Boston employment centers
- RAD - Index of accessibility to radial highways
- TAX - Full-value property-tax rate per $10,000
- PTRATIO - Pupil-teacher ratio by town
- B - 1000(Bk - 0.63)² where Bk is proportion of blacks by town
- LSTAT - % lower status of the population
Target:
- MEDV - Median value of owner-occupied homes in $1000's
real-estate-predictor/
├── artifacts/
│   └── house_price_pipeline.pkl    # Trained model pipeline (scikit-learn)
├── data/
│   ├── data.csv                    # Raw dataset
│   ├── train.csv                   # Training set
│   ├── test.csv                    # Test set
│   └── processed/
│       ├── train_processed.csv     # Processed training data
│       └── test_processed.csv      # Processed test data
├── src/
│   ├── data_ingestion.py           # Data loading utilities
│   ├── preprocessing.py            # Data cleaning and transformation
│   ├── feature_engineering.py      # Feature creation and engineering
│   ├── eda.py                      # EDA analysis functions
│   ├── models.py                   # Model definitions
│   └── model_trainer.py            # Model training and evaluation
├── notebooks/
│   └── eda_comprehensive.ipynb     # Interactive EDA notebook
├── tests/                          # Unit tests
├── main.py                         # FastAPI application
├── streamlit_app.py                # Streamlit web interface
├── Dockerfile                      # Docker containerization
├── docker-compose.yml              # Docker Compose configuration
├── requirements.txt                # Python dependencies
└── README.md                       # This file
- Missing Values: 5 missing values in RM column
- Skewness Issues:
  - CRIM (Crime rate): Right-skewed (skewness 5.25), with extreme outliers
  - B: Right-skewed (skewness 3.43), with extreme outliers
  - ZN: Right-skewed
  - LSTAT, PTRATIO: Left-skewed
- RM (Rooms): +0.67 correlation with MEDV, the strongest positive relationship
  - More rooms = Higher price
- LSTAT (Lower status %): -0.74 correlation with MEDV, a strong negative relationship
- Right-skewed distributions in CRIM and B require log transformation
- Outliers in crime rate and racial demographics
- Non-normal target variable distribution requiring transformation
- Distribution analysis using histograms
- Outlier detection with boxplots
- Correlation matrix analysis
- Skewness assessment
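The skewness, correlation, and boxplot-style outlier checks listed here can be sketched with pandas alone. This is a minimal illustration on synthetic stand-in data (the values are made up, not taken from the project's dataset); only the column names CRIM, RM, and MEDV come from the feature list.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (hypothetical values, not the real dataset).
rng = np.random.default_rng(0)
n = 500
rm = rng.normal(6.3, 0.7, n)
crim = rng.lognormal(mean=-1.0, sigma=1.5, size=n)  # heavily right-skewed by construction
medv = 9.0 * rm - 35.0 + rng.normal(0.0, 3.0, n)
df = pd.DataFrame({"CRIM": crim, "RM": rm, "MEDV": medv})

# Skewness per column: values well above 0 indicate a long right tail.
skewness = df.skew().sort_values(ascending=False)

# Pearson correlation of each feature with the target.
target_corr = df.corr()["MEDV"].drop("MEDV").sort_values(ascending=False)

# IQR rule, the same criterion a boxplot uses to mark outlier points.
q1, q3 = df["CRIM"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["CRIM"] < q1 - 1.5 * iqr) | (df["CRIM"] > q3 + 1.5 * iqr)]

print(skewness)
print(target_corr)
print(f"CRIM outliers by IQR rule: {len(outliers)}")
```

On the real data, the same three calls would be run on the DataFrame loaded from data/data.csv.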
- Handle missing values
- Apply log transformation to skewed features (CRIM, B, ZN)
- Normalize/standardize features for model compatibility
- Identify primary predictive features (RM, LSTAT)
- Create derived features if needed
- Handle categorical variables
- Linear Regression baseline
- Evaluate with appropriate metrics
- Prevent bias from outliers and skewed distributions
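The preprocessing and baseline steps above could be combined into a single scikit-learn pipeline, roughly as follows. This is a sketch under stated assumptions, not the project's actual code in src/: it uses synthetic data, median imputation, and `np.log1p` (log of 1 + x, chosen here because it tolerates zeros) in place of a plain log.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Synthetic stand-in data (hypothetical values, not the real dataset).
rng = np.random.default_rng(42)
n = 400
X = pd.DataFrame({
    "CRIM": rng.lognormal(-1.0, 1.5, n),
    "ZN": rng.exponential(10.0, n),
    "RM": rng.normal(6.3, 0.7, n),
    "LSTAT": rng.uniform(2.0, 35.0, n),
})
y = 9.0 * X["RM"] - 0.6 * X["LSTAT"] - 2.0 * np.log1p(X["CRIM"]) + rng.normal(0.0, 2.0, n)

# Blank out a few RM values to mimic the missing entries found during EDA.
X.loc[X.sample(5, random_state=0).index, "RM"] = np.nan

skewed_cols = ["CRIM", "ZN"]
other_cols = ["RM", "LSTAT"]

# Skewed features: impute, compress the right tail, then standardize.
skewed_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("log", FunctionTransformer(np.log1p)),
    ("scale", StandardScaler()),
])
# Remaining features: impute and standardize only.
other_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("skewed", skewed_steps, skewed_cols),
    ("other", other_steps, other_cols),
])
model = Pipeline([("preprocess", preprocess), ("regressor", LinearRegression())])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
pred = model.predict(X_test)

r2 = r2_score(y_test, pred)
mae = mean_absolute_error(y_test, pred)
print(f"R2={r2:.3f}  MAE={mae:.3f}")
```

Keeping imputation and scaling inside the pipeline means their statistics are learned from the training split only, so nothing leaks from the test set.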
pip install pandas numpy scikit-learn matplotlib seaborn
- Exploratory Analysis:
jupyter notebook src/eda_comprehensive.ipynb
- Full Pipeline:
python main.py
import pandas as pd
df = pd.read_csv("../data/data.csv")
streamlit run streamlit_app.py
Start the FastAPI backend before requesting predictions from the Streamlit app.
When running under Docker Compose, change the request URL in streamlit_app.py from http://localhost:8000/predict to http://backend:8000/predict so that the containers can talk to each other.
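One way to avoid editing streamlit_app.py by hand for each environment is to read the endpoint from an environment variable. The sketch below is only an illustration: the variable name `PREDICT_URL`, both helper functions, and the example payload fields are assumptions, since the API's actual request schema is not shown here. It builds the request without sending it.

```python
import json
import os
import urllib.request

DEFAULT_PREDICT_URL = "http://localhost:8000/predict"


def resolve_predict_url(env=None):
    """Return the endpoint, preferring the PREDICT_URL variable (hypothetical name)."""
    env = os.environ if env is None else env
    return env.get("PREDICT_URL", DEFAULT_PREDICT_URL)


def build_predict_request(features, env=None):
    """Build (but do not send) a JSON POST request for the /predict endpoint."""
    body = json.dumps(features).encode("utf-8")
    return urllib.request.Request(
        resolve_predict_url(env),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical payload; the real API may expect different fields.
payload = {"RM": 6.5, "LSTAT": 4.9}

local = build_predict_request(payload, env={})
docker = build_predict_request(payload, env={"PREDICT_URL": "http://backend:8000/predict"})

print(local.full_url)
print(docker.full_url)
```

With this approach, the Docker setup would set the variable for the Streamlit container, and local runs would fall back to the localhost default.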
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(10, 6))
sns.regplot(x='RM', y='MEDV', data=df, scatter_kws={'alpha': 0.6}, line_kws={'color': 'red'})
plt.xlabel('Average Number of Rooms per Dwelling (RM)')
plt.ylabel('Median Home Value in $1000s (MEDV)')
plt.title('Room Count vs House Price with Regression Line')
plt.grid(True, alpha=0.3)
plt.show()
- Target variable transformation recommended due to non-normal distribution
- Log transformation applied to highly skewed features
- Expected improved model performance after addressing skewness
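A target transformation can be applied without changing the rest of the training code by wrapping the regressor in scikit-learn's `TransformedTargetRegressor`. The example below is a sketch on synthetic data with a deliberately skewed target. The choice of `np.log` / `np.exp` is an assumption that requires a strictly positive target, which holds for MEDV.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): the target is the exponential of a linear
# signal, so it is right-skewed and strictly positive by construction.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = np.exp(1.0 + 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(0.0, 0.1, 300))

# Baseline: fit the skewed target directly.
plain = LinearRegression().fit(X, y)

# Wrapped: fit on log(y), with predictions mapped back through exp automatically.
wrapped = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,
    inverse_func=np.exp,
).fit(X, y)

# Both scores are R^2 measured on the original (untransformed) scale.
plain_r2 = plain.score(X, y)
wrapped_r2 = wrapped.score(X, y)
print(f"plain R2={plain_r2:.3f}  log-target R2={wrapped_r2:.3f}")
```

Because the wrapper inverts the transform inside `predict`, downstream code such as the API still receives values in $1000s.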
| File | Purpose |
|---|---|
| main.py | Entry point for the full pipeline |
| src/eda.py | Automated EDA functions |
| src/eda_comprehensive.ipynb | Interactive exploration and visualization |
| src/data_ingestion.py | Data loading utilities |
| src/preprocessing.py | Data cleaning and transformation |
| src/feature_engineering.py | Feature creation and selection |
| src/model_trainer.py | Model training and evaluation |
- Apply transformations to normalize skewed distributions
- Train regression models with processed features
- Cross-validate model performance
- Evaluate using metrics (R², RMSE, MAE)
- Hyperparameter tuning for optimal performance
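The cross-validation, metric, and tuning steps above might look like the following with scikit-learn. This is a sketch on synthetic data; Ridge regression is swapped in purely because it has a hyperparameter (`alpha`) to tune, and the grid values are arbitrary assumptions rather than the project's settings.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (hypothetical values).
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(0.0, 1.0, 300)

model = make_pipeline(StandardScaler(), Ridge())
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn reports error metrics negated so that "higher is better".
scoring = {
    "r2": "r2",
    "rmse": "neg_root_mean_squared_error",
    "mae": "neg_mean_absolute_error",
}
scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
mean_r2 = scores["test_r2"].mean()
mean_rmse = -scores["test_rmse"].mean()
mean_mae = -scores["test_mae"].mean()
print(f"R2={mean_r2:.3f}  RMSE={mean_rmse:.3f}  MAE={mean_mae:.3f}")

# Grid search over the regularization strength, scored by RMSE.
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
search = GridSearchCV(
    model,
    param_grid={"ridge__alpha": alphas},
    cv=cv,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)
best_alpha = search.best_params_["ridge__alpha"]
print(f"best alpha: {best_alpha}")
```

Reusing the same `KFold` object for both calls keeps the folds identical, so the tuned and untuned scores are comparable.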
Saqib Iqbal
This project uses the Boston Housing Dataset for educational purposes.