krishnaura45/price-prophet

Price Prophet: An end-to-end MLOps Pipeline for House Price Prediction

Price Prophet is a production-grade machine learning pipeline built to predict house sale prices using a robust, scalable and reproducible MLOps framework.

Python ZenML MLflow Streamlit


FILES & STRUCTURE 📂

  • data/: Raw zipped data from Kaggle
  • extracted_data/: Ingested dataset
  • analysis/: Exploratory notebooks and analyzers
  • src/: Core modules for feature engineering, model building, evaluation
  • steps/: ZenML-defined step-wise modular pipeline stages
  • pipelines/: Training and deployment pipeline definitions
  • pipeline_runs/: Pipeline runs as DAG visualizations, exported from the ZenML dashboard
  • run_pipeline.py: Executes training pipeline
  • run_deployment.py: Executes deployment/inference pipeline
  • app.py: Streamlit interface for user-side predictions
  • sample_predict.py: Local REST inference - single sample
  • sample_batch_predict.py: Local REST inference - batch prediction
  • exported_model/: Artifacts of one of the best models manually saved via MLflow
  • requirements.txt: Python dependencies

IMPORTANT LINKS 🔗


INTRODUCTION

house-image
  • Accurate house price prediction is vital for real estate valuation, investment, and decision-making.
  • Traditional ML workflows often suffer from:
    • Poor reproducibility and pipeline modularity
    • Lack of production-readiness and deployment integration
    • Minimal tracking or model lifecycle management
  • Price Prophet addresses these gaps by building a clean, reproducible, and production-ready ML pipeline from ingestion to deployment.
  • Built with Python, ZenML, MLflow, and Streamlit, it ensures seamless orchestration, experiment tracking, deployment, and user-friendly inference.

PROBLEM DEFINITION

  • Manual Workflows: Traditional house price prediction lacks automation, requiring repetitive preprocessing, model training, and evaluation steps.
  • Pipeline Gaps: Most ML solutions stop at model accuracy, missing crucial components like deployment, tracking, and maintainability.
  • Lack of Production Readiness: Existing approaches don't support reproducible, scalable, or monitorable model deployment in real-world settings.
  • End-to-End MLOps: There is a clear need for a robust, automated pipeline integrating data handling, modeling, versioning, and serving with real-time inference.

OBJECTIVES 🧰

  • Ultimate Aim: Build an end-to-end MLOps pipeline.
  • Perform robust data processing and extensive feature engineering to maximize model performance.
  • Utilize and compare multiple regression strategies for price prediction.
  • Integrate MLOps tools like ZenML and MLflow.
  • Build a front-end application for user interaction and visualization.
  • Ensure production readiness by focusing on modularity, reproducibility, version control, and real-time prediction capability.

METHODOLOGY 🔧

Pipeline Workflow

image

Core ML Stages

| Stage | Description |
| --- | --- |
| Data Ingestion | Loads and extracts raw housing data from compressed archives (archive.zip). |
| Initial Preprocessing | Cleans missing values and duplicates; prepares the dataset for transformation. |
| Feature Engineering | Applies log-transformations and constructs domain-inspired features like Porch, Bath_total, and FinSF. |
| Outlier Handling | Identifies and removes extreme values from critical features such as SalePrice. |
| Data Splitting | Splits data into train/test sets using stratified sampling while preserving the target distribution. |
| Model Building | Trains a stacked ensemble using base models (XGBoost, LightGBM) with a meta-model (Linear Regression). |
| Model Evaluation | Computes RMSE, MSE, and R² metrics using MLflow logging and visualization. |
| Deployment Preparation | Logs the model artifacts and expected columns to MLflow for reproducible serving. |
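The Feature Engineering stage above can be sketched in plain pandas. The column names (FullBath, OpenPorchSF, etc.) follow the Kaggle house-prices dataset, and the exact formulas here are illustrative assumptions; the real logic lives in src/:

```python
import numpy as np
import pandas as pd


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative sketch of the feature-engineering stage.

    Assumes Kaggle house-prices column names; the formulas in the
    actual src/ modules may differ.
    """
    out = df.copy()
    # Log-transform the skewed target (log1p handles zeros safely).
    if "SalePrice" in out:
        out["SalePrice"] = np.log1p(out["SalePrice"])
    # Total bathrooms: half baths count as 0.5.
    out["Bath_total"] = (out["FullBath"] + 0.5 * out["HalfBath"]
                         + out["BsmtFullBath"] + 0.5 * out["BsmtHalfBath"])
    # Combined porch area across all porch types.
    out["Porch"] = (out["OpenPorchSF"] + out["EnclosedPorch"]
                    + out["3SsnPorch"] + out["ScreenPorch"])
    # Total finished square footage, below and above grade.
    out["FinSF"] = out["BsmtFinSF1"] + out["BsmtFinSF2"] + out["GrLivArea"]
    return out
```

In the repository this logic runs inside a ZenML step, so each transformation is tracked and reproducible.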

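The Model Building and Model Evaluation stages can be approximated with scikit-learn's StackingRegressor. This sketch substitutes two GradientBoostingRegressor instances for the XGBoost/LightGBM base learners (so it runs with scikit-learn alone) and keeps the Linear Regression meta-model and the RMSE/MSE/R² metrics; the synthetic data is only for demonstration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Stand-ins for the XGBoost/LightGBM base learners used in the pipeline.
base_models = [
    ("gbm_a", GradientBoostingRegressor(n_estimators=100, random_state=0)),
    ("gbm_b", GradientBoostingRegressor(n_estimators=100, max_depth=2, random_state=1)),
]
stack = StackingRegressor(estimators=base_models,
                          final_estimator=LinearRegression(), cv=5)

# Synthetic regression data in place of the housing features.
X, y = make_regression(n_samples=400, n_features=10, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

stack.fit(X_tr, y_tr)
pred = stack.predict(X_te)

# Same metrics the evaluation stage logs to MLflow.
mse = mean_squared_error(y_te, pred)
rmse = float(np.sqrt(mse))
r2 = r2_score(y_te, pred)
```

In the real pipeline these metrics would be recorded with `mlflow.log_metric` so runs are comparable in the MLflow UI.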
MLOps Stack

image

Model Deployment

  • Deployment via MLflow Model Deployer Service (Not suitable for Windows OS)
    image

  • Manual MLflow Model Serving via REST API (works on macOS/Windows)
    image

Inference

  • Batch Inference (Local REST API):

    • Once the model is served manually using MLflow, predictions can be made by sending input data (as JSON) via HTTP POST to the /invocations endpoint.
    • A sample_batch_predict.py script is used to load a .csv file, send data to the model server, and save predictions in predictions.csv.
  • Real-Time Inference (Streamlit Application):

    • A user-friendly UI built with Streamlit allows manual input or CSV uploads.
    • Sends the data to the same REST endpoint and displays predicted house prices instantly.
    • Supports downloading predictions and visualization inside the web app.
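The batch-inference flow above can be sketched as follows. The port matches the `mlflow models serve -p 1234` command from the setup steps; the helper names and file names are illustrative, not the actual sample_batch_predict.py:

```python
import json

import pandas as pd
import requests

# Port matches `mlflow models serve ... -p 1234` from the setup steps.
MLFLOW_URL = "http://127.0.0.1:1234/invocations"


def build_payload(df: pd.DataFrame) -> str:
    """Serialize a DataFrame in MLflow's 'dataframe_split' scoring format."""
    return json.dumps({
        "dataframe_split": {
            "columns": df.columns.tolist(),
            "data": df.values.tolist(),
        }
    })


def predict_batch(df: pd.DataFrame, url: str = MLFLOW_URL) -> list:
    """POST a batch of rows to the model server and return its predictions."""
    resp = requests.post(url, data=build_payload(df),
                         headers={"Content-Type": "application/json"})
    resp.raise_for_status()
    return resp.json()["predictions"]

# Usage (with the server running):
#   batch = pd.read_csv("batch.csv")
#   batch["SalePrice_pred"] = predict_batch(batch)
#   batch.to_csv("predictions.csv", index=False)
```

The Streamlit app sends requests to the same endpoint, so both inference modes share one serving process.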

RESULTS 📊

  • High Level EDA image

  • Comparative Model Evaluation Metrics
    image

  • Streamlit Application (Interface) image

  • Making predictions in the app using the second input mode image


INSTALLATION 🤖

To set up the project on your local machine, follow these steps:

  1. Clone the repository:
    git clone https://github.com/krishnaura45/price-prophet.git
    cd price-prophet
  2. Install dependencies:
    pip install -r requirements.txt
  3. Run the training pipeline:
    python run_pipeline.py
  4. Serve the model manually (use the MLflow UI to fetch the run ID):
    mlflow models serve -m "runs:/<your_run_id>/model" -p 1234 --no-conda
  5. Run the deployment pipeline:
    python run_deployment.py
  6. Run the Streamlit app:
    streamlit run app.py

CONTRIBUTING

  • Fork the repository.
  • Create a new branch.
  • Commit changes with clear messages.
  • Submit a pull request.
  • Ensure new features are tested and documented.

TECH STACK

Pandas Scikit-Learn NumPy Matplotlib Stacking Ensemble


FEATURES 🚀

  • 🔄 Modular ZenML Steps (each in steps/)
  • 🧐 Advanced EDA and feature insights (analysis/)
  • 🪤 Model evaluation with proper metrics
  • 🚪 Manual model deployment (you control what gets served)
  • 🔗 Streamlit App for UI-based input, visualization and download

FUTURE SCOPE 🔮

  • Cloud-Native Deployment: Containerize the pipeline using Docker and orchestrate via Kubernetes to enable scalable, consistent, and production-ready deployments across cloud platforms.
  • Drift Detection & AutoML: Implement data drift monitoring (e.g., with Evidently/WhyLabs) and integrate AutoML frameworks for continual model retraining and optimization.
  • Model Explainability: Enhance interpretability using SHAP or LIME and display visual explanations in Streamlit for better decision trust and transparency.

REFERENCES

  1. ZenML Docs - https://docs.zenml.io/
  2. MLflow Docs - https://mlflow.org/docs/latest/index.html
  3. CatBoost Documentation - https://catboost.ai/en/docs/

Contributors 🧑‍💼

  • Krishna Dubey (Pipeline design, ML modeling, deployment, UI dev)