🏥 Healthcare Clinical & Financial Analytics Pipeline

An End-to-End ETL, Database Engineering, and Business Intelligence Architecture

📌 Executive Summary

This project architects a complete data pipeline to analyze clinical outcomes, operational efficiency, and financial billing models for a synthetic healthcare network. By migrating over 3 million records from raw CSVs into a structured MySQL relational database, and ultimately into a dynamic Power BI Star Schema, this project uncovers the true drivers of hospital readmissions and demystifies flat-rate DRG billing structures.

Tech Stack: Python (Pandas, NumPy, SQLAlchemy) | MySQL | Power BI | DAX

🏗️ Architecture & Data Engineering

The raw dataset consisted of highly denormalized, multi-grain clinical records (demographics, lab results, medications, and hospital outcomes).

The ETL Process:

Extraction & Transformation (Python/Pandas): * Engineered new features including a validated Charlson Comorbidity Index mapping and age-group clustering.
- Normalized pipe-delimited secondary_diagnoses strings into a strict 1-to-Many relational bridging table using .explode().
- Preserved structural nulls (e.g., days_to_readmission) to protect downstream analytical integrity.
Database Loading (MySQL): * Designed a relational schema to handle 1-to-1 patient outcomes alongside 1-to-Many transactional lab results.
- Executed a memory-efficient bulk insert of 2.2 million lab records utilizing optimized chunking (chunksize=50000).
Data Modeling (SQL Views -> Power BI):
- Constructed a strict Star Schema utilizing SQL Views as a semantic layer.
- Pre-aggregated the 2.2 million-row lab results table using Common Table Expressions (CTEs) within the SQL View to prevent Cartesian product explosions when joining to patient financial outcomes.

🔍 Exploratory Data Analysis & Key Insights

During the SQL EDA phase, I tested multiple clinical hypotheses by isolating variables to find the root causes of hospital inefficiencies.

Insight 1: Demystifying the "Cartesian Smear" & DRG Billing

The Anomaly: Initial queries joining transaction-level diagnoses to patient-level outcomes showed evenly distributed hospital charges (~$18,000) across all diseases, with zero cost variance between ICU and Non-ICU patients.
The Investigation: Row-count audits revealed a grain mismatch causing hospital bills to duplicate across every historical disease a patient possessed.
The Conclusion: After isolating the 1-to-1 primary outcome records, the data definitively proved the hospital utilizes Diagnosis-Related Group (DRG) flat-rate billing rather than Fee-For-Service.

Insight 2: Uninsured Mortality Disparities

When slicing financial outcomes by insurance_type, the data revealed a stark clinical reality:

Uninsured Patients received the highest "Self-Pay Discounts" (lowest overall bills), but suffered the longest average hospital stays (5.29 days) and the highest in-hospital mortality rate (1.53%).
Conclusion: Lack of insurance drives delayed care, resulting in patients presenting at the emergency room with much higher acuity.

Insight 3: The 25% Readmission Spike

I ran distinct scenarios to determine the root cause of 30-day hospital readmissions.

Scenario A (Demographics): Age proved to be the baseline gravity. Readmissions strictly followed the age hierarchy (Senior > Middle Age > Adult).
Scenario B (Clinical Complexity): When grouping the data purely by a binary "Abnormal Lab" flag, readmission rates hovered at a flat 15%.
Scenario C (Combined Synthesis): When counting the volume of abnormal lab tests per patient and combining it with the Senior age bracket, a distinct cohort emerged with a 26.3% readmission rate.
Conclusion: While age is the baseline driver, Age combined with high clinical complexity (volume of failed tests) is the ultimate predictor of readmission.

📊 Power BI Dashboard

1. Financial & Operational Overview:

2. Clinical Risk & Readmission Dynamics:

🚀 How to Run Locally

Clone this repository.
Download the dataset here and extract the CSVs into the patients/data/ folder.
Ensure MySQL Server is running locally on port 3306.
Configure your environment variables for DB_USER_PROJECT and DB_PASS_PROJECT.
Run the Jupyter Notebook patient_analytics.ipynb to execute the ETL pipeline.
Open the Healthcare_Analytics.pbix file to interact with the dashboard.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
dashboard		dashboard
images		images
notebooks		notebooks
project report		project report
sql		sql
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🏥 Healthcare Clinical & Financial Analytics Pipeline

📌 Executive Summary

🏗️ Architecture & Data Engineering

🔍 Exploratory Data Analysis & Key Insights

Insight 1: Demystifying the "Cartesian Smear" & DRG Billing

Insight 2: Uninsured Mortality Disparities

Insight 3: The 25% Readmission Spike

📊 Power BI Dashboard

🚀 How to Run Locally

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 1

Languages

Folders and files

Latest commit

History

Repository files navigation

🏥 Healthcare Clinical & Financial Analytics Pipeline

📌 Executive Summary

🏗️ Architecture & Data Engineering

🔍 Exploratory Data Analysis & Key Insights

Insight 1: Demystifying the "Cartesian Smear" & DRG Billing

Insight 2: Uninsured Mortality Disparities

Insight 3: The 25% Readmission Spike

📊 Power BI Dashboard

🚀 How to Run Locally

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 1

Languages

Packages