An End-to-End ETL, Database Engineering, and Business Intelligence Architecture
This project architects a complete data pipeline to analyze clinical outcomes, operational efficiency, and financial billing models for a synthetic healthcare network. By migrating over 3 million records from raw CSVs into a structured MySQL relational database, and ultimately into a dynamic Power BI Star Schema, this project uncovers the true drivers of hospital readmissions and demystifies flat-rate DRG billing structures.
Tech Stack: Python (Pandas, NumPy, SQLAlchemy) | MySQL | Power BI | DAX
The raw dataset consisted of highly denormalized, multi-grain clinical records (demographics, lab results, medications, and hospital outcomes).
The ETL Process:
- Extraction & Transformation (Python/Pandas): * Engineered new features including a validated Charlson Comorbidity Index mapping and age-group clustering.
- Normalized pipe-delimited
secondary_diagnosesstrings into a strict 1-to-Many relational bridging table using.explode(). - Preserved structural nulls (e.g.,
days_to_readmission) to protect downstream analytical integrity.
- Normalized pipe-delimited
- Database Loading (MySQL): * Designed a relational schema to handle 1-to-1 patient outcomes alongside 1-to-Many transactional lab results.
- Executed a memory-efficient bulk insert of 2.2 million lab records utilizing optimized chunking (
chunksize=50000).
- Executed a memory-efficient bulk insert of 2.2 million lab records utilizing optimized chunking (
- Data Modeling (SQL Views -> Power BI):
- Constructed a strict Star Schema utilizing SQL Views as a semantic layer.
- Pre-aggregated the 2.2 million-row lab results table using Common Table Expressions (CTEs) within the SQL View to prevent Cartesian product explosions when joining to patient financial outcomes.
During the SQL EDA phase, I tested multiple clinical hypotheses by isolating variables to find the root causes of hospital inefficiencies.
- The Anomaly: Initial queries joining transaction-level diagnoses to patient-level outcomes showed evenly distributed hospital charges (~$18,000) across all diseases, with zero cost variance between ICU and Non-ICU patients.
- The Investigation: Row-count audits revealed a grain mismatch causing hospital bills to duplicate across every historical disease a patient possessed.
- The Conclusion: After isolating the 1-to-1 primary outcome records, the data definitively proved the hospital utilizes Diagnosis-Related Group (DRG) flat-rate billing rather than Fee-For-Service.
When slicing financial outcomes by insurance_type, the data revealed a stark clinical reality:
- Uninsured Patients received the highest "Self-Pay Discounts" (lowest overall bills), but suffered the longest average hospital stays (5.29 days) and the highest in-hospital mortality rate (1.53%).
- Conclusion: Lack of insurance drives delayed care, resulting in patients presenting at the emergency room with much higher acuity.
I ran distinct scenarios to determine the root cause of 30-day hospital readmissions.
- Scenario A (Demographics): Age proved to be the baseline gravity. Readmissions strictly followed the age hierarchy (Senior > Middle Age > Adult).
- Scenario B (Clinical Complexity): When grouping the data purely by a binary "Abnormal Lab" flag, readmission rates hovered at a flat 15%.
- Scenario C (Combined Synthesis): When counting the volume of abnormal lab tests per patient and combining it with the Senior age bracket, a distinct cohort emerged with a 26.3% readmission rate.
- Conclusion: While age is the baseline driver, Age combined with high clinical complexity (volume of failed tests) is the ultimate predictor of readmission.
1. Financial & Operational Overview:

2. Clinical Risk & Readmission Dynamics:

- Clone this repository.
- Download the dataset here and extract the CSVs into the patients/data/ folder.
- Ensure MySQL Server is running locally on port 3306.
- Configure your environment variables for
DB_USER_PROJECTandDB_PASS_PROJECT. - Run the Jupyter Notebook
patient_analytics.ipynbto execute the ETL pipeline. - Open the
Healthcare_Analytics.pbixfile to interact with the dashboard.