This project performs a comprehensive exploratory data analysis (EDA) on a Parkinson’s disease dataset. The goal is to uncover patterns, relationships, and distinguishing features between Healthy individuals and Parkinson’s patients.
The dataset contains voice measurement features (Jitter, Shimmer, HNR, etc.) collected from subjects. This EDA helps identify key features for further predictive modeling.
- Source: https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/parkinsons.data
- Number of records: 195
- Number of columns: 24 (22 numeric voice features, the status target, and the name identifier)
- Target variable: status (0 = Healthy, 1 = Parkinson’s)
1️⃣ Data Loading and Cleaning
- Checked the shape and column info along with their data types.
- Checked for missing values and duplicate rows.
- Checked the target variable distribution.
- Removed unnecessary identifier columns (e.g., name).
- Ensured all features are numeric.
- Stored a clean copy for analysis.
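
The loading and cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper names `inspect` and `clean` are hypothetical, and the only things taken from this README are the source URL and the `name` and `status` column names.

```python
import pandas as pd

# URL taken from the dataset description above.
DATA_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/parkinsons.data"


def inspect(df: pd.DataFrame) -> dict:
    """Collect the basic checks (hypothetical helper) into one dictionary."""
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": int(df.isna().sum().sum()),        # total missing cells
        "duplicates": int(df.duplicated().sum()),     # fully duplicated rows
        "status_counts": df["status"].value_counts().to_dict(),
    }


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the identifier column and confirm every remaining column is numeric."""
    out = df.drop(columns=["name"], errors="ignore").copy()
    non_numeric = out.select_dtypes(exclude="number").columns.tolist()
    if non_numeric:
        raise ValueError(f"Non-numeric columns remain: {non_numeric}")
    return out


# Example usage (needs network access to reach the UCI archive):
# raw = pd.read_csv(DATA_URL)
# print(inspect(raw))
# df = clean(raw)
```

Returning a copy from `clean` keeps the raw frame untouched, which matches the "clean copy" step above.
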
2️⃣ Descriptive Statistics
- Computed summary statistics (count, mean, standard deviation, min, quartiles, max) for each numeric column using the .describe() method.
- Identified features with high variance for further analysis.
- Observed mean differences between Healthy and Parkinson’s groups for various features.
- Features like MDVP:Fo(Hz) and HNR are lower in Parkinson’s patients.
- Features like spread1, PPE, and MDVP:Shimmer are higher in Parkinson’s patients.
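
One possible way to reproduce the per-group comparison is shown below. It assumes `df` is the cleaned, all-numeric frame with a `status` column coded as in this README (0 = Healthy, 1 = Parkinson’s); the function names `group_means` and `variance_ranking` are illustrative, not taken from the notebook.

```python
import pandas as pd


def group_means(df: pd.DataFrame, target: str = "status") -> pd.DataFrame:
    """Per-class feature means plus their difference (Parkinson's minus Healthy)."""
    means = df.groupby(target).mean(numeric_only=True).T
    # Assumes the target only takes the values 0 and 1.
    means.columns = ["healthy" if c == 0 else "parkinsons" for c in means.columns]
    means["difference"] = means["parkinsons"] - means["healthy"]
    # Largest absolute difference first.
    return means.sort_values("difference", key=abs, ascending=False)


def variance_ranking(df: pd.DataFrame, target: str = "status") -> pd.Series:
    """Feature variances, largest first."""
    return df.drop(columns=[target]).var(numeric_only=True).sort_values(ascending=False)


# Example usage:
# print(df.describe().T)
# print(group_means(df).head(10))
```

Raw variances and mean differences depend on each feature's units (Hz-valued features will dominate the ranking), so they are best read alongside the standard deviations reported by `.describe()`.
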
3️⃣ Univariate Analysis
- Plotted distributions of key acoustic features (MDVP:Fo(Hz), HNR, spread1, RPDE, PPE, etc.) by status to visualize how voice characteristics differ between healthy and Parkinson’s individuals.
Insights:
- 📉 MDVP:Fo(Hz) and HNR values are typically lower for Parkinson’s patients.
- 📈 spread1, RPDE, and PPE are higher for Parkinson’s patients, indicating greater voice signal irregularity.
- The clear shifts in distributions suggest that these features are strong indicators for distinguishing between the two classes.
4️⃣ Bivariate Analysis
- Compared medians, spread, and outliers of each feature between the two classes using boxplots.
Insights:
- Identified top distinguishing features: MDVP:Fo(Hz) and HNR (lower in Parkinson’s); RPDE, spread1, and PPE (higher in Parkinson’s).
- Individuals with Parkinson’s (status = 1) exhibit noticeably lower vocal frequencies (MDVP:Fo(Hz), MDVP:Flo(Hz)) and a lower Harmonics-to-Noise Ratio (HNR). Conversely, measures of vocal irregularity and complexity such as PPE, RPDE, and spread1 are markedly higher in the Parkinson’s group.
- Some features (e.g., MDVP:Fhi(Hz)) show moderate differences, but the heavy overlap between classes makes them the least effective differentiators.
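
The boxplot comparison can be approximated as below. This is a sketch under the same assumptions as before (`df` is the cleaned frame, `status` is coded 0/1); `class_medians` and `plot_boxplots` are hypothetical helper names.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def class_medians(df: pd.DataFrame, features: list, target: str = "status") -> pd.DataFrame:
    """Median of each feature per class (rows: features, columns: status values)."""
    return df.groupby(target)[features].median().T


def plot_boxplots(df: pd.DataFrame, features: list, target: str = "status"):
    """One boxplot panel per feature, split by class."""
    plot_df = df.assign(group=df[target].map({0: "Healthy", 1: "Parkinson's"}))
    fig, axes = plt.subplots(1, len(features), figsize=(3.5 * len(features), 4), squeeze=False)
    for ax, feature in zip(axes[0], features):
        sns.boxplot(data=plot_df, x="group", y=feature, ax=ax)
        ax.set_title(feature)
        ax.set_xlabel("")
    fig.tight_layout()
    return fig


# Example usage:
# print(class_medians(df, ["MDVP:Fo(Hz)", "HNR", "PPE"]))
# plot_boxplots(df, ["MDVP:Fo(Hz)", "HNR", "RPDE", "spread1", "PPE", "MDVP:Fhi(Hz)"])
# plt.show()
```

Printing the medians next to the plots makes it easier to state how large a shift is rather than judging it by eye.
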
5️⃣ Correlation Analysis
- Computed correlation matrix of all numeric features.
- Visualized using a heatmap with a color gradient.
- Identified highly correlated features, especially among Jitter and Shimmer measures.
- Checked correlation with status to determine top predictive features.
Insights:
- spread1, PPE, MDVP:Fo(Hz), MDVP:Flo(Hz), MDVP:Shimmer, and HNR are the features most strongly correlated with Parkinson’s status.
- High correlation among Jitter/Shimmer features indicates potential multicollinearity.
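
The correlation steps above might look like the following. It is a sketch assuming `df` holds only numeric columns; the 0.9 cut-off for "highly correlated" is an illustrative assumption (this README does not state a threshold), and the function names are hypothetical.

```python
from itertools import combinations

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def status_correlations(df: pd.DataFrame, target: str = "status") -> pd.Series:
    """Pearson correlation of every feature with the target, strongest first."""
    corr = df.corr()[target].drop(target)
    return corr.sort_values(key=abs, ascending=False)


def correlated_pairs(df: pd.DataFrame, target: str = "status", threshold: float = 0.9) -> pd.Series:
    """Feature pairs whose absolute correlation exceeds the (assumed) threshold."""
    corr = df.drop(columns=[target]).corr().abs()
    pairs = {
        (a, b): corr.loc[a, b]
        for a, b in combinations(corr.columns, 2)  # each unordered pair once
        if corr.loc[a, b] > threshold
    }
    return pd.Series(pairs, dtype=float).sort_values(ascending=False)


def plot_heatmap(df: pd.DataFrame):
    """Heatmap of the full correlation matrix on a diverging colour scale."""
    fig, ax = plt.subplots(figsize=(12, 10))
    sns.heatmap(df.corr(), cmap="coolwarm", vmin=-1, vmax=1, center=0, ax=ax)
    fig.tight_layout()
    return fig


# Example usage:
# print(status_correlations(df).head(6))
# print(correlated_pairs(df))
# plot_heatmap(df); plt.show()
```

Sorting by absolute value keeps negatively correlated features such as HNR near the top alongside the positively correlated ones.
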
6️⃣ Multivariate Analysis
- Explored feature interactions using pairplots.
- Observed clustering and separation trends between Healthy and Parkinson’s groups.
Insights:
- Identified feature pairs that highlight class distinction, e.g., MDVP:Fo(Hz) vs HNR.
- The diagonal density plots confirm the findings from the boxplots: spread1, PPE, and MDVP:Shimmer are higher for Parkinson's patients, whereas HNR and MDVP:Fo(Hz) are lower.
- The tight correlation between features like spread1 and PPE indicates multicollinearity.
- The scatter plots reveal excellent class separation in multi-feature space.
7️⃣ Key Takeaways
- Features with high mean difference and low overlap are most discriminative.
- Some features are highly correlated with each other; careful feature selection is needed for modeling.
- Visual exploration confirms that voice features can distinguish Healthy vs Parkinson’s patients.
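
One way to put a number on "high mean difference and low overlap" is a standardized mean difference (Cohen's d with a pooled standard deviation). This metric is not part of the analysis described above; it is offered only as a possible follow-up, and the function name `standardized_difference` is hypothetical.

```python
import numpy as np
import pandas as pd


def standardized_difference(df: pd.DataFrame, target: str = "status") -> pd.Series:
    """Cohen's d per feature: (mean_parkinsons - mean_healthy) / pooled std."""
    g0 = df[df[target] == 0].drop(columns=[target])  # Healthy
    g1 = df[df[target] == 1].drop(columns=[target])  # Parkinson's
    n0, n1 = len(g0), len(g1)
    # Pooled variance weights each class variance by its degrees of freedom.
    pooled_var = ((n0 - 1) * g0.var() + (n1 - 1) * g1.var()) / (n0 + n1 - 2)
    d = (g1.mean() - g0.mean()) / np.sqrt(pooled_var)
    return d.sort_values(key=abs, ascending=False)


# Example usage:
# print(standardized_difference(df).head(10))
```

Because the difference is divided by the within-class spread, features measured in different units become directly comparable, and a larger absolute value corresponds to less overlap between the two groups.
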
| Category | Tools Used |
|---|---|
| Data Handling | pandas, numpy |
| Visualization | matplotlib, seaborn |
| Environment | Jupyter Notebook |
| Insights & Observations | ChatGPT, Gemini |
- Boxplots for key features by status
- Correlation heatmap to check feature relationships
- Pairplots for feature interactions
This project is licensed under the MIT License.



