GitHub - Darksteel047/Parkinsons_EDA: Exploratory Data Analysis on Parkinson's Disease patient dataset

📖 Vocal-Feature-Analysis-for-Parkinsons

Overview

This project performs a comprehensive exploratory data analysis (EDA) on a Parkinson’s disease dataset. The goal is to uncover patterns, relationships, and distinguishing features between Healthy individuals and Parkinson’s patients.

The dataset contains voice measurement features (Jitter, Shimmer, HNR, etc.) collected from subjects. This EDA helps identify key features for further predictive modeling.

📂 Dataset

Source: https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/parkinsons.data
Number of records: 195
Number of features: 22 numeric features + status + name
Target variable: status (0 = Healthy, 1 = Parkinson’s)
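
A minimal loading sketch with pandas, assuming the file at the URL above is comma-separated with a header row. The helper names `load_parkinsons` and `dataset_summary` are illustrative and are not taken from the notebook.

```python
import pandas as pd

# URL quoted in the Dataset section above.
UCI_URL = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases/"
    "parkinsons/parkinsons.data"
)


def load_parkinsons(source=UCI_URL):
    """Read the CSV (path, URL, or file-like object) into a DataFrame."""
    return pd.read_csv(source)


def dataset_summary(df, target="status", id_col="name"):
    """Return the record count, feature count, and per-class counts."""
    feature_cols = [c for c in df.columns if c not in (target, id_col)]
    return {
        "n_records": len(df),
        "n_features": len(feature_cols),
        "class_counts": df[target].value_counts().sort_index().to_dict(),
    }
```

If the download matches the figures listed above, `dataset_summary(load_parkinsons())` should report 195 records and 22 features.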

Analysis Steps and Key Insights:

1️⃣ Data Loading and Cleaning

Checked the shape and column info along with their data types.
Checked for missing values and duplicate rows.
Checked the target variable distribution.
Removed unnecessary identifier columns (e.g., name).
Ensured all features are numeric.
Stored a clean copy for analysis.
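
The checks above can be sketched roughly as follows. This is an assumption about how the steps might look in pandas, not a copy of the notebook; `quality_report` and `clean_parkinsons` are hypothetical helper names.

```python
import pandas as pd


def quality_report(df, target="status"):
    """Collect the basic checks listed above into one dictionary."""
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "target_counts": df[target].value_counts().sort_index().to_dict(),
    }


def clean_parkinsons(df, id_col="name"):
    """Drop the identifier column and verify the rest is numeric.

    Returns a copy, so the raw DataFrame is left untouched.
    """
    clean = df.drop(columns=[id_col], errors="ignore").copy()
    non_numeric = clean.select_dtypes(exclude="number").columns.tolist()
    if non_numeric:
        raise TypeError(f"Non-numeric columns remain: {non_numeric}")
    return clean
```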

2️⃣ Descriptive Statistics

Computed summary statistics (count, mean, standard deviation, min, quartiles, max) for each numeric column using the .describe() method.
Identified features with high variance for further analysis.
Observed mean differences between Healthy and Parkinson’s groups for various features.
Features like MDVP:Fo(Hz) and HNR are lower in Parkinson’s patients.
Features like spread1, PPE, and MDVP:Shimmer are higher in Parkinson’s patients.
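
One way to reproduce the per-class comparison is sketched below. It assumes `df` is the cleaned DataFrame with `status` coded 0/1 as described in the Dataset section. The coefficient of variation and the standardized difference are additions for illustration, chosen because raw variances and raw mean differences are not comparable across features measured on different scales; the notebook may have used something else.

```python
import pandas as pd


def describe_features(df, target="status"):
    """Summary statistics per feature, plus a scale-free spread measure."""
    summary = df.drop(columns=[target]).describe().T
    # Coefficient of variation: std relative to the size of the mean.
    summary["cv"] = summary["std"] / summary["mean"].abs()
    return summary


def group_mean_differences(df, target="status"):
    """Mean of every feature per class, ranked by standardized difference."""
    means = df.groupby(target).mean(numeric_only=True).T
    means = means.rename(columns={0: "healthy", 1: "parkinsons"})
    means["diff"] = means["parkinsons"] - means["healthy"]
    # Divide by the feature's overall standard deviation so that
    # features on different scales can be ranked against each other.
    means["std_diff"] = means["diff"] / df[means.index].std()
    return means.sort_values("std_diff", key=abs, ascending=False)
```

A negative `diff` means the feature is lower on average in the Parkinson’s group; a positive one means it is higher.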

3️⃣ Univariate Analysis

Plotted distributions of key acoustic features (MDVP:Fo(Hz), HNR, spread1, RPDE, PPE, etc.) by status to visualize how voice characteristics differ between healthy and Parkinson’s individuals.

Insights:

📉 MDVP:Fo(Hz) and HNR values are typically lower for Parkinson’s patients.
📈 spread1, RPDE, and PPE are higher for Parkinson’s patients, indicating greater voice signal irregularity.
The clear shifts in distributions suggest that these features are strong indicators for distinguishing between the two classes.

4️⃣ Bivariate Analysis

Compared the median, spread, and outliers of each feature across the two classes using boxplots.

Insights:

Identified top distinguishing features: MDVP:Fo(Hz), HNR (lower in Parkinson’s), RPDE, spread1, PPE (higher in Parkinson’s)
Individuals with Parkinson's (status = 1) exhibit noticeably lower vocal frequencies (MDVP:Fo(Hz), MDVP:Flo(Hz)) and Harmonics-to-Noise Ratio (HNR). Conversely, measures of vocal irregularity and complexity like PPE, RPDE, and spread1 are markedly higher in the Parkinson's group.
Some features (e.g., MDVP:Fhi(Hz)) show moderate differences in median, but the high overlap between classes makes them the least effective differentiators.
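
The quantities a boxplot displays can also be tabulated directly, which makes the class comparison easier to quote. The sketch below assumes the common 1.5 × IQR whisker rule for flagging outliers (seaborn's default); `boxplot_stats` is a hypothetical helper name.

```python
import pandas as pd


def boxplot_stats(df, feature, target="status"):
    """Median, IQR, and outlier count per class for one feature."""
    rows = {}
    for label, values in df.groupby(target)[feature]:
        q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        rows[label] = {
            "median": median,
            "iqr": iqr,
            "n_outliers": int(((values < lower) | (values > upper)).sum()),
        }
    return pd.DataFrame.from_dict(rows, orient="index")
```

The corresponding figure can be drawn with `sns.boxplot(data=df, x="status", y=feature)`.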

5️⃣ Correlation Analysis

Computed correlation matrix of all numeric features.
Visualized using a heatmap with a color gradient.
Identified highly correlated features, especially among Jitter and Shimmer measures.
Checked correlation with status to determine top predictive features.

Insights:

spread1, PPE, MDVP:Fo(Hz), MDVP:Flo(Hz), MDVP:Shimmer, and HNR are the features most correlated with Parkinson’s status.
High correlation among Jitter/Shimmer features indicates potential multicollinearity.
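
The two correlation checks could be sketched as below, assuming Pearson correlation (the pandas default) on the cleaned DataFrame. The 0.9 cut-off for "highly correlated" is an arbitrary illustrative threshold, not a value taken from the notebook.

```python
import numpy as np
import pandas as pd


def correlation_with_target(df, target="status"):
    """Correlation of each feature with the target, strongest first."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    return corr.sort_values(key=abs, ascending=False)


def highly_correlated_pairs(df, target="status", threshold=0.9):
    """Feature pairs whose absolute correlation exceeds the threshold."""
    corr = df.drop(columns=[target]).corr(numeric_only=True).abs()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (self-correlation of 1.0) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)
```

The heatmap itself can be drawn from the same matrix with `sns.heatmap(df.corr(numeric_only=True), cmap="coolwarm")`.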

6️⃣ Multivariate Analysis

Explored feature interactions using pairplots.
Observed clustering and separation trends between Healthy and Parkinson’s groups.

Insights:

Identified feature pairs that highlight class distinction, e.g., MDVP:Fo(Hz) vs HNR.
The diagonal density plots confirm the findings from the boxplots: spread1, PPE, and MDVP:Shimmer are higher for Parkinson's patients, whereas HNR and MDVP:Fo(Hz) are lower.
The tight correlation between features like spread1 and PPE indicates multicollinearity.
The scatter plots reveal excellent class separation in multi-feature space.

7️⃣ Key Takeaways

Features with a large difference in class means and little overlap between the class distributions are the most discriminative.
Some features are highly correlated with each other; careful feature selection is needed for modeling.
Visual exploration confirms that voice features can distinguish Healthy vs Parkinson’s patients.
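
"Overlap" in the first takeaway can be put on a number. The sketch below uses a simple histogram overlap coefficient, which is one possible measure and is not something the notebook is known to compute; the bin count of 20 is an arbitrary assumption.

```python
import numpy as np


def histogram_overlap(a, b, bins=20):
    """Shared area of two normalized histograms, between 0 and 1.

    0 means the two samples never fall in the same bin;
    1 means their binned distributions are identical.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Use common bin edges so the two histograms are comparable.
    edges = np.histogram_bin_edges(np.concatenate([a, b]), bins=bins)
    p, _ = np.histogram(a, bins=edges)
    q, _ = np.histogram(b, bins=edges)
    return float(np.minimum(p / p.sum(), q / q.sum()).sum())
```

Applied per feature, for instance `histogram_overlap(df.loc[df.status == 0, "PPE"], df.loc[df.status == 1, "PPE"])`, lower values point to features that separate the classes more cleanly.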

🛠️ Tools & Technologies

Category	Tools Used
Data Handling	`pandas`, `numpy`
Visualization	`matplotlib`, `seaborn`
Environment	Jupyter Notebook
Insights & Observations	ChatGPT, Gemini

📊 Visualizations

Boxplots for key features by status
Correlation heatmap to check feature relationships
Pairplots for feature interactions

📄 License

This project is licensed under the MIT License.

