ML pipeline for performing agglomerative hierarchical clustering with comprehensive evaluation metrics and visualizations.
- Multiple Clustering Methods: Complete, single, and average linkage
- Multiple Distance Metrics: Bray-Curtis, Euclidean, and other metrics supported by SciPy
- Comprehensive Visualizations:
  - Clustermaps with dendrograms
  - Hierarchical clustering dendrograms
  - Color-coded groupings
- Performance Metrics:
  - Confusion matrix
  - Accuracy
  - Adjusted Rand Index (ARI)
  - Statistical significance testing
- Automated Output: All results and visualizations are saved to an organized output directory
Requirements:
- Python 3.8 or higher
- pip package manager
Install required packages:
pip install -r requirements.txt
- Prepare your data in Excel format (.xlsx) with:
  - Features as columns
  - Samples as rows
  - A binary outcome column (0/1)
- Place your data file in the project directory as data.xlsx
- Run the analysis:
  python hierarchical_clustering.py
Edit the configuration section in hierarchical_clustering.py:
DATA_PATH = 'your_data.xlsx'  # Path to your data file
OUTCOME_COL = 'binary'        # Name of outcome column
OUTPUT_DIR = 'results'        # Output directory name

Use the HierarchicalClusteringAnalyzer class in your own scripts:
from hierarchical_clustering import HierarchicalClusteringAnalyzer

# Initialize analyzer
analyzer = HierarchicalClusteringAnalyzer(
    data_path='data.xlsx',
    outcome_col='binary',
    output_dir='results'
)

# Run analysis with custom parameters
# Note: in SciPy, Ward linkage is only defined for Euclidean distances,
# and the Manhattan metric is named 'cityblock'.
results = analyzer.run_comprehensive_analysis(
    metrics=['euclidean', 'manhattan'],
    methods=['ward', 'average']
)

# Generate dendrogram
linkage_matrix, clusters = analyzer.generate_dendrogram(
    method='complete',
    metric='euclidean'
)

All outputs are saved to the results/ directory:
- Clustermaps: {outcome}_clustermap_{method}_{metric}.png
- Comprehensive Results: {outcome}_all_results.xlsx
- Dendrogram: {outcome}_dendrogram_{method}_{metric}.png
- Dendrogram Metrics: {outcome}_dendrogram_results.xlsx
- Reordered Data: {outcome}_dendrogram_order.xlsx
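
To check that a run produced the files listed above, the naming patterns can be expanded for a given outcome, method, and metric. This is a minimal sketch: `expected_output_paths` is a hypothetical helper, not part of hierarchical_clustering.py, and it assumes `{outcome}` is the name of the outcome column (as in the sample console output).

```python
from pathlib import Path


def expected_output_paths(outcome, method, metric, output_dir="results"):
    """Build the documented output paths for one method/metric combination.

    Hypothetical helper: it only mirrors the naming patterns listed in this
    README and does not call the pipeline itself.
    """
    base = Path(output_dir)
    return {
        "clustermap": base / f"{outcome}_clustermap_{method}_{metric}.png",
        "all_results": base / f"{outcome}_all_results.xlsx",
        "dendrogram": base / f"{outcome}_dendrogram_{method}_{metric}.png",
        "dendrogram_results": base / f"{outcome}_dendrogram_results.xlsx",
        "dendrogram_order": base / f"{outcome}_dendrogram_order.xlsx",
    }


paths = expected_output_paths("binary", "complete", "braycurtis")

# Names of any expected files that are not on disk yet.
missing = [name for name, path in paths.items() if not path.exists()]
```
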
Your input Excel file should have the following structure:
| patient | feature1 | feature2 | ... | binary |
|---|---|---|---|---|
| ID001 | 0.5 | 1.2 | ... | 0 |
| ID002 | 0.8 | 0.9 | ... | 1 |
- patient (optional): Sample identifier
- features: Numeric columns for clustering
- binary: Binary outcome column (0 or 1)
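
Before running the pipeline, it can help to check that a file matches this layout. The sketch below uses pandas and assumes the column names shown in the table (`patient`, `binary`); `split_features_and_outcome` is a hypothetical helper and not necessarily how hierarchical_clustering.py loads data.

```python
import pandas as pd


def split_features_and_outcome(df, outcome_col="binary", id_col="patient"):
    """Return (features, outcome) after checking the expected layout."""
    if outcome_col not in df.columns:
        raise ValueError(f"Missing outcome column: {outcome_col!r}")

    # The identifier column is optional; use it as the index when present.
    if id_col in df.columns:
        df = df.set_index(id_col)

    outcome = df[outcome_col]
    if not outcome.isin([0, 1]).all():
        raise ValueError("Outcome column must contain only 0 and 1")

    features = df.drop(columns=[outcome_col])
    non_numeric = features.select_dtypes(exclude="number").columns.tolist()
    if non_numeric:
        raise ValueError(f"Non-numeric feature columns: {non_numeric}")

    return features, outcome


# In practice the frame would come from pd.read_excel("data.xlsx");
# a small in-memory example with the same structure is used here.
example = pd.DataFrame(
    {
        "patient": ["ID001", "ID002"],
        "feature1": [0.5, 0.8],
        "feature2": [1.2, 0.9],
        "binary": [0, 1],
    }
)
features, outcome = split_features_and_outcome(example)
```
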
Linkage methods:
- Complete: Maximum pairwise distance between the points of two clusters
- Single: Minimum pairwise distance between the points of two clusters
- Average: Mean pairwise distance between the points of two clusters
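
The difference between the three linkage definitions shows up in the height of the final merge. The sketch below calls SciPy's `linkage` and `fcluster` directly on four toy points; it illustrates the definitions and is not a description of how the analyzer class is implemented.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Four 1-D samples forming two obvious groups: {0, 1} and {4, 5}.
points = np.array([[0.0], [1.0], [4.0], [5.0]])

# Height of the last merge (the distance between the two groups)
# under each linkage definition.
final_merge_height = {}
for method in ("single", "complete", "average"):
    Z = linkage(points, method=method, metric="euclidean")
    final_merge_height[method] = float(Z[-1, 2])

# single:   min pairwise distance  = |1 - 4| = 3.0
# complete: max pairwise distance  = |0 - 5| = 5.0
# average:  mean pairwise distance = (4 + 5 + 3 + 4) / 4 = 4.0

# Cut the complete-linkage tree into two flat clusters.
Z_complete = linkage(points, method="complete", metric="euclidean")
labels = fcluster(Z_complete, t=2, criterion="maxclust")
```
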
Distance metrics:
- Bray-Curtis: Dissimilarity measure commonly used for ecological abundance data
- Euclidean: Straight-line distance
- Additional metrics supported by SciPy
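
For reference, the two named metrics can be written out in plain Python. This is a sketch of the standard definitions (the Bray-Curtis form matches the one SciPy documents); the pipeline presumably relies on SciPy's own implementations rather than functions like these.

```python
import math


def euclidean(u, v):
    """Straight-line distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("Vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: sum|u_i - v_i| / sum|u_i + v_i|."""
    if len(u) != len(v):
        raise ValueError("Vectors must have the same length")
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(abs(a + b) for a, b in zip(u, v))
    if denominator == 0:
        # Undefined when both vectors are all zeros.
        raise ValueError("Bray-Curtis is undefined for two all-zero vectors")
    return numerator / denominator


sample_a = [1.0, 2.0, 3.0]
sample_b = [2.0, 4.0, 3.0]

d_euclidean = euclidean(sample_a, sample_b)      # sqrt(1 + 4 + 0)
d_bray_curtis = bray_curtis(sample_a, sample_b)  # (1 + 2 + 0) / (3 + 6 + 6)
```

Bray-Curtis normalizes by the total magnitude of both samples, so for non-negative data it stays between 0 and 1, whereas Euclidean distance grows with the scale of the features.
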
Evaluation metrics:
- Confusion Matrix: True positives, false positives, true negatives, false negatives
- Accuracy: Overall classification accuracy
- Adjusted Rand Index (ARI): Agreement between cluster assignments and the true outcome, corrected for chance
- T-test P-value: Statistical significance of cluster separation
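
Cluster labels are arbitrary, so a confusion matrix and accuracy need a mapping from clusters to outcome classes, while ARI does not. The sketch below assumes exactly two clusters and picks the mapping with the higher accuracy; that mapping rule is an assumption for illustration, since how the pipeline assigns clusters to classes is not stated here. All function names are hypothetical.

```python
from collections import Counter
from math import comb


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def map_clusters_to_outcome(y_true, clusters):
    """Relabel two clusters as 0/1 using the better-scoring assignment."""
    first, second = sorted(set(clusters))  # assumes exactly two clusters
    candidates = [{first: 0, second: 1}, {first: 1, second: 0}]
    best = max(
        candidates,
        key=lambda m: accuracy(y_true, [m[c] for c in clusters]),
    )
    return [best[c] for c in clusters]


def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels."""
    pairs = Counter(zip(y_true, y_pred))
    return pairs[(0, 0)], pairs[(0, 1)], pairs[(1, 0)], pairs[(1, 1)]


def adjusted_rand_index(labels_a, labels_b):
    """ARI from the contingency table of two labelings."""
    total_pairs = comb(len(labels_a), 2)
    same_in_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    same_in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    same_in_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = same_in_a * same_in_b / total_pairs
    maximum = (same_in_a + same_in_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (same_in_both - expected) / (maximum - expected)


outcome = [0, 0, 0, 1, 1, 1]
clusters = [2, 2, 1, 1, 1, 1]  # e.g. flat cluster labels numbered from 1

predicted = map_clusters_to_outcome(outcome, clusters)
tn, fp, fn, tp = confusion_counts(outcome, predicted)
acc = accuracy(outcome, predicted)
ari = adjusted_rand_index(outcome, clusters)  # unaffected by relabeling
```
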
======================================================================
HIERARCHICAL CLUSTERING ANALYSIS
======================================================================
Running 6 clustering analyses...
[1/6] Processing: complete linkage with braycurtis metric
✓ Saved clustermap: binary_clustermap_complete_braycurtis.png
...
======================================================================
SUMMARY OF RESULTS
======================================================================
Metric Linkage Method True Negatives ... Accuracy Adjusted Rand Index
braycurtis complete 45 ... 0.876 0.723
...
======================================================================
ANALYSIS COMPLETE
======================================================================
All results saved to: results/