bchinni/HierarchicalClustering_BinaryOutcome

Hierarchical Clustering Analysis Tool

A Python pipeline that runs agglomerative hierarchical clustering on tabular data and compares the resulting clusters against a binary outcome, with evaluation metrics and visualizations.

Features

  • Multiple Clustering Methods: Complete, single, and average linkage
  • Multiple Distance Metrics: Bray-Curtis, Euclidean, and more
  • Comprehensive Visualizations:
    • Clustermaps with dendrograms
    • Hierarchical clustering dendrograms
    • Color-coded groupings
  • Performance Metrics:
    • Confusion matrix
    • Accuracy
    • Adjusted Rand Index (ARI)
    • Statistical significance testing
  • Automated Output: All results and visualizations are saved to an organized output directory

Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager

Setup

Install required packages:

pip install -r requirements.txt

Usage

Basic Usage

  1. Prepare your data in Excel format (.xlsx) with:

    • Features as columns
    • Samples as rows
    • A binary outcome column (0/1)
  2. Place your data file in the project directory as data.xlsx

  3. Run the analysis:

python hierarchical_clustering.py

Custom Configuration

Edit the configuration section in hierarchical_clustering.py:

DATA_PATH = 'your_data.xlsx'      # Path to your data file
OUTCOME_COL = 'binary'             # Name of outcome column
OUTPUT_DIR = 'results'             # Output directory name

Advanced Usage

Use the HierarchicalClusteringAnalyzer class in your own scripts:

from hierarchical_clustering import HierarchicalClusteringAnalyzer

# Initialize analyzer
analyzer = HierarchicalClusteringAnalyzer(
    data_path='data.xlsx',
    outcome_col='binary',
    output_dir='results'
)

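# Run analysis with custom parameters.
# Metric names follow scipy ('cityblock' is the Manhattan distance).
# Ward linkage is only defined for Euclidean distances, so it is not
# combined with other metrics here.
results = analyzer.run_comprehensive_analysis(
    metrics=['euclidean', 'cityblock'],
    methods=['complete', 'average']
)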

# Generate dendrogram
linkage_matrix, clusters = analyzer.generate_dendrogram(
    method='complete',
    metric='euclidean'
)

Output Files

All outputs are saved to the configured output directory (results/ in the examples above):

  • Clustermaps: {outcome}_clustermap_{method}_{metric}.png
  • Comprehensive Results: {outcome}_all_results.xlsx
  • Dendrogram: {outcome}_dendrogram_{method}_{metric}.png
  • Dendrogram Metrics: {outcome}_dendrogram_results.xlsx
  • Reordered Data: {outcome}_dendrogram_order.xlsx
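
The naming patterns above can be turned into a quick check that a run produced every expected file. This is a minimal sketch, assuming `{outcome}` is the outcome column name (as in `binary_clustermap_complete_braycurtis.png` from the example output); `expected_outputs` is a hypothetical helper, not part of the tool.

```python
from pathlib import Path


def expected_outputs(outcome, method, metric, output_dir="results"):
    """Build the file paths implied by the documented naming patterns.

    Hypothetical helper for illustration; not part of hierarchical_clustering.py.
    """
    out = Path(output_dir)
    return {
        "clustermap": out / f"{outcome}_clustermap_{method}_{metric}.png",
        "all_results": out / f"{outcome}_all_results.xlsx",
        "dendrogram": out / f"{outcome}_dendrogram_{method}_{metric}.png",
        "dendrogram_results": out / f"{outcome}_dendrogram_results.xlsx",
        "dendrogram_order": out / f"{outcome}_dendrogram_order.xlsx",
    }


paths = expected_outputs("binary", "complete", "braycurtis")
# List any expected file that is not on disk yet.
missing = [name for name, path in paths.items() if not path.exists()]
print("missing outputs:", missing)
```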

Data Format

Your input Excel file should have the following structure:

| patient | feature1 | feature2 | ... | binary |
|---------|----------|----------|-----|--------|
| ID001   | 0.5      | 1.2      | ... | 0      |
| ID002   | 0.8      | 0.9      | ... | 1      |

  • patient (optional): Sample identifier
  • features: Numeric columns for clustering
  • binary: Binary outcome column (0 or 1)
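
Before running the analysis, it can help to confirm that a table matches this layout. The sketch below is one possible pre-flight check using pandas; `validate_input` is a hypothetical helper, and rejecting missing values is an assumption, since the tool's own handling of them is not documented here.

```python
import pandas as pd


def validate_input(df, outcome_col="binary", id_col="patient"):
    """Split a table into numeric features and a 0/1 outcome, or raise ValueError.

    Hypothetical helper for illustration; not part of hierarchical_clustering.py.
    """
    if outcome_col not in df.columns:
        raise ValueError(f"missing outcome column: {outcome_col!r}")
    outcome = df[outcome_col]
    # NaN is not in [0, 1], so missing outcomes are rejected here as well.
    if not outcome.isin([0, 1]).all():
        raise ValueError("outcome column must contain only 0 and 1")
    # Everything except the optional identifier and the outcome is a feature.
    features = df.drop(columns=[c for c in (id_col, outcome_col) if c in df.columns])
    non_numeric = [c for c in features.columns
                   if not pd.api.types.is_numeric_dtype(features[c])]
    if non_numeric:
        raise ValueError(f"non-numeric feature columns: {non_numeric}")
    # Assumption: missing feature values are treated as an error.
    if features.isna().any().any():
        raise ValueError("feature columns contain missing values")
    return features, outcome.astype(int)


df = pd.DataFrame({
    "patient": ["ID001", "ID002"],
    "feature1": [0.5, 0.8],
    "feature2": [1.2, 0.9],
    "binary": [0, 1],
})
features, outcome = validate_input(df)
print(features.columns.tolist(), outcome.tolist())
```

For a real file, the same check can be applied to `pd.read_excel('data.xlsx')`, which needs an Excel engine such as openpyxl installed.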

Methods and Metrics

Linkage Methods

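  • Complete: Distance between two clusters is the maximum pairwise distance between their members
  • Single: Distance between two clusters is the minimum pairwise distance between their members
  • Average: Distance between two clusters is the mean pairwise distance between their members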
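
To make the difference between the three linkage rules concrete, the sketch below clusters three points on a line with `scipy.cluster.hierarchy.linkage` and records the height of the final merge under each rule. The toy data is invented for illustration, and the snippet calls scipy directly rather than this tool's API.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Three 1-D points: 0 and 1 merge first (distance 1), then {0, 1} joins 3.
points = np.array([[0.0], [1.0], [3.0]])

final_merge_height = {}
for method in ("single", "complete", "average"):
    Z = linkage(points, method=method, metric="euclidean")
    # Column 2 of the last row holds the distance at which the final merge happens.
    final_merge_height[method] = float(Z[-1, 2])

# single   -> 2.0  (min of |3-0|, |3-1|)
# complete -> 3.0  (max of |3-0|, |3-1|)
# average  -> 2.5  (mean of |3-0|, |3-1|)
print(final_merge_height)
```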

Distance Metrics

  • Bray-Curtis: Sum of absolute differences divided by the sum of totals; commonly used for ecological abundance data
  • Euclidean: Straight-line distance
  • Additional metrics supported by scipy
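
As a small worked example of the two named metrics, the sketch below computes both for a pair of invented vectors with `scipy.spatial.distance` and reproduces the Bray-Curtis value by hand.

```python
import numpy as np
from scipy.spatial.distance import braycurtis, euclidean

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 2.0, 5.0])

# Bray-Curtis: sum(|u - v|) / sum(|u + v|) = (1 + 0 + 2) / (3 + 4 + 8) = 0.2
bc_manual = np.abs(u - v).sum() / np.abs(u + v).sum()
bc = braycurtis(u, v)

# Euclidean: sqrt(1^2 + 0^2 + 2^2) = sqrt(5)
eu = euclidean(u, v)

print(bc, bc_manual, eu)
```

Bray-Curtis is bounded between 0 and 1 for non-negative data, whereas Euclidean distance grows with the scale of the features, so the two can rank pairs of samples differently.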

Evaluation Metrics

  • Confusion Matrix: True positives, false positives, true negatives, false negatives
  • Accuracy: Overall classification accuracy
  • Adjusted Rand Index (ARI): Similarity measure accounting for chance
  • T-test P-value: Statistical significance of cluster separation
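
The confusion matrix, accuracy, and ARI listed above could be computed along the following lines once the tree is cut into two clusters. This is a sketch under stated assumptions, not the tool's implementation: it assumes scikit-learn is installed, and `evaluate_clusters` and its label-alignment rule (keep whichever 0/1 labeling agrees more with the outcome) are hypothetical. The t-test is left out because the quantities it compares are not specified here.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, confusion_matrix


def evaluate_clusters(outcome, cluster_ids):
    """Score a two-cluster assignment against a binary outcome.

    Hypothetical helper for illustration; not part of hierarchical_clustering.py.
    """
    outcome = np.asarray(outcome)
    cluster_ids = np.asarray(cluster_ids)
    # Map the two cluster ids (e.g. 1 and 2 from scipy's fcluster) to 0/1.
    pred = (cluster_ids == cluster_ids.max()).astype(int)
    # Cluster ids are arbitrary, so flip the labeling if it agrees with
    # the outcome less than half the time (assumed alignment rule).
    if (pred == outcome).mean() < 0.5:
        pred = 1 - pred
    tn, fp, fn, tp = confusion_matrix(outcome, pred, labels=[0, 1]).ravel()
    return {
        "tn": int(tn), "fp": int(fp), "fn": int(fn), "tp": int(tp),
        "accuracy": float((tn + tp) / len(outcome)),
        # ARI is invariant to how the clusters are labeled.
        "ari": float(adjusted_rand_score(outcome, cluster_ids)),
    }


scores = evaluate_clusters(
    outcome=[0, 0, 0, 1, 1, 1],
    cluster_ids=[1, 1, 2, 2, 2, 2],
)
print(scores)
```

Accuracy depends on how cluster ids are matched to outcome classes, while ARI does not, which is one reason to report both.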

Example Output

======================================================================
HIERARCHICAL CLUSTERING ANALYSIS
======================================================================

Running 6 clustering analyses...
[1/6] Processing: complete linkage with braycurtis metric
✓ Saved clustermap: binary_clustermap_complete_braycurtis.png
...

======================================================================
SUMMARY OF RESULTS
======================================================================
     Metric Linkage Method  True Negatives  ...  Accuracy  Adjusted Rand Index
 braycurtis       complete              45  ...     0.876                0.723
...

======================================================================
ANALYSIS COMPLETE
======================================================================
All results saved to: results/
