CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Overview

Clustrix is a Python distributed computing framework that enables seamless execution of Python functions on remote clusters (SLURM, PBS, SGE, Kubernetes, SSH) using a simple @cluster decorator.

Development Commands

Setup

# Install package with development dependencies
pip install -e ".[dev]"

# Install with Kubernetes support
pip install -e ".[kubernetes,dev]"

Code Quality

CRITICAL: Always run these checks before committing or pushing code:

# Use the comprehensive quality check script (recommended)
python scripts/check_quality.py

# Or run individual checks:
black clustrix/          # Format code
flake8 clustrix/         # Run linter  
mypy clustrix/           # Type checking
pytest --cov=clustrix    # Run tests with coverage

Pre-commit hooks are installed - they will automatically run black, flake8, and mypy before each commit. If any check fails, the commit will be blocked until fixed.

Never push without running quality checks - GitHub Actions will fail if code doesn't pass black, flake8, and mypy.

Architecture

Core Components

@cluster Decorator (clustrix/decorator.py): Main user interface for marking functions for remote execution. Supports resource specification and automatic loop parallelization.

⚠️ REPL Limitation: Functions defined interactively in the Python REPL cannot be serialized because inspect.getsource() cannot access their source code. This affects interactive Python sessions and some notebook environments. Functions must be defined in .py files or environments where source code is preserved.
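
The limitation can be reproduced without a cluster. The following is a minimal sketch, not Clustrix code: has_retrievable_source is a hypothetical helper, and the function built from a string stands in for one typed into a REPL.

```python
import inspect
import json


def has_retrievable_source(func) -> bool:
    """Return True if inspect.getsource() can read the function's source."""
    try:
        inspect.getsource(func)
        return True
    except (OSError, TypeError):
        return False


# Simulate an interactively defined function: the code is compiled from a
# string under a pseudo-filename, so there is no file for inspect to read.
namespace = {}
code = compile("def interactive_fn(x):\n    return x * 2\n", "<interactive-demo>", "exec")
exec(code, namespace)
interactive_fn = namespace["interactive_fn"]

print(has_retrievable_source(json.dumps))      # function defined in a .py file
print(has_retrievable_source(interactive_fn))  # function defined from a string
```

A check like this could run before submission to fail early with a clear message, rather than failing during serialization.
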
ClusterExecutor (clustrix/executor.py): Central execution engine handling:
- Job submission to different schedulers (SLURM, PBS, SGE, Kubernetes, SSH)
- SSH connection management via Paramiko
- File transfer via SFTP
- Job monitoring and result collection
- Remote environment setup
Configuration System (clustrix/config.py): Singleton configuration management supporting:
- YAML/JSON file loading
- Hierarchical configuration (defaults → file → runtime)
- Standard location discovery (~/.clustrix/, /etc/clustrix/)
Utilities (clustrix/utils.py):
- Function serialization using cloudpickle/dill
- Environment capture and replication
- AST-based loop detection for parallelization
- Cluster-specific job script generation
Filesystem Utilities (clustrix/filesystem.py): Unified filesystem operations for local and remote clusters:
- ClusterFilesystem class: Core implementation handling local and SSH-based remote operations
- Convenience functions: cluster_ls(), cluster_find(), cluster_stat(), cluster_exists(), cluster_isdir(), cluster_isfile(), cluster_glob(), cluster_du(), cluster_count_files()
- Data classes: FileInfo for file metadata, DiskUsage for directory usage statistics
- Automatic connection management: SSH connections handled transparently
- Config-driven: Uses ClusterConfig to determine local vs remote operations
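
To illustrate the AST-based loop detection listed under Utilities, here is a minimal sketch using only the standard ast module. It is not the implementation in clustrix/utils.py; find_for_loops and the returned dictionary keys are hypothetical names chosen for this example.

```python
import ast
import textwrap


def find_for_loops(source: str) -> list[dict]:
    """Return line number, loop target, and iterable for every for-loop."""
    tree = ast.parse(textwrap.dedent(source))
    loops = []
    for node in ast.walk(tree):
        if isinstance(node, ast.For):
            loops.append(
                {
                    "line": node.lineno,
                    "target": ast.unparse(node.target),
                    "iterable": ast.unparse(node.iter),
                }
            )
    # ast.walk does not guarantee source order, so sort by line number.
    return sorted(loops, key=lambda loop: loop["line"])


SAMPLE = '''
def process(files):
    results = []
    for name in files:
        for i in range(3):
            results.append((name, i))
    return results
'''

for loop in find_for_loops(SAMPLE):
    print(loop)
```

A real parallelizer would also need to decide which loops are safe to split (no cross-iteration dependencies); this sketch only locates candidates.
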

Key Design Patterns

Strategy Pattern: Different cluster types handled via specific submission methods
Decorator Pattern: Clean API through @cluster decorator
Singleton Pattern: Global configuration instance
Template Pattern: Job script generation with cluster-specific variations
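
The Template Pattern entry can be pictured as a shared bash skeleton with scheduler-specific directive lines. The sketch below is an assumption about the general shape, not the generator in clustrix/utils.py; render_job_script and its parameters are hypothetical. The #SBATCH and #PBS directives shown are standard SLURM and PBS options.

```python
def render_job_script(cluster_type: str, job_name: str, cores: int, command: str) -> str:
    """Build a bash job script: common skeleton plus scheduler directives."""
    directives = {
        "slurm": [
            f"#SBATCH --job-name={job_name}",
            f"#SBATCH --cpus-per-task={cores}",
        ],
        "pbs": [
            f"#PBS -N {job_name}",
            f"#PBS -l nodes=1:ppn={cores}",
        ],
    }
    if cluster_type not in directives:
        raise ValueError(f"No script template for cluster type: {cluster_type}")
    # The shebang and the command are shared; only the directives vary.
    lines = ["#!/bin/bash", *directives[cluster_type], "", command]
    return "\n".join(lines) + "\n"


print(render_job_script("slurm", "demo", 8, "python run_job.py"))
```
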

Workflow

1. User decorates function with @cluster
2. Function and dependencies serialized with cloudpickle
3. Job submitted to cluster:
   - Remote directory created
   - Serialized data uploaded
   - Python environment setup (venv/conda)
   - Job script generated and submitted
4. Remote execution with unpickling
5. Results polled and downloaded

Important Considerations

The project is in beta (v0.1.0) and lacks test coverage
SSH-based clusters require proper key setup or password authentication
Remote environments are recreated based on local pip freeze output
Job scripts are bash-based with scheduler-specific directives
Results communicated via pickled files (result.pkl/error.pkl)
Automatic cleanup of remote files configurable via cleanup_remote_files
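
The result.pkl/error.pkl convention can be sketched as follows. This is an illustration only: it uses the standard pickle module as a stand-in for cloudpickle, and run_and_store, load_outcome, and the error payload layout are hypothetical, not taken from the Clustrix source.

```python
import pickle
import tempfile
import traceback
from pathlib import Path


def run_and_store(func, args, kwargs, job_dir: Path) -> None:
    """Remote side: run the function, write result.pkl or error.pkl."""
    try:
        result = func(*args, **kwargs)
    except Exception as exc:
        # Assumed payload layout: a readable error plus the full traceback.
        payload = {"error": repr(exc), "traceback": traceback.format_exc()}
        (job_dir / "error.pkl").write_bytes(pickle.dumps(payload))
    else:
        (job_dir / "result.pkl").write_bytes(pickle.dumps(result))


def load_outcome(job_dir: Path):
    """Local side: return the result, or raise if the job reported an error."""
    result_file = job_dir / "result.pkl"
    error_file = job_dir / "error.pkl"
    if result_file.exists():
        return pickle.loads(result_file.read_bytes())
    if error_file.exists():
        info = pickle.loads(error_file.read_bytes())
        raise RuntimeError(f"Remote job failed: {info['error']}")
    raise FileNotFoundError("Job has not produced result.pkl or error.pkl yet")


with tempfile.TemporaryDirectory() as tmp:
    job_dir = Path(tmp)
    run_and_store(sum, ([1, 2, 3],), {}, job_dir)
    print(load_outcome(job_dir))
```

When debugging, an error.pkl without a matching result.pkl points at a failure inside the user function rather than in the scheduler.
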

Common Tasks

Adding New Cluster Type Support

Add new cluster type to ClusterType enum in config.py
Implement _submit_{type}_job method in ClusterExecutor
Add status checking logic to get_job_status
Update job script generation in utils.py if needed
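
The first two steps might look like the sketch below. Only the ClusterType enum name and the _submit_{type}_job naming come from this file; the enum members, the SketchExecutor class, the getattr-based dispatch, and the "mysched" scheduler are assumptions made for illustration, and the handlers return placeholder strings instead of contacting a scheduler.

```python
from enum import Enum


class ClusterType(Enum):
    SLURM = "slurm"
    PBS = "pbs"          # deliberately left without a handler in this sketch
    MYSCHED = "mysched"  # hypothetical new scheduler being added


class SketchExecutor:
    """Stand-in for ClusterExecutor showing one way to dispatch by type."""

    def submit(self, cluster_type: ClusterType, script: str) -> str:
        # Look up the handler by the naming convention _submit_{type}_job.
        handler = getattr(self, f"_submit_{cluster_type.value}_job", None)
        if handler is None:
            raise NotImplementedError(f"No submit method for {cluster_type.value}")
        return handler(script)

    def _submit_slurm_job(self, script: str) -> str:
        return f"slurm job accepted ({len(script)} bytes)"

    def _submit_mysched_job(self, script: str) -> str:
        # A real implementation would upload the script and call the scheduler.
        return f"mysched job accepted ({len(script)} bytes)"


executor = SketchExecutor()
print(executor.submit(ClusterType.MYSCHED, "#!/bin/bash\necho hello\n"))
```
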

Using Filesystem Utilities

from clustrix import cluster, cluster_ls, cluster_find, cluster_stat, cluster_exists, cluster_glob
from clustrix.config import ClusterConfig

# Configure for local or remote operations
config = ClusterConfig(
    cluster_type="slurm",  # or "local" for local operations
    cluster_host="cluster.edu",
    username="researcher",
    remote_work_dir="/scratch/project"
)

# List directory contents (works locally and remotely)
files = cluster_ls("data/", config)

# Find files by pattern
csv_files = cluster_find("*.csv", "datasets/", config)

# Check file existence
if cluster_exists("results/output.json", config):
    print("Results already computed!")

# Get file information
file_info = cluster_stat("large_dataset.h5", config)
print(f"Dataset size: {file_info.size / 1e9:.1f} GB")

# Use with @cluster decorator for data-driven workflows
@cluster(cores=8)
def process_datasets(config):
    # Find all data files on the cluster
    data_files = cluster_glob("*.csv", "input/", config)
    
    results = []
    for filename in data_files:  # Loop gets parallelized automatically
        # Check file size before processing
        file_info = cluster_stat(filename, config)
        if file_info.size > 100_000_000:  # Large files
            result = process_large_file(filename, config)
        else:
            result = process_small_file(filename, config)
        results.append(result)
    
    return results

Debugging Remote Execution

Check remote logs in ~/.clustrix/jobs/{job_id}/
Enable SSH debug mode in Paramiko connection
Verify environment setup with pip freeze comparison
Check scheduler-specific logs (SLURM: slurm-*.out)
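
For the pip freeze comparison, a small helper can make mismatches obvious. This is a hedged sketch: parse_freeze and diff_environments are hypothetical names, the inputs are plain text captured from pip freeze on each side, and only name==version lines are considered (editable installs and URL requirements are ignored).

```python
def parse_freeze(text: str) -> dict[str, str]:
    """Parse 'name==version' lines from pip freeze output."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        packages[name.lower()] = version
    return packages


def diff_environments(local_freeze: str, remote_freeze: str) -> dict[str, list]:
    """Report packages that differ between local and remote environments."""
    local = parse_freeze(local_freeze)
    remote = parse_freeze(remote_freeze)
    return {
        "missing_on_remote": sorted(set(local) - set(remote)),
        "extra_on_remote": sorted(set(remote) - set(local)),
        "version_mismatch": sorted(
            (name, local[name], remote[name])
            for name in set(local) & set(remote)
            if local[name] != remote[name]
        ),
    }


# Example inputs are made up for illustration.
LOCAL = "numpy==1.26.0\nrequests==2.31.0\ncloudpickle==3.0.0\n"
REMOTE = "numpy==1.24.0\ncloudpickle==3.0.0\nparamiko==3.4.0\n"
print(diff_environments(LOCAL, REMOTE))
```
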

Configuration Priority

1. Runtime parameters (highest priority)
2. Configuration file (clustrix.yml)
3. Environment variables
4. Default values (lowest priority)
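
The priority order above can be modeled as a layered lookup. The sketch below uses collections.ChainMap, which searches its mappings from first to last; it is not how clustrix/config.py is implemented, and resolve_config, DEFAULTS, and the example values are hypothetical.

```python
from collections import ChainMap

# Hypothetical defaults for illustration only.
DEFAULTS = {"cluster_type": "local", "cores": 1, "cleanup_remote_files": True}


def resolve_config(runtime: dict, file_values: dict, env_values: dict,
                   defaults: dict = DEFAULTS) -> dict:
    """Merge settings; earlier arguments win over later ones."""
    return dict(ChainMap(runtime, file_values, env_values, defaults))


settings = resolve_config(
    runtime={"cores": 8},                                   # highest priority
    file_values={"cluster_type": "slurm", "cores": 4},      # clustrix.yml
    env_values={"cores": 2, "cleanup_remote_files": False},  # environment
)
print(settings)
```

Here cores resolves to the runtime value, cluster_type to the file value, and cleanup_remote_files to the environment value, because no higher-priority layer sets it.
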

⚠️ MANDATORY PRE-COMMIT WORKFLOW ⚠️

EVERY SINGLE TIME before committing/pushing:

Run quality checks repeatedly: python scripts/pre_push_check.py
- This runs black (formatting), flake8 (linting), mypy (type checking), and pytest (tests)
- It automatically retries up to 5 times until ALL checks pass
- Black may auto-fix formatting issues, so subsequent runs check if everything is clean
Only commit and push when ALL checks pass

CRITICAL: You must run checks repeatedly until they ALL pass. Don't just run once!

Pre-commit hooks will block commits that fail quality checks.

The GitHub Actions CI will fail if code doesn't pass black, flake8, mypy, and pytest. Always run the full cycle locally first.
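
The retry behavior described for scripts/pre_push_check.py can be pictured with the sketch below. It is not the script itself: run_until_clean is a hypothetical helper, and the checks are plain callables so the example runs without black, flake8, mypy, or pytest installed. In practice each callable would wrap a subprocess call and return True on exit code 0.

```python
from typing import Callable, Sequence


def run_until_clean(
    checks: Sequence[tuple[str, Callable[[], bool]]],
    max_attempts: int = 5,
) -> bool:
    """Run every check; repeat the full cycle until all pass or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        failed = [name for name, check in checks if not check()]
        if not failed:
            print(f"All checks passed on attempt {attempt}")
            return True
        print(f"Attempt {attempt}: failed {', '.join(failed)}")
    return False


# Simulated check that reports a problem once, fixes it, then passes.
state = {"formatted": False}


def simulated_formatter() -> bool:
    was_clean = state["formatted"]
    state["formatted"] = True
    return was_clean


print(run_until_clean([("format", simulated_formatter), ("lint", lambda: True)]))
```
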

Testing Guidelines

Test Organization

Unit Tests (tests/ excluding real_world/):

Run in GitHub Actions CI
Mock external dependencies
Fast execution, no external resources required
Included in coverage reports

Real-World Tests (tests/real_world/):

Excluded from GitHub Actions due to SSH/credential requirements
Run automatically in pre-push hook when credentials are available
Test actual cluster functionality, API calls, SSH connections
Marked with @pytest.mark.real_world

Pre-Push Hook Workflow

A custom pre-push hook (.git/hooks/pre-push) automatically runs real-world tests when:

Cluster credentials are available (environment variables, etc.)
Real-world test runner script exists

Hook behavior:

✅ With credentials: Runs all real-world tests and blocks push on failure
⚠️ Without credentials: Skips with warning, allows push
🔧 Script missing: Falls back to basic pytest real-world test run
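
The three hook outcomes amount to a small decision table. The sketch below is a hypothetical Python rendering of it, not the contents of .git/hooks/pre-push; it assumes the credential check comes first, which the list above does not state explicitly.

```python
def pre_push_action(has_credentials: bool, runner_exists: bool) -> str:
    """Return which action the hook would take for the given conditions."""
    if not has_credentials:
        # Push is allowed, but a warning is shown.
        return "skip-with-warning"
    if runner_exists:
        # Real-world tests run; a failure blocks the push.
        return "run-runner-script"
    # Runner script missing: fall back to a plain pytest real-world run.
    return "run-pytest-fallback"


for creds in (True, False):
    for runner in (True, False):
        print(creds, runner, pre_push_action(creds, runner))
```
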

Running Tests Manually

# All unit tests (CI-compatible)
pytest tests/ -m "not real_world"

# Real-world tests only
pytest tests/real_world/ -m real_world

# Using real-world test runner (when available)
python scripts/run_real_world_tests.py --filesystem
python scripts/run_real_world_tests.py --ssh  
python scripts/run_real_world_tests.py --api
python scripts/run_real_world_tests.py --visual

External Function Validation

When testing external functions and features (anything that cannot be tested directly on the local machine), we MUST validate at least once that they work, without resorting to local fallbacks, before checking the issue off as completed. Use the GitHub issue's spec criteria and comments to track what has been validated. API keys, usernames, and passwords may be stored locally for this purpose, provided they are stored securely. Once functionality has been verified (again, WITHOUT resorting to fallbacks), mark it as "tested" and use mocked functions or objects in pytest from then on to avoid incurring excessive API fees.
