Add CaloClusterGNN training pipeline (#7)
Adds a CaloClusterGNN/ subdirectory containing the full training
pipeline for the GNN calorimeter-clustering algorithm intended as a
parallel to the existing seed+BFS CaloClusterMaker in Mu2e Offline.
The deployed recipe is "CCN+BFS10": CaloClusterNet edge classifier +
BFS-style traversal with ExpandCut = 10 MeV.
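The CCN+BFS10 recipe above can be illustrated with a minimal sketch: assume the edge classifier has already accepted a set of crystal-adjacency edges, and a BFS grows each cluster across accepted edges, with ExpandCut gating which hits seed further expansion. The function name, data shapes, and exact ExpandCut semantics here are assumptions, not the repo's `cluster_reco.py`:

```python
from collections import deque

def bfs_cluster(seed, hit_energy, neighbors, edge_pass, expand_cut=10.0):
    """Grow one cluster from `seed` (illustrative sketch only).

    hit_energy: {hit_id: energy in MeV}
    neighbors:  {hit_id: adjacent hit ids from crystal geometry}
    edge_pass:  set of frozenset({a, b}) edges the GNN edge
                classifier accepted at tau_edge
    expand_cut: assumed ExpandCut semantics -- hits below this energy
                join the cluster but do not seed further expansion
    """
    cluster, seen, frontier = [], {seed}, deque([seed])
    while frontier:
        h = frontier.popleft()
        cluster.append(h)
        for n in neighbors.get(h, ()):
            # Only cross edges the classifier accepted, once per hit.
            if n in seen or frozenset((h, n)) not in edge_pass:
                continue
            seen.add(n)
            if hit_energy[n] >= expand_cut:
                frontier.append(n)   # energetic hit: keep expanding
            else:
                cluster.append(n)    # low-energy hit: absorb only
    return cluster
```

The point of the sketch is that the GNN only replaces the edge-acceptance decision; the traversal itself stays BFS-shaped, which is what makes the deployed recipe a drop-in parallel to the seed+BFS CaloClusterMaker.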
Layout (modelled on TrkQual/, but Python-package shaped):
CaloClusterGNN/
README.md how to retrain, frozen hyperparams,
deployment cross-link
setup_env.sh wraps setupmu2e-art.sh + ana 2.6.1
src/
data/ graph builder, calo-entrant truth labels,
normalisation, packed dataset
geometry/ crystalId -> (x, y, disk) loader
models/ SimpleEdgeNet, CaloClusterNet, layers, heads
training/ losses, metrics, trainer
inference/ cluster reconstruction (cluster_reco.py),
postprocess (kept here so train-time eval
scripts work end to end)
scripts/ build/pack/train/tune/evaluate pipeline,
failure audits, cluster-physics eval, ancestry
validation, run1B no-field eval, plotting
configs/ five YAML configs (one per training run)
tests/ 88 unit tests covering all of src/ above
splits/ frozen 35/7/8 v2 split file lists
data/ crystal_geometry.csv + crystal_neighbors.csv +
crystal_map_raw.csv (small lookup tables)
What does NOT live here:
* The deployment-side ONNX export / parity scripts (export_onnx.py,
export_norm_stats.py, validate_onnx.py, dump_parity_payloads.py,
compare_parity_dump.py) and the deploy wrappers
(calo_cluster_net_deploy.py, simple_edge_net_deploy.py) -- those
belong with the Mu2e/Offline integration PR, not the training repo.
* The `.onnx` artifacts themselves (shipped via Mu2e data area, not
versioned in MLTrain -- same convention TrkQual follows).
* Large run outputs and processed graphs (regenerable from
EventNtuple ROOT files via scripts/build_all_graphs.sh).
The v2 training data requires the `calomcsim.ancestorSimIds` branch
added in Mu2e/EventNtuple (PR pending). The README will cross-link
there once the EventNtuple PR has a number.
Test suite: 88/88 passing in this layout via
`python3 -m unittest discover -s tests -p "test_*.py" -v` after
`source setup_env.sh`.
Both trained models in CaloClusterGNN/ now have a complete training-
to-ONNX path inside MLTrain (consistent with the TrkQual pattern of
shipping conversion scripts alongside training).
New / restored:
* src/models/calo_cluster_net_deploy.py tensor-API wrapper around
CaloClusterNet (no PyG Data, no node-saliency head); used by ONNX
export so torch.onnx.export can trace it.
* src/models/simple_edge_net_deploy.py same shape for SimpleEdgeNet.
No node head to bypass, so it's a thin pass-through.
* scripts/export_onnx.py --model {ccn,sen} flag with
per-model presets (checkpoint, output path, model_version). Stamps
metadata_props {model_version, node_features, edge_features} into
the .onnx after export.
* scripts/export_norm_stats.py writes the train-split z-score
stats next to the .onnx as a flat JSON sidecar so the C++ side
doesn't need a LibTorch dep to read 28 floats.
* scripts/validate_onnx.py --model flag with per-model
preset for tau_edge and tolerance. Asserts:
- max abs-diff edge_logits within tol on the full val split
- zero per-edge threshold flips at tau_edge (proxy for cluster-
reco byte-equivalence with the deployed C++ pipeline)
* tests/test_calo_cluster_net_deploy.py (9 tests)
* tests/test_export_onnx.py (5 tests)
* tests/test_export_norm_stats.py (8 tests)
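The deploy-wrapper idea above (tensor-only signature, node-saliency head dropped) can be sketched without PyTorch; `CaloClusterNetStub` below is a hypothetical stand-in for the real two-headed model, not the repo's code:

```python
class CaloClusterNetStub:
    """Hypothetical stand-in for the real two-headed model."""
    def __call__(self, x, edge_index, edge_attr):
        edge_logits = [sum(e) for e in edge_attr]   # placeholder math
        node_saliency = [sum(n) for n in x]         # the head to drop
        return edge_logits, node_saliency

class CaloClusterNetDeploy:
    """Tensor-only facade in the spirit of calo_cluster_net_deploy.py:
    fixed positional inputs, a single output, no graph Data object and
    no node-saliency head, so an exporter can trace a plain
    (x, edge_index, edge_attr) -> edge_logits signature."""
    def __init__(self, model):
        self.model = model
    def __call__(self, x, edge_index, edge_attr):
        edge_logits, _unused_node_head = self.model(x, edge_index, edge_attr)
        return edge_logits
```

The wrapper holds no logic of its own, which is what makes SimpleEdgeNet's version a thin pass-through.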
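For the norm-stats sidecar, a stdlib-only sketch of the idea: 6 node + 8 edge features give 14 means + 14 standard deviations, i.e. the 28 floats mentioned above. The function name and JSON key names are illustrative, not the repo's actual schema:

```python
import json
import statistics

def export_norm_stats(node_feats, edge_feats, path=None):
    """Compute per-feature z-score stats and (optionally) write them
    as a flat JSON sidecar next to the .onnx, so the consumer only
    needs a JSON reader, not LibTorch."""
    stats = {}
    for prefix, rows in (("node", node_feats), ("edge", edge_feats)):
        for i, col in enumerate(zip(*rows)):   # iterate feature columns
            stats[f"{prefix}_mean_{i}"] = statistics.fmean(col)
            stats[f"{prefix}_std_{i}"] = statistics.pstdev(col)
    if path is not None:
        with open(path, "w") as f:
            json.dump(stats, f, indent=2)
    return stats
```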
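The two validate_onnx.py assertions reduce to a max-abs-diff scan plus a threshold-flip count; a plain-Python rendering (function name and return shape are assumptions):

```python
def parity_check(torch_logits, onnx_logits, tau_edge, tol):
    """Compare per-edge logits from the two runtimes.

    Returns (max_abs_diff, n_flips); the real script would assert
    max_abs_diff <= tol and n_flips == 0.
    """
    max_abs_diff = max(abs(a - b) for a, b in zip(torch_logits, onnx_logits))
    # A "flip" is an edge whose accept/reject decision at tau_edge
    # differs between the two runtimes -- the proxy for cluster-reco
    # equivalence, since the BFS only ever sees thresholded edges.
    n_flips = sum((a >= tau_edge) != (b >= tau_edge)
                  for a, b in zip(torch_logits, onnx_logits))
    return max_abs_diff, n_flips
```

Zero flips is the stronger condition in practice: small numerical drift within `tol` is harmless as long as no edge crosses tau_edge.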
README extended with an "Exporting a Trained Model to ONNX" section
that documents the full chain for both models, the
metadata_props deployment contract, and the per-model frozen
tau_edge/tol values used by validate_onnx.py.
Test count goes from 88 to 110 (4 conditionally skipped on a fresh
checkout when no trained checkpoint is present locally; this is by
design and the skip messages name the missing file).
Also acknowledges Claude assistance in README.
PR #7 Review: Add CaloClusterGNN training pipeline

Scope: +18,029 lines — full training pipeline added under CaloClusterGNN/.

🔴 Bugs / Correctness Issues

1. Hardcoded absolute paths in committed shell scripts:

```
cd /exp/mu2e/app/users/wzhou2/projects/calorimeter/GNN
source setup_env.sh
ROOT_DIR=/exp/mu2e/data/users/wzhou2/GNN/root_files_v2
```

Anyone else who clones the repo and follows the README's instructions
will hit these user-specific paths; the same hardcoded root directory
also appears in the committed split files. These should at least be
parameterized; ideally the split files should hold filenames only and
the root directory should be a config/CLI option.
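The reviewer's suggestion could look like the following sketch: split files hold bare filenames, and the data root arrives via configuration (e.g. an argparse `--root-dir` flag) instead of a path committed to the repo. Function and flag names here are hypothetical:

```python
from pathlib import Path

def resolve_split(split_file, root_dir):
    """Join bare filenames from a split file onto a user-supplied
    data root, so the committed split files carry no absolute paths."""
    root = Path(root_dir)
    names = [line.strip()
             for line in Path(split_file).read_text().splitlines()
             if line.strip()]
    return [root / name for name in names]
```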
Summary

Adds a new CaloClusterGNN/ subdirectory containing the full training
pipeline for a Graph Neural Network calorimeter-clustering algorithm,
intended to run alongside the existing seed+BFS CaloClusterMaker in
Mu2e Offline. The deployed recipe is CCN+BFS10: CaloClusterNet edge
classifier followed by BFS-style traversal at ExpandCut = 10 MeV.

Layout follows the convention set by TrkQual/ and TrkPID/ — one
top-level subdirectory per algorithm, self-contained.
What's in here
Two model classes train from the same pipeline: SimpleEdgeNet and
CaloClusterNet, each with its own frozen tau_edge. Both share the
input graph (one per calorimeter disk per event, 6 node features +
8 edge features) and z-score normalisation, so swapping models in
deployment is config-only.
Headline result
On the MDC2025 mixed-pileup test set (276,688 events, 481,543
disk-graphs), CCN+BFS10 beats BFS on every downstream-relevant
cluster-physics metric for E_reco >= 50 MeV clusters (those that
matter for track finding):
In the 95-110 MeV signal region (47,279 clusters), mean abs(dE)
drops from 0.368 (BFS) to 0.210 (-43%) and mean dr drops from
0.559 mm to 0.460 mm (-18%).
Reproducibility
All scripts run after `source setup_env.sh`. Frozen hyperparameters
and the exact recipe values live in configs/ and are documented in
the README.
Coordinated PRs
* Mu2e/EventNtuple (PR pending): adds calomcsim.ancestorSimIds to
SimInfo. The v2 training data uses calo-entrant ancestor truth, which
requires this branch. Link this PR into the EventNtuple PR once it
has a number.
* Mu2e/Offline integration PR: new modules (CaloHitGraphMaker,
CaloClusterMakerGNN) under Offline/CaloCluster/. Loads the .onnx
exported by this repo via art::ConfigFileLookupPolicy, asserts
metadata_props agreement (model_version, node_features,
edge_features) against FHiCL, and emits a CaloClusterCollection under
instance name "GNN" so existing BFS-reading analyses keep working.
C++↔Python parity has been validated byte-exactly on the val split
(100/100 disk-graphs, 8,502 hits) using a parity-dump analyzer +
Python comparison harness.
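The metadata_props-vs-FHiCL assertion described above lives in the C++ module; rendered in Python for illustration (helper name and error handling are assumptions, only the three key names come from the PR):

```python
REQUIRED_KEYS = ("model_version", "node_features", "edge_features")

def check_deploy_contract(onnx_props, fhicl_cfg):
    """Every required metadata_props key stamped into the .onnx must
    match the FHiCL configuration exactly, else refuse to run."""
    mismatches = {}
    for key in REQUIRED_KEYS:
        got, want = onnx_props.get(key), str(fhicl_cfg[key])
        if got != want:
            mismatches[key] = (got, want)
    if mismatches:
        raise RuntimeError(f"metadata_props disagree with FHiCL: {mismatches}")
    return True
```

Failing loudly at module construction is the safer choice here: a silently mismatched model version would produce plausible-looking but wrong clusters.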
Tests
The 4 skipped tests are conditional — they exercise loading a real
trained checkpoint or the exported .onnx, and self-skip with a clear
message when those files aren't in the local checkout (the case for
a fresh clone).
Acknowledgement
Implementation, refactoring, and documentation drafting in this
subdirectory were assisted by Anthropic's Claude (Claude Code). All
scientific decisions, hyperparameter choices, validation results, and
the v1→v2 truth-definition campaign are my own work.