
Cosmic Ray Veto ML training code #6

Open

sam-grant wants to merge 3 commits into Mu2e:main from sam-grant:sgrant/crv-cosmic

Conversation


@sam-grant sam-grant commented Apr 29, 2026

Cosmic Ray Veto ML training code

Adds CrvCosmic/, an XGBoost-based classifier for vetoing cosmic-ray-induced backgrounds, ported from sam-grant/mu2e-cosmic. The model takes per-coincidence CRV and tracker features and outputs a probability that the coincidence matched with the track is cosmic-induced. Trained on CosmicCRYSignalAllOnSpillTriggered (pure cosmics) vs. CeEndpointMix2BBTriggered (beam + pileup).

Only the cuts and helpers actually used by the ML preselection are ported across from sam-grant/mu2e-cosmic; the non-ML parts of the upstream cosmic-background framework are excluded.

Pipeline

  1. Process: Read ROOT EventNtuple files, apply the preprocessing cutset, flatten per-coincidence features into awkward output.
  2. Assemble: Load processed CRY and CE-mix outputs, label, combine, and produce K-fold indices for nested cross-validation (CV).
  3. Train: Fit XGBoost on each fold, find the per-fold operating threshold at a target veto efficiency, then retrain on the full set with the CV-averaged threshold and export to .ubj.
  4. Validate: ROC AUC, score distributions, feature importance, and a "money table" comparing the ML model against the simple ∆t cut baseline.
  5. Optimise: (optional) grid search over XGBoost hyperparameters via k-fold nested CV, minimising deadtime at the target veto efficiency.

Layout

  • config/cuts.yaml: Cutset definitions; MLPreprocess is the only one defined here
  • src/core/: Cut flow & feature helpers, postprocessing combiners
  • src/ml/: MLProcessor, LoadML, AssembleDataset, Train, Validate, Optimise
  • src/utils/: IO, histogram booking, plotting, mu2e.mplstyle
  • run/run_ml_prep.py: Entry point for processing ROOT files into parquet/pkl for training
  • notebooks/: assemble, feature engineering, optimise, train, validate

Dependencies

Requires Mu2e/pyutils (provided by pyenv ana) for pyutils.pycut.CutManager, plus xgboost, scikit-learn, awkward, pandas, hist, pyarrow, h5py, joblib, pyyaml.

Testing

Ran end-to-end on the gpvms (commit b78cf8f)

@sam-grant sam-grant marked this pull request as ready for review April 29, 2026 21:21
@sam-grant (Author)

@oksuzian @AndrewEdmonds11 would you mind taking a look? I'm not able to add reviewers :)

@oksuzian (Contributor)

PR #6 Review: Cosmic Ray Veto ML training code

Summary

  • Author: @sam-grant
  • State: open, mergeable (clean), no reviews yet, 1 issue comment, 0 review comments
  • Scope: +9,162 / -0 lines across 24 new files — adds a self-contained CrvCosmic/ ML pipeline (XGBoost cosmic veto), ported from sam-grant/mu2e-cosmic
  • Risk: Medium — new isolated subdirectory, but big surface area with several latent bugs and reproducibility concerns

Issues found

🔴 Bugs

  1. optimise.py uses incompatible data shape in grid_search (line ~50). It calls Train(self.data, ...) where self.data is the dict from assemble_dataset() containing X, y, outer_folds, etc. But Train.__init__ indexes data["X_train"], data["X_test"], data["y_train"], ... — those keys don't exist on the assemble dict. grid_search will KeyError on first run. Only grid_search_cv paths through X_train/metadata_train correctly. → Either remove grid_search or require callers to pass a get_full_train_data(data) / get_fold_data(...) dict.

  2. optimise.py grid_search_cv also assumes X_train/y_train/metadata_train keys (lines ~135–140), but AssembleDataset.assemble_dataset() returns top-level X/y/metadata. Same KeyError. The notebook must be doing one of these conversions implicitly; the public API is inconsistent.

  3. process.py logger overwritten (lines ~138–142 inside __init__): logger is created as [MLPreprocess], then re-created as [MLPreprocessor] with no verbosity argument — silently drops the user-provided verbosity for the second instance. Remove the duplicate block.

  4. run_ml_prep.py cuts_to_toggle argument unused: run() accepts it but never forwards it to MLProcessor.

  5. run_ml_prep.py missing newline at EOF; train.py likewise.

  6. assemble.py hardcoded col_to_drop (line ~190): comment explicitly flags this as a smell. If the upstream process.py changes feature names, training silently drops/keeps wrong columns. Make this driven by feature_set or assert columns exist.

  7. train.py _fit_keras_model reuses the same compiled model instance across folds — in train_cv, every fold instantiates a new Train, but if a user passes a Keras model (instance, not class), all folds share weights. Not triggered for the default XGBoost path, but a footgun.

  8. validate.py find_threshold tie handling: the scan uses event_max_proba >= thr and then picks above_target[-1] (the highest threshold meeting the target). With a grid of 10,000 points this is fine, and because signal/veto efficiencies are reported with the same >= convention the operating-point math is self-consistent. The real issue is naming: signal_efficiency = 1 - deadtime is defined here as 1 - FPR, not signal efficiency in the usual ML sense, which is misleading and worth calling out.

  9. analyse.py within_t0 / within_t0err inconsistency: within_t0err is registered only via the at_trk_mid mask and is always defined, while within_t0 is defined only inside the if self.on_spill: block, so the two cuts can disagree about which segments are masked. A cuts.yaml with within_t0: true and on_spill: false would silently skip the cut at runtime (no error).
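For bugs 1–2, the fix suggested above amounts to an adapter between the assemble_dataset() dict (top-level X/y/metadata) and the X_train/X_test/... contract that Train.__init__ indexes. A minimal sketch, assuming NumPy-indexable arrays; get_full_train_data and the key names mirror the review's suggestion and are hypothetical, not existing CrvCosmic API:

```python
import numpy as np

def get_full_train_data(data, test_frac=0.2, seed=0):
    """Adapt an assemble_dataset()-style dict (keys X, y, metadata)
    to the X_train/X_test/... contract expected by Train.__init__.
    Hypothetical helper; adjust key names to the real classes."""
    n = len(data["y"])
    idx = np.random.default_rng(seed).permutation(n)
    split = int(n * (1 - test_frac))
    tr, te = idx[:split], idx[split:]
    return {
        "X_train": data["X"][tr], "X_test": data["X"][te],
        "y_train": data["y"][tr], "y_test": data["y"][te],
        "metadata_train": data["metadata"][tr],
        "metadata_test": data["metadata"][te],
    }
```

Routing both grid_search and grid_search_cv through one such adapter (or the existing fold machinery) would make the public API consistent with what the notebooks actually pass.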

🟡 Reproducibility / hygiene

  • No tests. Description says "ran end-to-end on the gpvms" — that's manual. At minimum, add a smoke test that imports each module and runs MLProcessor on the --test fixture.
  • warnings.filterwarnings("ignore") in assemble.py (module-level) suppresses warnings globally for any importer. Move into a context manager or remove.
  • sys.path.extend([...]) everywhere: src/core/analyse.py, src/ml/process.py, src/ml/assemble.py, src/ml/load.py, src/ml/optimise.py, and run/run_ml_prep.py all mutate sys.path. Make CrvCosmic a proper package with __init__.py re-exports and use relative imports; today these __init__.py files are empty.
  • Hardcoded run_str = "k" in run_ml_prep.py and default run="h" in LoadML, run="j" in Train/Optimise, run="k" in AssembleDataset — inconsistent defaults invite silently reading the wrong directory. Pick one source of truth (env var, CLI arg, or config file).
  • LoadML.get_results() hardcodes tags ("CRY_onspill-LH_aw", "CE_mix2BB_onspill-LH_aw") that must match run_ml_prep.py configs. Pull these from a shared config.
  • feature_set arg defaults differ: MLProcessor defaults to "trk", but the run_ml_prep.py test configs use "crv". The "trk" branch in _process_array doesn't set event/subrun, which downstream assemble.py requires → broken if anyone uses the default.
  • Large amount of commented-out code in process.py (the coinc_idx, _test, alternate branches sections). Strip before merging.
  • .ipynb files committed with outputs (5,320 lines total). Strip outputs (nbstripout) — notebooks dominate the diff and bloat the repo.
  • Bare except Exception as e everywhere in analyse.py and process.py, all re-raised after a log — adds noise without adding context. Consider letting exceptions propagate naturally except where you can recover.
  • No license header consistency — some files have Sam Grant 2025, most don't.
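On the warnings point above: the module-level filter can be replaced by a scoped suppression so importers of assemble.py are unaffected. A minimal sketch; load_quietly is a hypothetical name, not an existing CrvCosmic function:

```python
import warnings

def load_quietly(loader, *args, **kwargs):
    """Run `loader` with warnings suppressed. catch_warnings() saves and
    restores the global filter state, so the "ignore" filter is confined
    to this call rather than leaking to every importer of the module."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return loader(*args, **kwargs)
```

The same pattern works inline with a `with warnings.catch_warnings():` block around just the noisy call site.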

🟢 Nits

  • analyse.py: Logger print_prefix mismatch between MLProcessor ([MLPreprocess] then [MLPreprocessor]).
  • optimise.py: imports matplotlib.pyplot as plt at module top — Mu2e style only applied via Plot(verbosity=0); results depend on import order.
  • validate.py plot_score_distribution_cv uses alpha=0.1 per fold — invisible for >10 folds.
  • cuts.yaml MLPreprocess sets within_coinc_start_time: false twice (it's already false in defaults — redundant).

Merge readiness

Not ready as-is. The two KeyError bugs in optimise.py and the silent-default feature_set="trk" issue are blockers — the public API doesn't match what notebooks/runners actually pass.

Suggested next steps

  1. Fix grid_search / grid_search_cv data contract.
  2. Make feature_set="crv" the default (or remove the "trk" branch).
  3. Convert CrvCosmic into an importable package; drop sys.path hacks.
  4. Strip notebook outputs and dead/commented code.
  5. Add a minimal pytest that imports all modules + runs the --test path on tiny stub inputs.

Want me to (1) open issues for the blocker bugs, (2) draft a PR with the import/package cleanup, or (3) walk through a specific module (e.g. validate.py threshold math) in more depth?
