
Cosmic Ray Veto ML training code #6

Open

sam-grant wants to merge 3 commits into Mu2e:main from sam-grant:sgrant/crv-cosmic

Conversation


@sam-grant sam-grant commented Apr 29, 2026

Cosmic Ray Veto ML training code

Adds CrvCosmic/, an XGBoost-based classifier for vetoing cosmic-ray-induced backgrounds, ported from sam-grant/mu2e-cosmic. The model takes per-coincidence CRV and tracker features and outputs a probability that the coincidence matched with the track is cosmic-induced. Trained on CosmicCRYSignalAllOnSpillTriggered (pure cosmics) vs. CeEndpointMix2BBTriggered (beam + pileup).

Only the cuts and helpers actually used by the ML preselection are ported across from sam-grant/mu2e-cosmic; the non-ML parts of the upstream cosmic-background framework are excluded.

Pipeline

  1. Process: Read ROOT EventNtuple files, apply the preprocessing cutset, flatten per-coincidence features into awkward output.
  2. Assemble: Load processed CRY and CE-mix outputs, label, combine, and produce K-fold indices for nested cross-validation (CV).
  3. Train: Fit XGBoost on each fold, find the per-fold operating threshold at a target veto efficiency, then retrain on the full set with the CV-averaged threshold and export to .ubj.
  4. Validate: ROC AUC, score distributions, feature importance, and a "money table" comparing the ML model against the simple ∆t cut baseline.
  5. Optimise: (optional) grid search over XGBoost hyperparameters via k-fold nested CV, minimising deadtime at the target veto efficiency.

Layout

  • config/cuts.yaml: Cutset definitions; MLPreprocess is the only one defined here
  • src/core/: Cut flow & feature helpers, postprocessing combiners
  • src/ml/: MLProcessor, LoadML, AssembleDataset, Train, Validate, Optimise
  • src/utils/: IO, histogram booking, plotting, mu2e.mplstyle
  • run/run_ml_prep.py: Entry point for processing ROOT files into parquet/pkl for training
  • notebooks/: assemble, feature engineering, optimise, train, validate

Dependencies

Requires Mu2e/pyutils (provided by pyenv ana) for pyutils.pycut.CutManager, plus xgboost, scikit-learn, awkward, pandas, hist, pyarrow, h5py, joblib, pyyaml.

Testing

Ran end-to-end on the gpvms (commit b78cf8f)

@sam-grant sam-grant marked this pull request as ready for review April 29, 2026 21:21
@sam-grant (Author)

@oksuzian @AndrewEdmonds11 would you mind taking a look? I'm not able to add reviewers :)

@oksuzian (Contributor)

PR #6 Review: Cosmic Ray Veto ML training code

Summary

  • Author: @sam-grant
  • State: open, mergeable (clean), no reviews yet, 1 issue comment, 0 review comments
  • Scope: +9,162 / -0 lines across 24 new files — adds a self-contained CrvCosmic/ ML pipeline (XGBoost cosmic veto), ported from sam-grant/mu2e-cosmic
  • Risk: Medium — new isolated subdirectory, but big surface area with several latent bugs and reproducibility concerns

Issues found

🔴 Bugs

  1. optimise.py uses incompatible data shape in grid_search (line ~50). It calls Train(self.data, ...) where self.data is the dict from assemble_dataset() containing X, y, outer_folds, etc. But Train.__init__ indexes data["X_train"], data["X_test"], data["y_train"], ... — those keys don't exist on the assemble dict. grid_search will KeyError on first run. Only grid_search_cv paths through X_train/metadata_train correctly. → Either remove grid_search or require callers to pass a get_full_train_data(data) / get_fold_data(...) dict.

  2. optimise.py grid_search_cv also assumes X_train/y_train/metadata_train keys (lines ~135–140), but AssembleDataset.assemble_dataset() returns top-level X/y/metadata. Same KeyError. The notebook must be doing one of these conversions implicitly; the public API is inconsistent.

  3. process.py logger overwritten (lines ~138–142 inside __init__): logger is created as [MLPreprocess], then re-created as [MLPreprocessor] with no verbosity argument — silently drops the user-provided verbosity for the second instance. Remove the duplicate block.

  4. run_ml_prep.py cuts_to_toggle argument unused: run() accepts it but never forwards it to MLProcessor.

  5. run_ml_prep.py missing newline at EOF; train.py likewise.

  6. assemble.py hardcoded col_to_drop (line ~190): comment explicitly flags this as a smell. If the upstream process.py changes feature names, training silently drops/keeps wrong columns. Make this driven by feature_set or assert columns exist.

  7. train.py _fit_keras_model reuses the same compiled model instance across folds — in train_cv, every fold instantiates a new Train, but if a user passes a Keras model (instance, not class), all folds share weights. Not triggered for the default XGBoost path, but a footgun.

  8. validate.py find_threshold tie handling: the scan uses event_max_proba >= thr and then picks above_target[-1] (the highest threshold meeting the target). With a grid of 10,000 points this is fine, and because signal/veto efficiencies are reported with the same >= convention the operating-point math is self-consistent. The real issue is naming: signal_efficiency = 1 - deadtime is defined here as 1 - FPR, not signal efficiency in the usual ML sense, which is misleading and worth calling out.

  9. analyse.py within_t0 / within_t0err inconsistency: within_t0err is registered only via the at_trk_mid mask and is always defined, while within_t0 is defined only inside the if self.on_spill: block, so the two cuts can disagree about which segments are masked. A cuts.yaml with within_t0: true and on_spill: false would silently skip the cut at runtime (no error).
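For bugs 1–2, the fix suggested above amounts to an adapter between the assemble_dataset() dict (top-level X/y/metadata) and the X_train/X_test/... contract that Train.__init__ indexes. A minimal sketch, assuming NumPy-indexable arrays; get_full_train_data and the key names mirror the review's suggestion and are hypothetical, not existing CrvCosmic API:

```python
import numpy as np

def get_full_train_data(data, test_frac=0.2, seed=0):
    """Adapt an assemble_dataset()-style dict (keys X, y, metadata)
    to the X_train/X_test/... contract expected by Train.__init__.
    Hypothetical helper; adjust key names to the real classes."""
    n = len(data["y"])
    idx = np.random.default_rng(seed).permutation(n)
    split = int(n * (1 - test_frac))
    tr, te = idx[:split], idx[split:]
    return {
        "X_train": data["X"][tr], "X_test": data["X"][te],
        "y_train": data["y"][tr], "y_test": data["y"][te],
        "metadata_train": data["metadata"][tr],
        "metadata_test": data["metadata"][te],
    }
```

Routing both grid_search and grid_search_cv through one such adapter (or the existing fold machinery) would make the public API consistent with what the notebooks actually pass.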

🟡 Reproducibility / hygiene

  • No tests. Description says "ran end-to-end on the gpvms" — that's manual. At minimum, add a smoke test that imports each module and runs MLProcessor on the --test fixture.
  • warnings.filterwarnings("ignore") in assemble.py (module-level) suppresses warnings globally for any importer. Move into a context manager or remove.
  • sys.path.extend([...]) everywhere: src/core/analyse.py, src/ml/process.py, src/ml/assemble.py, src/ml/load.py, src/ml/optimise.py, and run/run_ml_prep.py all mutate sys.path. Make CrvCosmic a proper package with __init__.py re-exports and use relative imports; today these __init__.py files are empty.
  • Hardcoded run_str = "k" in run_ml_prep.py and default run="h" in LoadML, run="j" in Train/Optimise, run="k" in AssembleDataset — inconsistent defaults invite silently reading the wrong directory. Pick one source of truth (env var, CLI arg, or config file).
  • LoadML.get_results() hardcodes tags ("CRY_onspill-LH_aw", "CE_mix2BB_onspill-LH_aw") that must match run_ml_prep.py configs. Pull these from a shared config.
  • feature_set arg defaults differ: MLProcessor defaults to "trk", but the run_ml_prep.py test configs use "crv". The "trk" branch in _process_array doesn't set event/subrun, which downstream assemble.py requires → broken if anyone uses the default.
  • Large amount of commented-out code in process.py (the coinc_idx, _test, alternate branches sections). Strip before merging.
  • .ipynb files committed with outputs (5,320 lines total). Strip outputs (nbstripout) — notebooks dominate the diff and bloat the repo.
  • Bare except Exception as e everywhere in analyse.py and process.py, all re-raised after a log — adds noise without adding context. Consider letting exceptions propagate naturally except where you can recover.
  • No license header consistency — some files have Sam Grant 2025, most don't.
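On the warnings point above: the module-level filter can be replaced by a scoped suppression so importers of assemble.py are unaffected. A minimal sketch; load_quietly is a hypothetical name, not an existing CrvCosmic function:

```python
import warnings

def load_quietly(loader, *args, **kwargs):
    """Run `loader` with warnings suppressed. catch_warnings() saves and
    restores the global filter state, so the "ignore" filter is confined
    to this call rather than leaking to every importer of the module."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return loader(*args, **kwargs)
```

The same pattern works inline with a `with warnings.catch_warnings():` block around just the noisy call site.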

🟢 Nits

  • analyse.py: Logger print_prefix mismatch between MLProcessor ([MLPreprocess] then [MLPreprocessor]).
  • optimise.py: imports matplotlib.pyplot as plt at module top — Mu2e style only applied via Plot(verbosity=0); results depend on import order.
  • validate.py plot_score_distribution_cv uses alpha=0.1 per fold — invisible for >10 folds.
  • cuts.yaml MLPreprocess sets within_coinc_start_time: false twice (it's already false in defaults — redundant).

Merge readiness

Not ready as-is. The two KeyError bugs in optimise.py and the silent-default feature_set="trk" issue are blockers — the public API doesn't match what notebooks/runners actually pass.

Suggested next steps

  1. Fix grid_search / grid_search_cv data contract.
  2. Make feature_set="crv" the default (or remove the "trk" branch).
  3. Convert CrvCosmic into an importable package; drop sys.path hacks.
  4. Strip notebook outputs and dead/commented code.
  5. Add a minimal pytest that imports all modules + runs the --test path on tiny stub inputs.

Want me to (1) open issues for the blocker bugs, (2) draft a PR with the import/package cleanup, or (3) walk through a specific module (e.g. validate.py threshold math) in more depth?
