FilterFragment is an archival snapshot of the quality-score filtering evaluation used in Chapter 2 of the doctoral thesis by Enes Kemal Ergin.
The repository preserves the Jupyter notebooks used to benchmark Spectronaut quality scores, quantify the consequences of mass-label misassignment, and measure the impact of threshold-based filtering on peptide identification confidence in N-terminal biotin-labeled HeLa DIA proteomics experiments. It is shared as a citable research artifact for understanding the thesis analysis — not as a generalized proteomics tool or reusable software package.

This repository is:

- A thesis-analysis snapshot of the quality-score filtering evaluation from Chapter 2.
- A record of the statistical and visual analyses used to select and validate filtering thresholds for biotin-labeled DIA data.
- A set of HeLa analysis notebooks with frozen outputs from the thesis.
- A small library of reusable helper modules (src/) that supported the notebook workflows.

This repository is not:

- A packaged software library or command-line tool.
- A general-purpose pipeline for arbitrary DIA quality filtering or peptide-level data cleaning.
- A fully self-contained rerun from a fresh clone: raw and processed data files are not distributed with this repository.
- A validated or generalized framework for non-Spectronaut or non-HeLa workflows.

In biotin-based N-terminomics, cell-surface or N-terminal peptides are selectively enriched via NHS ester chemistry and identified by their biotin modification. Two sources of analytical error are directly evaluated in this repository.
Spectronaut search results using the correct biotin mass (UniMod:3, +226.078 Da) are compared against results generated with an incorrect mass assignment (UniMod:92, +339.162 Da). The wrong-mass search substantially degrades N-terminal annotation accuracy — from approximately 65% to approximately 35% — directly motivating careful search-parameter selection and serving as a cautionary baseline for the filtering work.
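To put the two mass assignments in perspective, the short sketch below works out the difference between them and expresses it in parts per million for a single precursor. The precursor mass of 1500 Da is a hypothetical value chosen only for illustration, and `ppm_error` is an illustrative name, not a function from src/.

```python
# Monoisotopic mass shifts quoted in the text above (Da).
BIOTIN_UNIMOD_3 = 226.078
WRONG_UNIMOD_92 = 339.162


def ppm_error(delta_da: float, precursor_mass_da: float) -> float:
    """Express an absolute mass difference as parts per million of a precursor mass."""
    return delta_da / precursor_mass_da * 1e6


# Difference between the wrong and the correct modification mass (about 113.084 Da).
delta = WRONG_UNIMOD_92 - BIOTIN_UNIMOD_3

# Hypothetical precursor mass, used only to give the difference a scale.
example_precursor_mass = 1500.0

print(f"mass difference: {delta:.3f} Da")
print(f"relative to a {example_precursor_mass:.0f} Da precursor: "
      f"{ppm_error(delta, example_precursor_mass):.0f} ppm")
```

A shift of this size is orders of magnitude larger than a ppm-level mass tolerance, so a search configured with the wrong mass cannot match correctly labeled precursors by accident.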
Eight precursor-level quality scores available in Spectronaut exports are evaluated as candidate filters for removing false-positive peptide identifications. All are output per elution group (EG) or fragment group (FG) by Spectronaut and are available in standard report exports.
| Score | What it measures | Practical behaviour |
|---|---|---|
| FG.ShapeQualityScore (MS1) | Similarity between the observed MS1 extracted-ion chromatogram (XIC) shape and the expected elution profile | Range 0–1; higher = better peak shape in the precursor channel. Selected threshold: 0.45 |
| FG.ShapeQualityScore (MS2) | Same shape-similarity metric applied to the MS2 fragment XIC | Range 0–1; complements the MS1 shape score; slightly less discriminating for biotin-labeled data |
| FG.ShapeQualityScore | Combined (aggregate) fragment-group shape quality across both MS1 and MS2 channels | Range 0–1; summarises overall chromatographic fidelity |
| EG.NormalizedCscore | Spectronaut's confidence score normalised within a run to account for systematic score shifts | Higher = more confident elution group assignment; run-normalised, so suitable for cross-run comparison |
| EG.Cscore | Raw elution-group confidence score from Spectronaut's target-decoy scoring model | Higher = more confident; not normalised across runs |
| EG.SignalToNoise | Ratio of peak signal to local baseline noise in the elution window | Higher = cleaner signal; sensitive to background complexity and peak integration window quality |
| EG.Noise | Absolute noise level estimate in the elution window | Lower = cleaner signal; useful as a secondary filter but exhibits high instrument-to-instrument variability |
| EG.IntCorrScore | Pearson correlation between observed and predicted fragment intensity patterns within the elution group | Range −1 to 1; higher = better agreement with spectral library; rewards co-elution fidelity |
Each score is evaluated using ROC curve analysis, Youden's J-score for optimal threshold selection, and distributional divergence tests (Jensen–Shannon divergence, Kolmogorov–Smirnov, Wasserstein distance). FG.ShapeQualityScore (MS1) at a threshold of 0.45 consistently emerges as the most effective single filter for separating confidently identified biotin-labeled peptides from false positives.
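As a rough illustration of the threshold-selection step, the sketch below computes ROC points, picks the cut-off that maximises Youden's J (J = TPR − FPR), and reports a two-sample Kolmogorov–Smirnov statistic for the same data. It uses only the standard library and toy scores; the function names, the labels, and the convention that a score at or above the threshold counts as a positive call are assumptions made for this example, not the notebooks' actual code. SciPy, which is listed in requirements.txt, provides `scipy.stats.ks_2samp` and `scipy.stats.wasserstein_distance` for the distributional comparisons.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float, float]]:
    """Return (threshold, TPR, FPR) per distinct score; score >= threshold is a positive call."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    points = []
    for thr in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((thr, tp / n_pos, fp / n_neg))
    return points


def youden_threshold(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (threshold, J) for the threshold that maximises J = TPR - FPR."""
    thr, tpr, fpr = max(roc_points(scores, labels), key=lambda p: p[1] - p[2])
    return thr, tpr - fpr


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    for x in sorted(set(a_sorted) | set(b_sorted)):
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Toy data (hypothetical): 1 = confident identification, 0 = false positive.
scores = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80]
labels = [0, 0, 1, 0, 1, 1, 1, 1]

threshold, j_stat = youden_threshold(scores, labels)
positives = [s for s, y in zip(scores, labels) if y == 1]
negatives = [s for s, y in zip(scores, labels) if y == 0]
ks = ks_statistic(positives, negatives)

print(f"Youden-optimal threshold: {threshold:.2f} (J = {j_stat:.2f})")
print(f"KS statistic between classes: {ks:.2f}")
```

On a single score the two quantities are closely related: both measure the largest separation between the class-wise cumulative distributions, which is why a score with a high maximum J also shows a large KS statistic.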
Applying the selected threshold reduces peptide-level variability (coefficient of variation across replicates), preserves N-terminal labeling specificity, and improves protein-level detection consistency — confirming that the chosen score and threshold are both statistically and biologically motivated.
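The effect on replicate variability can be sketched as follows: keep only rows whose score passes the threshold, then compute the coefficient of variation per peptide across the remaining replicate intensities. The row layout (`peptide`, `score`, `intensity`), the toy values, and the inclusive `>=` comparison are assumptions for illustration; the actual Spectronaut column names and the notebooks' implementation may differ.

```python
from statistics import mean, stdev
from typing import Dict, List, Sequence

THRESHOLD = 0.45  # selected FG.ShapeQualityScore (MS1) cut-off quoted in the text


def coefficient_of_variation(values: Sequence[float]) -> float:
    """CV in percent: sample standard deviation divided by the mean."""
    return 100.0 * stdev(values) / mean(values)


def filter_and_cv(rows: List[dict], threshold: float = THRESHOLD) -> Dict[str, float]:
    """Keep rows with score >= threshold, then return the CV per peptide across replicates."""
    by_peptide: Dict[str, List[float]] = {}
    for row in rows:
        if row["score"] >= threshold:
            by_peptide.setdefault(row["peptide"], []).append(row["intensity"])
    # A CV needs at least two replicate measurements.
    return {p: coefficient_of_variation(v) for p, v in by_peptide.items() if len(v) >= 2}


# Toy replicate measurements (hypothetical values, one row per peptide per replicate).
rows = [
    {"peptide": "PEPTIDEA", "score": 0.90, "intensity": 100.0},
    {"peptide": "PEPTIDEA", "score": 0.80, "intensity": 110.0},
    {"peptide": "PEPTIDEA", "score": 0.70, "intensity": 90.0},
    {"peptide": "PEPTIDEB", "score": 0.50, "intensity": 100.0},
    {"peptide": "PEPTIDEB", "score": 0.30, "intensity": 300.0},  # removed by the filter
    {"peptide": "PEPTIDEB", "score": 0.60, "intensity": 40.0},
]

cv_unfiltered = filter_and_cv(rows, threshold=0.0)
cv_filtered = filter_and_cv(rows)

for peptide in cv_filtered:
    print(f"{peptide}: CV {cv_unfiltered[peptide]:.1f}% -> {cv_filtered[peptide]:.1f}%")
```

In this toy case the low-scoring measurement is also the outlying one, so removing it lowers the CV; real data need not behave this cleanly.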
FilterFragment/
├── HeLa/ # Thesis notebooks with frozen outputs
│ ├── ImpactOfFiltering.ipynb
│ ├── compare_wrongMass.ipynb
│ └── scoreComparison.ipynb
├── src/ # Helper modules used by the notebooks
│ ├── utils.py
│ └── plots.py
├── LICENSE
├── CITATION.cff
└── README.md

The HeLa/ directory contains the three core analysis notebooks from the thesis chapter. All notebooks are preserved with frozen outputs and should be interpreted as archival research artifacts.
| Notebook | Description |
|---|---|
| ImpactOfFiltering.ipynb | Evaluates the effect of FG.ShapeQualityScore (MS1) threshold filtering on peptide counts, protein detection, CV distributions across replicates, and biotin enrichment specificity (terminus vs. lysine vs. both) |
| compare_wrongMass.ipynb | Compares peptide identification depth and N-terminal annotation accuracy between the correct-mass (UniMod:3) and wrong-mass (UniMod:92) biotin search parameters, across DirectDIA, unfiltered library, and filtered library workflows |
| scoreComparison.ipynb | Benchmarks the eight Spectronaut quality scores using ROC curves, AUC, Youden's J-score, Jensen–Shannon divergence, KS test, and Wasserstein distance; identifies the optimal score and decision threshold |
All notebooks import helper functions from src/utils.py and src/plots.py.
Requires Python 3.8+.
pip install -r requirements.txt

requirements.txt pins lower-bound versions for all packages used across the notebooks and helper modules: numpy, pandas, matplotlib, seaborn, scipy, biopython, and jupyter.
Important: The notebooks depend on local data/ inputs from Spectronaut exports and FASTA reference files that are not distributed with this repository. Cells will not execute from a fresh clone without the corresponding data files.
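Before attempting to re-run any cells, it can help to check which inputs are actually present. The following is a minimal sketch of such a check; the file names under data/ are hypothetical placeholders, since the repository does not document the exact names the notebooks expect.

```python
from pathlib import Path
from typing import Iterable, List


def missing_inputs(repo_root: str, relative_paths: Iterable[str]) -> List[str]:
    """Return the relative paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in relative_paths if not (root / p).exists()]


# Hypothetical placeholders: substitute the paths referenced in the notebooks.
EXPECTED_INPUTS = [
    "data/spectronaut_report.tsv",
    "data/reference.fasta",
]

missing = missing_inputs(".", EXPECTED_INPUTS)
if missing:
    print("Missing input files; the notebooks can be read but not re-executed:")
    for path in missing:
        print(f"  {path}")
else:
    print("All listed input files were found.")
```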
- The notebooks in HeLa/ are preserved with frozen outputs as part of the thesis record and can be read in full without re-executing.
- Raw and processed Spectronaut data are not included in this repository.
- This repository is best understood as a documented analysis snapshot, not as a one-command reproducible pipeline.
- Helper utilities in src/ (FASTA parsing, CV calculation, figure export, distribution plotting) are general enough to be reused independently of the thesis data.
- The analysis is scoped to Spectronaut-specific quality scores and output column formats; other DIA software would require separate score evaluation.
- Threshold recommendations are validated on biotin-enriched HeLa samples measured on a specific instrument configuration. Transferability to other sample types, labeling protocols, or acquisition settings is not established here.
- The full notebook execution requires proprietary Spectronaut reports and raw FASTA reference files that are not distributed.
- This is not a peer-reviewed software artifact; it reflects the analytical decisions made at the time of thesis writing.
All content in this repository — including source code, notebooks, README prose, helper modules, and frozen outputs — is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.
See LICENSE for the full license text and terms.
If you draw on the analysis or conclusions from this work, please cite the doctoral thesis that this repository accompanies:
@phdthesis{ergin2024thesis,
author = {Ergin, Enes Kemal},
title = {{Computational interrogation of proteoform dynamics in pediatric cancer}},
school = {University of British Columbia},
year = {2024},
url = {https://open.library.ubc.ca/soa/cIRcle/collections/ubctheses/24/items/1.0448334}
}

If you specifically need to reference this repository as a code or notebook artifact:
@misc{ergin2024filterfragment,
author = {Ergin, Enes Kemal},
title = {{FilterFragment: Quality-Score Filtering Evaluation for Biotin-Labeled
Peptide Identification in DIA Proteomics}},
year = {2024},
howpublished = {\url{https://github.com/eneskemalergin/FilterFragment}},
note = {Archival thesis analysis snapshot. License: CC BY-NC 4.0}
}

- Thesis: Computational interrogation of proteoform dynamics in pediatric cancer
- ORCID: Enes Kemal Ergin
Through the gate, the true signal,
Static falls away.