FilterFragment

FilterFragment is an archival snapshot of the quality-score filtering evaluation used in Chapter 2 of the doctoral thesis by Enes Kemal Ergin.

The repository preserves the Jupyter notebooks used to benchmark Spectronaut quality scores, quantify the consequences of mass-label misassignment, and measure the impact of threshold-based filtering on peptide identification confidence in N-terminal biotin-labeled HeLa DIA proteomics experiments. It is shared as a citable research artifact for understanding the thesis analysis — not as a generalized proteomics tool or reusable software package.

What This Repository Is

A thesis-analysis snapshot of the quality-score filtering evaluation from Chapter 2.
A record of the statistical and visual analyses used to select and validate filtering thresholds for biotin-labeled DIA data.
A set of HeLa analysis notebooks with frozen outputs from the thesis.
A small library of reusable helper modules (src/) that supported the notebook workflows.

What This Repository Is Not

A packaged software library or command-line tool.
A general-purpose pipeline for arbitrary DIA quality filtering or peptide-level data cleaning.
A fully self-contained rerun from a fresh clone — raw and processed data files are not distributed with this repository.
A validated or generalized framework for non-Spectronaut or non-HeLa workflows.

Scientific Overview

In biotin-based N-terminomics, cell-surface or N-terminal peptides are biotin-labeled via NHS ester chemistry, selectively enriched, and identified by their biotin modification. This repository evaluates two sources of analytical error (sections 1 and 2 below) and then the downstream effect of the selected filter (section 3).

1. Mass-label accuracy

Spectronaut search results using the correct biotin mass (UniMod:3, +226.078 Da) are compared against results generated with an incorrect mass assignment (UniMod:92, +339.162 Da). The wrong-mass search substantially degrades N-terminal annotation accuracy — from approximately 65% to approximately 35% — directly motivating careful search-parameter selection and serving as a cautionary baseline for the filtering work.
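To make the size of the mislabeling concrete, the arithmetic below compares the two mass shifts quoted above. It is an illustrative sketch only: the helper name `mz_shift` is hypothetical and not part of src/, the masses are the rounded values from this section, and a single label per peptide is assumed.

```python
# Monoisotopic mass shifts quoted in this README (Da).
BIOTIN_UNIMOD_3 = 226.078   # correct label mass
WRONG_UNIMOD_92 = 339.162   # incorrectly assigned label mass


def mz_shift(delta_mass: float, charge: int) -> float:
    """Return the m/z offset caused by a mass error at a given charge state."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return delta_mass / charge


# Mass error introduced per label when the wrong modification is searched.
delta = WRONG_UNIMOD_92 - BIOTIN_UNIMOD_3

for z in (1, 2, 3):
    print(f"z={z}: expected precursor m/z is off by {mz_shift(delta, z):.3f}")
```

A per-label error of roughly 113 Da is far outside any realistic precursor mass tolerance, so the difference is not a subtle one.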

2. Quality-score benchmarking

Eight precursor-level quality scores available in Spectronaut exports are evaluated as candidate filters for removing false-positive peptide identifications. All are output per elution group (EG) or fragment group (FG) by Spectronaut and are available in standard report exports.

Score	What it measures	Practical behaviour
`FG.ShapeQualityScore (MS1)`	Similarity between the observed MS1 extracted-ion chromatogram (XIC) shape and the expected elution profile	Range 0–1; higher = better peak shape in the precursor channel. Selected threshold: 0.45
`FG.ShapeQualityScore (MS2)`	Same shape-similarity metric applied to the MS2 fragment XIC	Range 0–1; complements the MS1 shape score; slightly less discriminating for biotin-labeled data
`FG.ShapeQualityScore`	Combined (aggregate) fragment-group shape quality across both MS1 and MS2 channels	Range 0–1; summarizes overall chromatographic fidelity
`EG.NormalizedCscore`	Spectronaut's confidence score normalized within a run to account for systematic score shifts	Higher = more confident elution group assignment; run-normalized, so suitable for cross-run comparison
`EG.Cscore`	Raw elution-group confidence score from Spectronaut's target-decoy scoring model	Higher = more confident; not normalized across runs
`EG.SignalToNoise`	Ratio of peak signal to local baseline noise in the elution window	Higher = cleaner signal; sensitive to background complexity and peak integration window quality
`EG.Noise`	Absolute noise level estimate in the elution window	Lower = cleaner signal; useful as a secondary filter but exhibits high instrument-to-instrument variability
`EG.IntCorrScore`	Pearson correlation between observed and predicted fragment intensity patterns within the elution group	Range −1 to 1; higher = better agreement with spectral library; rewards co-elution fidelity

Each score is evaluated using ROC curve analysis, Youden's J statistic (J = sensitivity + specificity − 1) for optimal threshold selection, and distributional comparison measures (Jensen–Shannon divergence, the Kolmogorov–Smirnov test, and Wasserstein distance). FG.ShapeQualityScore (MS1) at a threshold of 0.45 consistently emerges as the most effective single filter for separating confidently identified biotin-labeled peptides from false positives.
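The threshold-selection step can be sketched as follows. This is a minimal illustration of Youden's J on toy data, not the code from scoreComparison.ipynb: the function name `youden_threshold`, the toy scores, and the ground-truth labels are all hypothetical, and the sketch uses only the standard library rather than the packages the notebooks rely on.

```python
from typing import Sequence, Tuple


def youden_threshold(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (threshold, J) maximizing J = TPR - FPR.

    A precursor is called positive when score >= threshold.
    labels: 1 = true identification, 0 = false positive.
    If no threshold gives J > 0, the threshold is +inf (nothing is called positive).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")

    best_threshold, best_j = float("inf"), 0.0
    # Sweep candidate thresholds from the highest observed score downwards.
    # This is O(n^2), which is fine for a sketch but not for large exports.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        j = tp / n_pos - fp / n_neg
        if j > best_j:
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Toy data only: hypothetical shape scores with made-up ground-truth labels.
toy_scores = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.90]
toy_labels = [0, 0, 0, 1, 0, 1, 1, 1]
threshold, j = youden_threshold(toy_scores, toy_labels)
print(f"threshold={threshold:.2f}, J={j:.2f}")
```

Ties are resolved in favor of the stricter (higher) threshold here; that is a design choice of the sketch, and the thesis analysis may resolve ties differently.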

3. Filtering impact

Applying the selected threshold reduces peptide-level variability (coefficient of variation across replicates), preserves N-terminal labeling specificity, and improves protein-level detection consistency — confirming that the chosen score and threshold are both statistically and biologically motivated.
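As a rough illustration of the two operations described above, the sketch below applies the 0.45 threshold and computes a per-precursor coefficient of variation across replicates. The data, the names `percent_cv` and `precursors`, and the choice of an inclusive (>=) cutoff are assumptions made for the example; the notebooks work on Spectronaut exports that are not distributed here.

```python
from statistics import mean, stdev
from typing import Dict, List

SHAPE_MS1_THRESHOLD = 0.45  # threshold reported in this README


def percent_cv(intensities: List[float]) -> float:
    """Coefficient of variation (sample SD / mean) as a percentage."""
    return 100.0 * stdev(intensities) / mean(intensities)


# Hypothetical precursors: MS1 shape score plus intensities in three replicates.
precursors: Dict[str, dict] = {
    "PEPTIDEA": {"shape_ms1": 0.82, "intensities": [100.0, 110.0, 90.0]},
    "PEPTIDEB": {"shape_ms1": 0.30, "intensities": [50.0, 150.0, 10.0]},
    "PEPTIDEC": {"shape_ms1": 0.45, "intensities": [200.0, 200.0, 200.0]},
}

# Keep precursors whose MS1 shape score meets the threshold (inclusive by assumption).
kept = {
    name: p for name, p in precursors.items()
    if p["shape_ms1"] >= SHAPE_MS1_THRESHOLD
}

for name, p in precursors.items():
    status = "kept" if name in kept else "removed"
    print(f"{name}: CV={percent_cv(p['intensities']):.1f}% ({status})")
```

In this toy set the removed precursor is also the one with the largest replicate-to-replicate spread, which mirrors the kind of effect the filtering analysis reports.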

Repository Structure

FilterFragment/
├── HeLa/                   # Thesis notebooks with frozen outputs
│   ├── ImpactOfFiltering.ipynb
│   ├── compare_wrongMass.ipynb
│   └── scoreComparison.ipynb
├── src/                    # Helper modules used by the notebooks
│   ├── utils.py
│   └── plots.py
├── LICENSE
├── CITATION.cff
├── README.md
└── requirements.txt

HeLa Analysis Notebooks

The HeLa/ directory contains the three core analysis notebooks from the thesis chapter. All notebooks are preserved with frozen outputs and should be interpreted as archival research artifacts.

Notebook	Description
`ImpactOfFiltering.ipynb`	Evaluates the effect of `FG.ShapeQualityScore (MS1)` threshold filtering on peptide counts, protein detection, CV distributions across replicates, and biotin enrichment specificity (terminus vs. lysine vs. both)
`compare_wrongMass.ipynb`	Compares peptide identification depth and N-terminal annotation accuracy between the correct-mass (UniMod:3) and wrong-mass (UniMod:92) biotin search parameters, across DirectDIA, unfiltered library, and filtered library workflows
`scoreComparison.ipynb`	Benchmarks the eight Spectronaut quality scores using ROC curves, AUC, Youden's J statistic, Jensen–Shannon divergence, the Kolmogorov–Smirnov test, and Wasserstein distance; identifies the optimal score and decision threshold
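For readers unfamiliar with the distributional measures listed for scoreComparison.ipynb, the sketch below shows what each one computes for two samples of a score bounded in [0, 1]. It is a from-scratch, standard-library illustration with hypothetical function names and toy data. In practice, SciPy provides `scipy.stats.ks_2samp`, `scipy.stats.wasserstein_distance`, and `scipy.spatial.distance.jensenshannon`; whether the notebooks call these exact functions is not stated in this README.

```python
import math
from bisect import bisect_right
from typing import List, Sequence


def _ecdf(sorted_sample: List[float], x: float) -> float:
    """Empirical CDF of a pre-sorted sample evaluated at x."""
    return bisect_right(sorted_sample, x) / len(sorted_sample)


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    sa, sb = sorted(a), sorted(b)
    return max(abs(_ecdf(sa, x) - _ecdf(sb, x)) for x in sa + sb)


def wasserstein_1d(a: Sequence[float], b: Sequence[float]) -> float:
    """1-D Wasserstein distance: area between the two empirical CDFs."""
    sa, sb = sorted(a), sorted(b)
    grid = sorted(set(sa + sb))
    return sum(
        abs(_ecdf(sa, lo) - _ecdf(sb, lo)) * (hi - lo)
        for lo, hi in zip(grid, grid[1:])
    )


def js_divergence(a: Sequence[float], b: Sequence[float], bins: int = 10) -> float:
    """Jensen-Shannon divergence (base 2, range 0-1) of histograms on [0, 1]."""

    def hist(sample: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for x in sample:
            # Clamp so that x == 1.0 (or slightly out-of-range values) stay in range.
            idx = max(0, min(int(x * bins), bins - 1))
            counts[idx] += 1
        return [c / len(sample) for c in counts]

    def kl(u: List[float], v: List[float]) -> float:
        return sum(ui * math.log2(ui / vi) for ui, vi in zip(u, v) if ui > 0)

    p, q = hist(a), hist(b)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy data only: two fully separated hypothetical score samples.
low = [0.1, 0.2, 0.3]
high = [0.7, 0.8, 0.9]
print(ks_statistic(low, high), wasserstein_1d(low, high), js_divergence(low, high))
```

All three measures are zero for identical samples and grow as the two score distributions separate, which is why they are useful for ranking candidate filter scores.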

All notebooks import helper functions from src/utils.py and src/plots.py.

Minimal Setup

Requires Python 3.8+.

pip install -r requirements.txt

requirements.txt specifies minimum (lower-bound) versions for all packages used across the notebooks and helper modules: numpy, pandas, matplotlib, seaborn, scipy, biopython, and jupyter.

Important: The notebooks depend on local data/ inputs from Spectronaut exports and FASTA reference files that are not distributed with this repository. Cells will not execute from a fresh clone without the corresponding data files.

Reproducibility Notes

The notebooks in HeLa/ are preserved with frozen outputs as part of the thesis record and can be read in full without re-executing.
Raw and processed Spectronaut data are not included in this repository.
This repository is best understood as a documented analysis snapshot, not as a one-command reproducible pipeline.
Helper utilities in src/ (FASTA parsing, CV calculation, figure export, distribution plotting) are general enough to be reused independently of the thesis data.

Limitations

The analysis is scoped to Spectronaut-specific quality scores and output column formats; other DIA software would require separate score evaluation.
Threshold recommendations are validated on biotin-enriched HeLa samples measured on a specific instrument configuration. Transferability to other sample types, labeling protocols, or acquisition settings is not established here.
The full notebook execution requires proprietary Spectronaut reports and raw FASTA reference files that are not distributed.
This is not a peer-reviewed software artifact; it reflects the analytical decisions made at the time of thesis writing.

License

All content in this repository — including source code, notebooks, README prose, helper modules, and frozen outputs — is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.

See LICENSE for the full license text and terms.

Citation

If you draw on the analysis or conclusions from this work, please cite the doctoral thesis that this repository accompanies:

@phdthesis{ergin2024thesis,
  author      = {Ergin, Enes Kemal},
  title       = {{Computational interrogation of proteoform dynamics in pediatric cancer}},
  school      = {University of British Columbia},
  year        = {2024},
  url         = {https://open.library.ubc.ca/soa/cIRcle/collections/ubctheses/24/items/1.0448334}
}

If you specifically need to reference this repository as a code or notebook artifact:

@misc{ergin2024filterfragment,
  author       = {Ergin, Enes Kemal},
  title        = {{FilterFragment: Quality-Score Filtering Evaluation for Biotin-Labeled
                   Peptide Identification in DIA Proteomics}},
  year         = {2024},
  howpublished = {\url{https://github.com/eneskemalergin/FilterFragment}},
  note         = {Archival thesis analysis snapshot. License: CC BY-NC 4.0}
}

References

Thesis: Computational interrogation of proteoform dynamics in pediatric cancer
ORCID: Enes Kemal Ergin

Fragile peaks aligned, —
Through the gate, the true signal,
Static falls away.
