
FilterFragment

License: CC BY-NC 4.0 Python

FilterFragment is an archival snapshot of the quality-score filtering evaluation used in Chapter 2 of the doctoral thesis by Enes Kemal Ergin.

The repository preserves the Jupyter notebooks used to benchmark Spectronaut quality scores, quantify the consequences of mass-label misassignment, and measure the impact of threshold-based filtering on peptide identification confidence in N-terminal biotin-labeled HeLa DIA proteomics experiments. It is shared as a citable research artifact for understanding the thesis analysis — not as a generalized proteomics tool or reusable software package.


What This Repository Is

  • A thesis-analysis snapshot of the quality-score filtering evaluation from Chapter 2.
  • A record of the statistical and visual analyses used to select and validate filtering thresholds for biotin-labeled DIA data.
  • A set of HeLa analysis notebooks with frozen outputs from the thesis.
  • A small library of reusable helper modules (src/) that supported the notebook workflows.

What This Repository Is Not

  • A packaged software library or command-line tool.
  • A general-purpose pipeline for arbitrary DIA quality filtering or peptide-level data cleaning.
  • A fully self-contained rerun from a fresh clone — raw and processed data files are not distributed with this repository.
  • A validated or generalized framework for non-Spectronaut or non-HeLa workflows.

Scientific Overview

In biotin-based N-terminomics, cell-surface or N-terminal peptides are selectively enriched via NHS ester chemistry and identified by their biotin modification. Two sources of analytical error are directly evaluated in this repository.

1. Mass-label accuracy

Spectronaut search results using the correct biotin mass (UniMod:3, +226.078 Da) are compared against results generated with an incorrect mass assignment (UniMod:92, +339.162 Da). The wrong-mass search substantially degrades N-terminal annotation accuracy — from approximately 65% to approximately 35% — directly motivating careful search-parameter selection and serving as a cautionary baseline for the filtering work.

2. Quality-score benchmarking

Eight precursor-level quality scores available in Spectronaut exports are evaluated as candidate filters for removing false-positive peptide identifications. All are output per elution group (EG) or fragment group (FG) by Spectronaut and are available in standard report exports.

| Score | What it measures | Practical behaviour |
| --- | --- | --- |
| FG.ShapeQualityScore (MS1) | Similarity between the observed MS1 extracted-ion chromatogram (XIC) shape and the expected elution profile | Range 0–1; higher = better peak shape in the precursor channel. Selected threshold: 0.45 |
| FG.ShapeQualityScore (MS2) | Same shape-similarity metric applied to the MS2 fragment XIC | Range 0–1; complements the MS1 shape score; slightly less discriminating for biotin-labeled data |
| FG.ShapeQualityScore | Combined (aggregate) fragment-group shape quality across both MS1 and MS2 channels | Range 0–1; summarises overall chromatographic fidelity |
| EG.NormalizedCscore | Spectronaut's confidence score normalised within a run to account for systematic score shifts | Higher = more confident elution group assignment; run-normalised, so suitable for cross-run comparison |
| EG.Cscore | Raw elution-group confidence score from Spectronaut's target-decoy scoring model | Higher = more confident; not normalised across runs |
| EG.SignalToNoise | Ratio of peak signal to local baseline noise in the elution window | Higher = cleaner signal; sensitive to background complexity and peak integration window quality |
| EG.Noise | Absolute noise level estimate in the elution window | Lower = cleaner signal; useful as a secondary filter but exhibits high instrument-to-instrument variability |
| EG.IntCorrScore | Pearson correlation between observed and predicted fragment intensity patterns within the elution group | Range −1 to 1; higher = better agreement with spectral library; rewards co-elution fidelity |

Each score is evaluated using ROC curve analysis, Youden's J-score for optimal threshold selection, and distributional divergence tests (Jensen–Shannon divergence, Kolmogorov–Smirnov, Wasserstein distance). FG.ShapeQualityScore (MS1) at a threshold of 0.45 consistently emerges as the most effective single filter for separating confidently identified biotin-labeled peptides from false positives.
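
For readers unfamiliar with the threshold-selection step, the following is a minimal sketch of ROC analysis with Youden's J (J = TPR − FPR), written with the standard library only. It is not the code from the thesis notebooks: the function names, the toy scores, and the binary labels (1 = accepted identification, 0 = false positive) are all hypothetical, and the convention that `score >= threshold` counts as a positive call is an assumption.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float, float]]:
    """Return (threshold, FPR, TPR) for every distinct score, highest threshold first.

    A precursor is called positive when score >= threshold (assumed convention).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    points = []
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((thr, fp / n_neg, tp / n_pos))
    return points


def youden_threshold(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (threshold, J) for the threshold that maximises J = TPR - FPR."""
    best_thr, best_j = float("nan"), float("-inf")
    for thr, fpr, tpr in roc_points(scores, labels):
        j = tpr - fpr
        if j > best_j:  # on ties, the highest threshold is kept
            best_thr, best_j = thr, j
    return best_thr, best_j


def auc(points: Sequence[Tuple[float, float, float]]) -> float:
    """Trapezoidal area under the ROC curve, anchored at (0, 0) and (1, 1)."""
    curve = [(0.0, 0.0)] + sorted((fpr, tpr) for _, fpr, tpr in points) + [(1.0, 1.0)]
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(curve, curve[1:]))


# Toy data (hypothetical): one quality score and one label per precursor.
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
toy_labels = [0, 0, 0, 0, 1, 1, 0, 1]
toy_threshold, toy_j = youden_threshold(toy_scores, toy_labels)
toy_auc = auc(roc_points(toy_scores, toy_labels))
```

On this toy data the maximum J falls at a threshold of 0.5. The value has no connection to the 0.45 threshold reported above, which comes from the thesis data.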

3. Filtering impact

Applying the selected threshold reduces peptide-level variability (coefficient of variation across replicates), preserves N-terminal labeling specificity, and improves protein-level detection consistency — confirming that the chosen score and threshold are both statistically and biologically motivated.
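
The effect on replicate variability can be illustrated with a short sketch that applies the 0.45 threshold and compares per-peptide coefficients of variation before and after filtering. This is an illustration under stated assumptions, not the notebook code: the record layout (`peptide`, `ms1_shape_score`, `intensities`), the use of one score per peptide, the `>=` comparison, and the sample standard deviation are all assumptions, and the numbers are invented.

```python
from statistics import mean, median, stdev
from typing import Dict, List

MS1_SHAPE_THRESHOLD = 0.45  # threshold reported in the text


def coefficient_of_variation(values: List[float]) -> float:
    """Percent CV = 100 * sample standard deviation / mean (NaN if undefined)."""
    if len(values) < 2:
        return float("nan")
    m = mean(values)
    if m == 0:
        return float("nan")
    return 100.0 * stdev(values) / m


def filter_by_score(rows: List[Dict], threshold: float = MS1_SHAPE_THRESHOLD) -> List[Dict]:
    """Keep rows whose MS1 shape score meets the threshold (>= is an assumption)."""
    return [row for row in rows if row["ms1_shape_score"] >= threshold]


def median_cv(rows: List[Dict]) -> float:
    """Median percent CV across peptides, computed from replicate intensities."""
    return median(coefficient_of_variation(row["intensities"]) for row in rows)


# Invented example records: one row per peptide, three replicate intensities each.
example_rows = [
    {"peptide": "PEPTIDEA", "ms1_shape_score": 0.90, "intensities": [100.0, 110.0, 90.0]},
    {"peptide": "PEPTIDEB", "ms1_shape_score": 0.60, "intensities": [200.0, 220.0, 180.0]},
    {"peptide": "PEPTIDEC", "ms1_shape_score": 0.30, "intensities": [50.0, 150.0, 100.0]},
]

kept_rows = filter_by_score(example_rows)
cv_before = median_cv(example_rows)
cv_after = median_cv(kept_rows)
```

In this invented example the low-scoring peptide is also the most variable one, so removing it cannot raise the median CV. Whether real data behave this way is an empirical question that the ImpactOfFiltering.ipynb notebook addresses.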


Repository Structure

FilterFragment/
├── HeLa/                   # Thesis notebooks with frozen outputs
│   ├── ImpactOfFiltering.ipynb
│   ├── compare_wrongMass.ipynb
│   └── scoreComparison.ipynb
├── src/                    # Helper modules used by the notebooks
│   ├── utils.py
│   └── plots.py
├── LICENSE
├── CITATION.cff
└── README.md

HeLa Analysis Notebooks

The HeLa/ directory contains the three core analysis notebooks from the thesis chapter. All notebooks are preserved with frozen outputs and should be interpreted as archival research artifacts.

| Notebook | Description |
| --- | --- |
| ImpactOfFiltering.ipynb | Evaluates the effect of FG.ShapeQualityScore (MS1) threshold filtering on peptide counts, protein detection, CV distributions across replicates, and biotin enrichment specificity (terminus vs. lysine vs. both) |
| compare_wrongMass.ipynb | Compares peptide identification depth and N-terminal annotation accuracy between the correct-mass (UniMod:3) and wrong-mass (UniMod:92) biotin search parameters, across DirectDIA, unfiltered library, and filtered library workflows |
| scoreComparison.ipynb | Benchmarks the eight Spectronaut quality scores using ROC curves, AUC, Youden's J-score, Jensen–Shannon divergence, KS test, and Wasserstein distance; identifies the optimal score and decision threshold |

All notebooks import helper functions from src/utils.py and src/plots.py.
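
As a reading aid for the distributional comparisons listed for scoreComparison.ipynb, here is a standard-library sketch of the three measures applied to two sets of scores. It does not reproduce the helpers in src/: all function names are hypothetical, the Jensen–Shannon divergence is computed on histograms with an assumed 10 equal-width bins over [0, 1] and log base 2, and the KS function returns only the test statistic, not a p-value.

```python
import math
from bisect import bisect_right
from typing import Iterator, List, Sequence, Tuple


def _ecdf_gaps(a: Sequence[float], b: Sequence[float]) -> Iterator[Tuple[float, float]]:
    """Yield (x, |F_a(x) - F_b(x)|) at each pooled sample value, in ascending order."""
    sa, sb = sorted(a), sorted(b)
    for x in sorted(set(sa + sb)):
        yield x, abs(bisect_right(sa, x) / len(sa) - bisect_right(sb, x) / len(sb))


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the two ECDFs."""
    return max(gap for _, gap in _ecdf_gaps(a, b))


def wasserstein_1d(a: Sequence[float], b: Sequence[float]) -> float:
    """1-D Wasserstein-1 distance: area between the two ECDFs."""
    pts = list(_ecdf_gaps(a, b))
    return sum(gap * (x_next - x) for (x, gap), (x_next, _) in zip(pts, pts[1:]))


def _histogram(values: Sequence[float], bins: int, lo: float, hi: float) -> List[float]:
    """Relative frequencies in equal-width bins; out-of-range values go to the edge bins."""
    counts = [0] * bins
    for v in values:
        idx = int((v - lo) / (hi - lo) * bins)
        counts[max(0, min(idx, bins - 1))] += 1
    return [c / len(values) for c in counts]


def js_divergence(a: Sequence[float], b: Sequence[float],
                  bins: int = 10, lo: float = 0.0, hi: float = 1.0) -> float:
    """Jensen-Shannon divergence (base 2, so between 0 and 1) of two binned samples."""
    p, q = _histogram(a, bins, lo, hi), _histogram(b, bins, lo, hi)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(u: List[float], v: List[float]) -> float:
        # Terms with u_i == 0 contribute nothing; v_i > 0 whenever u_i > 0.
        return sum(ui * math.log2(ui / vi) for ui, vi in zip(u, v) if ui > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Invented, fully separated score sets for illustration.
low_scores = [0.1, 0.2, 0.3]
high_scores = [0.7, 0.8, 0.9]
separation = (
    ks_statistic(low_scores, high_scores),
    wasserstein_1d(low_scores, high_scores),
    js_divergence(low_scores, high_scores),
)
```

Because the two invented sets do not overlap, the KS statistic and the base-2 Jensen–Shannon divergence both reach their maximum of 1, while the Wasserstein distance reports how far apart the sets sit on the score axis. Note that `scipy.spatial.distance.jensenshannon` returns the square root of the divergence, so values from that function are not directly comparable with this sketch.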


Minimal Setup

Requires Python 3.8+.

pip install -r requirements.txt

requirements.txt sets minimum (lower-bound) versions for all packages used across the notebooks and helper modules: numpy, pandas, matplotlib, seaborn, scipy, biopython, and jupyter.

Important: The notebooks depend on local data/ inputs from Spectronaut exports and FASTA reference files that are not distributed with this repository. Cells will not execute from a fresh clone without the corresponding data files.


Reproducibility Notes

  • The notebooks in HeLa/ are preserved with frozen outputs as part of the thesis record and can be read in full without re-executing.
  • Raw and processed Spectronaut data are not included in this repository.
  • This repository is best understood as a documented analysis snapshot, not as a one-command reproducible pipeline.
  • Helper utilities in src/ (FASTA parsing, CV calculation, figure export, distribution plotting) are general enough to be reused independently of the thesis data.

Limitations

  • The analysis is scoped to Spectronaut-specific quality scores and output column formats; other DIA software would require separate score evaluation.
  • Threshold recommendations are validated on biotin-enriched HeLa samples measured on a specific instrument configuration. Transferability to other sample types, labeling protocols, or acquisition settings is not established here.
  • Executing the notebooks in full requires proprietary Spectronaut reports and raw FASTA reference files, neither of which is distributed.
  • This is not a peer-reviewed software artifact; it reflects the analytical decisions made at the time of thesis writing.

License

All content in this repository — including source code, notebooks, README prose, helper modules, and frozen outputs — is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.

See LICENSE for the full license text and terms.


Citation

If you draw on the analysis or conclusions from this work, please cite the doctoral thesis that this repository accompanies:

@phdthesis{ergin2024thesis,
  author      = {Ergin, Enes Kemal},
  title       = {{Computational interrogation of proteoform dynamics in pediatric cancer}},
  school      = {University of British Columbia},
  year        = {2024},
  url         = {https://open.library.ubc.ca/soa/cIRcle/collections/ubctheses/24/items/1.0448334}
}

If you specifically need to reference this repository as a code or notebook artifact:

@misc{ergin2024filterfragment,
  author       = {Ergin, Enes Kemal},
  title        = {{FilterFragment: Quality-Score Filtering Evaluation for Biotin-Labeled
                   Peptide Identification in DIA Proteomics}},
  year         = {2024},
  howpublished = {\url{https://github.com/eneskemalergin/FilterFragment}},
  note         = {Archival thesis analysis snapshot. License: CC BY-NC 4.0}
}


Fragile peaks aligned, —
Through the gate, the true signal,
Static falls away.
