FilterLabel is an archival snapshot of the filtering workflow used for part of Chapter 2 of the doctoral thesis by Enes Kemal Ergin.
The repository preserves the Python and R scripts used to validate N-terminal biotinylation labels in spectral libraries, along with supporting notebooks and example inputs. It is shared as a citable research artifact for understanding and lightly verifying the thesis workflow, not as a generalized proteomics package.
This repository is:

- A thesis-analysis snapshot centered on N-terminal biotinylation filtering.
- A record of the Python and R implementations used around that workflow.
- A small set of example inputs for sanity-checking the scripts.
- A set of HeLa notebooks with frozen outputs from the thesis analysis.

This repository is not:

- A packaged software library.
- A general-purpose framework for arbitrary modification-label processing.
- A fully self-contained rerun of all thesis notebooks from a fresh clone.
The filtering algorithm validates N-terminal biotin labels by cross-referencing fragment ion annotations with lysine positions and their modification status in each peptide:
- Select N-terminally labeled peptides by keeping peptides whose modified sequence begins with a UniMod modification such as `(UniMod:3)`.
- Locate lysine residues in the stripped peptide sequence.
- Classify lysines as modified or unmodified based on the modification annotations.
- Validate fragment evidence:
  - b-ions: lysines before the fragment boundary must be consistent with labeled positions.
  - y-ions: unmodified lysines must remain within the covered y-ion region.
  - No-K peptides pass immediately once they are N-terminally labeled.
- Spectronaut-specific check: measured intensity must be greater than predicted intensity divided by 10.
Precursors passing these checks are retained; the rest are removed from the library.
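The first three steps and the Spectronaut intensity threshold can be sketched in a few lines of Python. This is an illustrative sketch, not the code in `python/filter.py`: the function names are hypothetical, and it assumes modifications are written as `(UniMod:<id>)` tags, with a leading tag marking the N-terminus and a tag after a residue modifying that residue. The b-ion and y-ion rules are left out because they depend on fragment annotation details not described here.

```python
import re

# Assumption: modifications appear as "(UniMod:<id>)". A tag at the very start
# marks the N-terminus; a tag directly after a residue modifies that residue.
MOD_TAG = re.compile(r"\(UniMod:\d+\)")


def is_nterm_labeled(modified_sequence):
    """Return True when the modified sequence begins with a UniMod tag."""
    return MOD_TAG.match(modified_sequence) is not None


def strip_modifications(modified_sequence):
    """Remove all UniMod tags, leaving the plain amino-acid sequence."""
    return MOD_TAG.sub("", modified_sequence)


def classify_lysines(modified_sequence):
    """Return (modified, unmodified) lysine positions, 1-based in the stripped sequence."""
    modified, unmodified = [], []
    index = 0
    nterm_tag = MOD_TAG.match(modified_sequence)
    if nterm_tag:
        index = nterm_tag.end()  # skip the N-terminal label itself
    position = 0
    while index < len(modified_sequence):
        residue = modified_sequence[index]
        position += 1
        index += 1
        # A tag immediately after the residue is treated as its modification.
        tag = MOD_TAG.match(modified_sequence, index)
        if tag:
            index = tag.end()
        if residue == "K":
            (modified if tag else unmodified).append(position)
    return modified, unmodified


def passes_intensity_check(measured, predicted):
    """Spectronaut-specific rule: measured must exceed predicted / 10."""
    return measured > predicted / 10


example = "(UniMod:3)AK(UniMod:3)PEPK"  # hypothetical peptide for illustration
print(is_nterm_labeled(example))     # True
print(strip_modifications(example))  # AKPEPK
print(classify_lysines(example))     # ([2], [6])
```

A peptide with no lysines returns two empty lists from `classify_lysines`, which corresponds to the no-K case that passes once the N-terminal label is confirmed.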
- The current workflow is built around N-terminal biotinylation as represented in the thesis data.
- The implementation currently assumes the label appears as a UniMod annotation at the peptide N-terminus.
- Supported library sources are Spectronaut and MSFragger exports in tabular form.
- The scripts are preserved to document the thesis workflow, so portability and generalization were not the primary design goals.
| Source | Description |
|---|---|
| Spectronaut | Spectronaut report exports (.tsv) |
| MsFragger | MSFragger spectral library files (.tsv) |
FilterLabel/
├── python/ # Python CLI implementation of the filter
│ ├── filter.py
│ └── README.md
├── r/ # R CLI implementation of the filter
│ ├── filter.R
│ └── README.md
├── src/ # Helper modules used by the notebooks
│ ├── utils.py
│ └── plots.py
├── example/ # Small example input files for verification
│ ├── MSFragger_input.tsv
│ └── Spectronaut_input.tsv
├── HeLa/ # Thesis notebooks with frozen outputs
│ ├── DDA.ipynb
│ ├── DIA_directDIA.ipynb
│ ├── DIA_filteredLibrary.ipynb
│ ├── DIA_unfilteredLibrary.ipynb
│ └── Comparison.ipynb
├── LICENSE
├── LICENSE-CODE
├── LICENSE-CONTENT
├── CITATION.cff
└── README.md

Requires Python 3.8+.

    pip install numpy pandas

For the notebook helper modules and notebooks:

    pip install matplotlib seaborn biopython jupyter

Requires R 4.0+.

    install.packages(c("dplyr", "readr", "readxl", "writexl", "stringr", "optparse"))

Python:

    python python/filter.py MSFragger_input.tsv MsFragger example/ --verbose
    python python/filter.py Spectronaut_input.tsv Spectronaut example/ --verbose

R:

    Rscript r/filter.R -f MSFragger_input.tsv -s MsFragger -m example/ -v
    Rscript r/filter.R -f Spectronaut_input.tsv -s Spectronaut -m example/ -v

See python/README.md and r/README.md for full argument details.
- The `example/` directory is the simplest way to sanity-check the scripts.
- The notebooks in `HeLa/` are preserved with frozen outputs as part of the thesis record.
- The full notebook workflows expect local `data/raw` and `data/processed` inputs that are not included in this repository.
- As a result, the repository is best understood as a documented analysis snapshot with lightweight verification paths, not as a one-command full rerun of the entire thesis chapter.
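Before opening the notebooks, a quick pre-flight check can confirm whether the expected input directories are present. The sketch below assumes `data/raw` and `data/processed` are resolved relative to the repository root, which is not stated explicitly here; the function name is hypothetical.

```python
from pathlib import Path

# Assumption: both directories live directly under the repository root.
EXPECTED_DIRS = ("data/raw", "data/processed")


def missing_notebook_inputs(repo_root="."):
    """Return the expected input directories that do not exist under repo_root."""
    root = Path(repo_root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


missing = missing_notebook_inputs()
if missing:
    print("Missing notebook inputs:", ", ".join(missing))
else:
    print("Notebook input directories found.")
```

If either directory is reported missing, the notebooks can still be read with their frozen outputs, but their cells should not be expected to rerun.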
The HeLa/ directory contains notebooks associated with the thesis analysis of biotin-labeled HeLa cell lysates.
| Notebook | Description |
|---|---|
| `DDA.ipynb` | Data-dependent acquisition analysis |
| `DIA_directDIA.ipynb` | DIA analysis using direct-DIA search |
| `DIA_unfilteredLibrary.ipynb` | DIA analysis using an unfiltered spectral library |
| `DIA_filteredLibrary.ipynb` | DIA analysis using a FilterLabel-processed spectral library |
| `Comparison.ipynb` | Cross-comparison of the four acquisition or library settings |
These notebooks import helper functions from src/utils.py and src/plots.py and should be interpreted as archival research artifacts unless otherwise noted.
- The workflow is intentionally narrow and reflects the analysis conditions used in the thesis.
- The current implementation assumes N-terminal UniMod-based labeling conventions.
- The repository does not currently ship all raw or intermediate notebook data required for full notebook reruns.
- Python and R implementations are both preserved, but their exact parity should be verified explicitly when using them for future work.
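One lightweight way to check parity is to run both implementations on the same input and compare the filtered tables row by row. The sketch below uses only the standard library and makes several assumptions: both outputs are tab-separated files with identical column names, the output paths are supplied by the user, and values are compared as text, so differences in numeric formatting between Python and R would show up as mismatches. The function names are hypothetical.

```python
import csv


def read_rows(path):
    """Read a TSV file into a set of rows, ignoring row order and duplicates."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {tuple(sorted(row.items())) for row in reader}


def compare_outputs(python_output, r_output):
    """Count rows unique to each output and rows shared by both."""
    py_rows = read_rows(python_output)
    r_rows = read_rows(r_output)
    return {
        "only_python": len(py_rows - r_rows),
        "only_r": len(r_rows - py_rows),
        "shared": len(py_rows & r_rows),
    }
```

Zero counts for `only_python` and `only_r` indicate that the two outputs contain the same set of rows under these assumptions; nonzero counts point to rows worth inspecting by hand.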
This repository uses a split-license model:
- Source code in `python/`, `r/`, and `src/` is licensed under the MIT License. See LICENSE-CODE.
- Original non-code content in this repository, including README prose, issue-draft text, and thesis-oriented notebooks or frozen outputs authored for this repository, is licensed under Creative Commons Attribution-NonCommercial 4.0 International. See LICENSE-CONTENT.
- Any third-party names, software exports, or upstream materials remain subject to their original terms where applicable.
See LICENSE for the repository-level summary.
If you use this repository or reuse the archived workflow in your work, please cite it:
@software{ergin2024filterlabel,
author = {Ergin, Enes K.},
title = {{FilterLabel: Validation of N-Terminal Biotinylation Labels in Spectral Libraries}},
year = {2024},
url = {https://github.com/eneskemalergin/FilterLabel}
}

- Thesis: Computational interrogation of proteoform dynamics in pediatric cancer
- Related thesis using FilterLabel: Exploring cell surface-associated proteolytic proteoforms in acute lymphoblastic leukemia
- ORCID: Enes Kemal Ergin
Fragment ions tell the truth,
Only labeled stay.