Commit ec82cf5

docs: updated readme with new preprocessing step and incremented mid-tier versioning (1.1.0)
1 parent 4001dcc commit ec82cf5

2 files changed

Lines changed: 89 additions & 3 deletions

README.md

Lines changed: 87 additions & 2 deletions
@@ -49,7 +49,7 @@ The repository profile is the source of truth for reproducing experiments end-to
 ## End-to-End Pipeline

 ```text
-Stage 1 preprocess metadata + zscores
+Stage 1 normalize dataset inputs (PV1/CWP/BKP) to a shared training contract
 Stage 2 generate ESM-2 per-residue embeddings
 Stage 3 build residue-level label shards
 Stage 4 train FFNN (seeded or ensemble-kfold, DDP-aware)
@@ -60,7 +60,91 @@ Stage 7 evaluate residue metrics (+ optional Cocci peptide compare)

 ## Stage Reference

-### Stage 1: Preprocess
+### Stage 1: Multi-Dataset Prepare (PV1/CWP/BKP)
+
+**CLI:** `pepseqpred-prepare-dataset` (`src/pepseqpred/apps/prepare_dataset_cli.py`)
+
+This stage is the recommended entrypoint when training on one or more of:
+
+- PV1 (human virome)
+- CWP/Cocci (fungal)
+- BKP (bacterial)
+
+It normalizes source-specific metadata and FASTA headers into a shared PV1-compatible contract so downstream embedding, label generation, and training CLIs can be reused unchanged.
+
+**Core module**
+
+- `src/pepseqpred/core/preprocess/preparedataset.py`
+
+**Required output contract per dataset**
+
+- `prepared_targets.fasta`
+- `prepared_labels_metadata.tsv`
+- `prepared_embedding_metadata.tsv`
+- `prepare_summary.json`
+
+**PV1 inputs and command**
+
+- metadata TSV
+- z-score TSV
+- protein FASTA
+
+```bash
+pepseqpred-prepare-dataset \
+localdata/PV1/PV1_meta_2020-11-23_cleaned.tsv \
+localdata/PV1/prepared \
+--dataset-kind pv1 \
+--protein-fasta localdata/PV1/PV1_targets.fasta \
+--z-file localdata/PV1/PV1_zscores.tsv
+```
+
+**CWP/Cocci inputs and command**
+
+- metadata TSV
+- protein FASTA
+- reactive code list TSV
+- non-reactive code list TSV
+
+```bash
+pepseqpred-prepare-dataset \
+localdata/Cocci/CWP_metadata.tsv \
+localdata/Cocci/prepared \
+--dataset-kind cwp \
+--protein-fasta localdata/Cocci/CWP_targets.faa \
+--reactive-codes localdata/Cocci/CWP_reactive_Z20N4.tsv \
+--nonreactive-codes localdata/Cocci/CWP_nonreactive_Z20N4.tsv
+```
+
+**BKP inputs and command**
+
+- metadata TSV
+- protein FASTA
+- reactive code list TSV
+- non-reactive code list TSV
+
+```bash
+pepseqpred-prepare-dataset \
+localdata/BKP/BKP_metadata.tsv \
+localdata/BKP/prepared \
+--dataset-kind bkp \
+--protein-fasta localdata/BKP/BKP.faa \
+--reactive-codes localdata/BKP/BKP_reactive_Z20N4.tsv \
+--nonreactive-codes localdata/BKP/BKP_nonreactive_Z20N4.tsv
+```
+
+**Dataset-specific grouping used for leakage-aware splitting (`--split-type id-family`)**
+
+- PV1: family from PV1 `OXX`
+- CWP/Cocci: `Cluster50ID` mapped to deterministic numeric IDs
+- BKP: `reClusterID_70` mapped to deterministic numeric IDs
+
+**Next stages after prepare**
+
+- run `pepseqpred-esm` with `--embedding-key-mode id-family` and each dataset's `prepared_embedding_metadata.tsv`
+- run `pepseqpred-labels` with `--embedding-key-delim -`
+- train with `--split-type id-family`
+
+### Stage 1 (Legacy): PV1 Z-Score Preprocess

 **CLI:** `pepseqpred-preprocess` (`src/pepseqpred/apps/preprocess_cli.py`)
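
The grouping list in the new Stage 1 section says that `Cluster50ID` and `reClusterID_70` values are "mapped to deterministic numeric IDs", but the diff does not show how. The sketch below is one plausible way to get such a mapping, not the repository's implementation: it sorts the unique cluster labels and enumerates them, so the result does not depend on row order. The function name `assign_family_ids` and the sample labels are hypothetical.

```python
from typing import Dict, Iterable


def assign_family_ids(cluster_labels: Iterable[str], start: int = 1) -> Dict[str, int]:
    """Map each distinct cluster label to a stable integer ID.

    Sorting the unique labels first makes the mapping independent of
    input row order, so repeated runs on the same data agree.
    """
    unique_sorted = sorted(set(cluster_labels))
    return {label: idx for idx, label in enumerate(unique_sorted, start=start)}


# Hypothetical labels, standing in for values from a Cluster50ID column.
labels = ["clu_B", "clu_A", "clu_B", "clu_C"]
family_ids = assign_family_ids(labels)
print(family_ids)  # {'clu_A': 1, 'clu_B': 2, 'clu_C': 3}
```

Because every peptide from the same cluster receives the same ID, a split keyed on that ID keeps related sequences on one side of the train/validation boundary, which is the point of leakage-aware splitting.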

@@ -337,6 +421,7 @@ Bundled pretrained registry currently includes:

 | CLI | File | Purpose |
 | --- | --- | --- |
+| `pepseqpred-prepare-dataset` | `apps/prepare_dataset_cli.py` | normalize PV1/CWP/BKP into shared training contract |
 | `pepseqpred-preprocess` | `apps/preprocess_cli.py` | metadata + z-score preprocessing |
 | `pepseqpred-esm` | `apps/esm_cli.py` | ESM-2 embedding generation |
 | `pepseqpred-labels` | `apps/labels_cli.py` | residue label shard generation |
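
The README changes above list four files that every prepared dataset directory is expected to contain. Before moving on to the embedding and label stages, it may be worth confirming that each directory satisfies that contract. The following is a minimal sketch of such a check; the helper `missing_contract_files` is hypothetical and not part of the package, while the file names and directory paths are taken from the README diff.

```python
from pathlib import Path
from typing import List

# File names from the "Required output contract per dataset" list.
REQUIRED_OUTPUTS = (
    "prepared_targets.fasta",
    "prepared_labels_metadata.tsv",
    "prepared_embedding_metadata.tsv",
    "prepare_summary.json",
)


def missing_contract_files(prepared_dir: str) -> List[str]:
    """Return the contract files that are absent from prepared_dir."""
    base = Path(prepared_dir)
    return [name for name in REQUIRED_OUTPUTS if not (base / name).is_file()]


if __name__ == "__main__":
    # Output directories used in the README example commands.
    for directory in (
        "localdata/PV1/prepared",
        "localdata/Cocci/prepared",
        "localdata/BKP/prepared",
    ):
        missing = missing_contract_files(directory)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{directory}: {status}")
```

This only checks that the files exist. It says nothing about their contents, so it complements rather than replaces whatever validation `pepseqpred-prepare-dataset` performs itself.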

pyproject.toml

Lines changed: 2 additions & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

 [project]
 name = "pepseqpred"
-version = "1.0.0"
+version = "1.1.0"
 description = "Residue-level epitope prediction pipeline for peptide/protein workflows."
 readme = "README.pypi.md"
 requires-python = ">=3.12"
@@ -72,6 +72,7 @@ pepseqpred-esm = "pepseqpred.apps.esm_cli:main"
 pepseqpred-labels = "pepseqpred.apps.labels_cli:main"
 pepseqpred-predict = "pepseqpred.apps.prediction_cli:main"
 pepseqpred-preprocess = "pepseqpred.apps.preprocess_cli:main"
+pepseqpred-prepare-dataset = "pepseqpred.apps.prepare_dataset_cli:main"
 pepseqpred-eval-ffnn = "pepseqpred.apps.evaluate_ffnn_cli:main"
 pepseqpred-train-ffnn = "pepseqpred.apps.train_ffnn_cli:main"
 pepseqpred-train-ffnn-optuna = "pepseqpred.apps.train_ffnn_optuna_cli:main"
