@@ -49,7 +49,7 @@ The repository profile is the source of truth for reproducing experiments end-to
 ## End-to-End Pipeline

 ```text
-Stage 1 preprocess metadata + zscores
+Stage 1 normalize dataset inputs (PV1/CWP/BKP) to a shared training contract
 Stage 2 generate ESM-2 per-residue embeddings
 Stage 3 build residue-level label shards
 Stage 4 train FFNN (seeded or ensemble-kfold, DDP-aware)
@@ -60,7 +60,91 @@ Stage 7 evaluate residue metrics (+ optional Cocci peptide compare)

 ## Stage Reference

-### Stage 1: Preprocess
+### Stage 1: Multi-Dataset Prepare (PV1/CWP/BKP)
+
+**CLI:** `pepseqpred-prepare-dataset` (`src/pepseqpred/apps/prepare_dataset_cli.py`)
+
+This stage is the recommended entrypoint when training on one or more of:
+
+- PV1 (human virome)
+- CWP/Cocci (fungal)
+- BKP (bacterial)
+
+It normalizes source-specific metadata and FASTA headers into a shared PV1-compatible contract so downstream embedding, label generation, and training CLIs can be reused unchanged.
+
+**Core module**
+
+- `src/pepseqpred/core/preprocess/preparedataset.py`
+
+**Required output contract per dataset**
+
+- `prepared_targets.fasta`
+- `prepared_labels_metadata.tsv`
+- `prepared_embedding_metadata.tsv`
+- `prepare_summary.json`
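
As a quick sanity check on the contract above, a prepared directory can be tested for the four required files. This is a minimal sketch and not part of the repository: the helper name `missing_contract_files` is hypothetical, and it assumes all four files are written at the top level of the output directory passed to `pepseqpred-prepare-dataset`.

```python
from pathlib import Path

# File names taken from the output contract listed above.
REQUIRED_OUTPUTS = (
    "prepared_targets.fasta",
    "prepared_labels_metadata.tsv",
    "prepared_embedding_metadata.tsv",
    "prepare_summary.json",
)


def missing_contract_files(prepared_dir):
    """Return the required output files that are absent from prepared_dir.

    Hypothetical helper; assumes the files sit at the top level of the directory.
    """
    root = Path(prepared_dir)
    return [name for name in REQUIRED_OUTPUTS if not (root / name).is_file()]


if __name__ == "__main__":
    # Output directories as used in the example commands of this section.
    for dataset_dir in (
        "localdata/PV1/prepared",
        "localdata/Cocci/prepared",
        "localdata/BKP/prepared",
    ):
        missing = missing_contract_files(dataset_dir)
        print(dataset_dir, "OK" if not missing else f"missing: {missing}")
```

Running a check like this before Stage 2 surfaces an incomplete prepare run early, rather than partway through embedding generation.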
+
+**PV1 inputs and command**
+
+- metadata TSV
+- z-score TSV
+- protein FASTA
+
+```bash
+pepseqpred-prepare-dataset \
+  localdata/PV1/PV1_meta_2020-11-23_cleaned.tsv \
+  localdata/PV1/prepared \
+  --dataset-kind pv1 \
+  --protein-fasta localdata/PV1/PV1_targets.fasta \
+  --z-file localdata/PV1/PV1_zscores.tsv
+```
+
+**CWP/Cocci inputs and command**
+
+- metadata TSV
+- protein FASTA
+- reactive code list TSV
+- non-reactive code list TSV
+
+```bash
+pepseqpred-prepare-dataset \
+  localdata/Cocci/CWP_metadata.tsv \
+  localdata/Cocci/prepared \
+  --dataset-kind cwp \
+  --protein-fasta localdata/Cocci/CWP_targets.faa \
+  --reactive-codes localdata/Cocci/CWP_reactive_Z20N4.tsv \
+  --nonreactive-codes localdata/Cocci/CWP_nonreactive_Z20N4.tsv
+```
+
+**BKP inputs and command**
+
+- metadata TSV
+- protein FASTA
+- reactive code list TSV
+- non-reactive code list TSV
+
+```bash
+pepseqpred-prepare-dataset \
+  localdata/BKP/BKP_metadata.tsv \
+  localdata/BKP/prepared \
+  --dataset-kind bkp \
+  --protein-fasta localdata/BKP/BKP.faa \
+  --reactive-codes localdata/BKP/BKP_reactive_Z20N4.tsv \
+  --nonreactive-codes localdata/BKP/BKP_nonreactive_Z20N4.tsv
+```
+
+**Dataset-specific grouping used for leakage-aware splitting (`--split-type id-family`)**
+
+- PV1: family from PV1 `OXX`
+- CWP/Cocci: `Cluster50ID` mapped to deterministic numeric IDs
+- BKP: `reClusterID_70` mapped to deterministic numeric IDs
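
One way to read "mapped to deterministic numeric IDs" is sketched below. This is an illustration under stated assumptions, not the scheme implemented in `preparedataset.py`: the function name `assign_family_ids`, the sort-then-enumerate rule, and the example cluster labels are all hypothetical. The point is only that numbering the sorted unique labels gives the same ID for the same cluster regardless of row order.

```python
def assign_family_ids(cluster_labels, start=1):
    """Map cluster labels to integer IDs that depend only on the set of labels.

    Assumption for illustration: unique labels are sorted before numbering,
    so the mapping does not change when the input rows are reordered.
    """
    unique_sorted = sorted(set(cluster_labels))
    return {label: index for index, label in enumerate(unique_sorted, start=start)}


# Made-up Cluster50ID values, given in two different row orders.
labels = ["clu_B", "clu_A", "clu_B", "clu_C"]
shuffled = ["clu_C", "clu_B", "clu_A", "clu_B"]

mapping = assign_family_ids(labels)
assert mapping == assign_family_ids(shuffled)  # row order does not matter
print(mapping)  # {'clu_A': 1, 'clu_B': 2, 'clu_C': 3}
```

A stable mapping of this kind matters for `--split-type id-family`, because sequences in the same cluster must land in the same group across runs for the split to stay leakage-aware.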
+
+**Next stages after prepare**
+
+- run `pepseqpred-esm` with `--embedding-key-mode id-family` and each dataset's `prepared_embedding_metadata.tsv`
+- run `pepseqpred-labels` with `--embedding-key-delim -`
+- train with `--split-type id-family`
+
+### Stage 1 (Legacy): PV1 Z-Score Preprocess

 **CLI:** `pepseqpred-preprocess` (`src/pepseqpred/apps/preprocess_cli.py`)

@@ -337,6 +421,7 @@ Bundled pretrained registry currently includes:

 | CLI | File | Purpose |
 | --- | --- | --- |
+| `pepseqpred-prepare-dataset` | `apps/prepare_dataset_cli.py` | normalize PV1/CWP/BKP into shared training contract |
 | `pepseqpred-preprocess` | `apps/preprocess_cli.py` | metadata + z-score preprocessing |
 | `pepseqpred-esm` | `apps/esm_cli.py` | ESM-2 embedding generation |
 | `pepseqpred-labels` | `apps/labels_cli.py` | residue label shard generation |