Commit d3c43c0
feat!(website, prepro, backend, config, integration):multi pathogen - refactor multi segment submission in backend and edit page and have prepro assign segments (#5382)
resolves #4999, #4708, #4734, #5511
partially resolves #5392, #5185 (comment)
includes work done in #5398 and #5402
This PR additionally fixes submission, subtype assignment and search for
EVs and other multi-pathogen organisms.
### BREAKING CHANGES
When users submit to multi-segmented organisms and want to group
multiple segments under one metadata entry, they are now required to add
an additional `fastaIds` column with a space-separated list of the
`fastaId`s (fasta header IDs) of the respective sequences. If no
`fastaIds` column is supplied, the `submissionId` will be used instead
and the backend will assume that (as in the single-segmented case) there
is a one-to-one mapping of metadata `submissionId` to `fastaId`.
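The rule above can be sketched in Python as follows. This is only an illustration: the helper name `fasta_ids_for_entry` and the row-as-dict representation are assumptions, not the backend's actual code (the backend is written in Kotlin). Treating an empty `fastaIds` value like a missing column is also an assumption.

```python
def fasta_ids_for_entry(row: dict[str, str]) -> list[str]:
    """Return the fastaIds grouped under one metadata entry.

    Hypothetical helper: `row` is one metadata row keyed by column name.
    """
    # Assumption: an empty value is treated the same as a missing column.
    raw = (row.get("fastaIds") or "").strip()
    if not raw:
        # No fastaIds given: assume a one-to-one mapping to the submissionId.
        return [row["submissionId"]]
    # The column holds a space-separated list of fasta header IDs.
    return raw.split()
```

For a row such as `{"submissionId": "sample1", "fastaIds": "seqL seqM seqS"}` this yields three fastaIds, while a row without the column falls back to `["sample1"]`.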
This new submission structure was voted for in microbioinfo:
https://microbial-bioinfo.slack.com/archives/CB0HYT53M/p1760961465729399
and discussed in
https://app.nuclino.com/Loculus/Development/2025-10-20-Weekly-6d5fe89f-8ded-4286-b892-d215e0a498f6
(and in other meetings)
Nextclade sort (uses a minimizer index for fast local alignment) or
nextclade align (full sequence alignment to reference) will be used to
assign segments/subtypes for all multi-segmented and multi-pathogen
sequences (this is also done in ingest for grouping segments):
```
segment_classification_method: "minimizer" or "align"
minimizer_url: <url_to_minimizer_index_used_by_nextclade_sort>
```
For organisms without a nextclade dataset we still allow the fasta
headers to be used to determine the segment/subtype. Entries must have
the format `<submissionId>_<segmentName>` (as in the current setup).
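As a rough sketch of how a header in the `<submissionId>_<segmentName>` format could be split, assuming the segment name is whatever follows the last underscore and that segment names themselves contain no underscore. The function name `split_fasta_header` is hypothetical and not taken from the preprocessing code.

```python
def split_fasta_header(header: str, segment_names: set[str]) -> tuple[str, str]:
    """Split '<submissionId>_<segmentName>' into its two parts.

    Hypothetical helper; splits on the LAST underscore so that
    submissionIds containing underscores are still handled.
    """
    submission_id, sep, segment = header.rpartition("_")
    if not sep or not submission_id:
        raise ValueError(f"Header {header!r} is not of the form <submissionId>_<segmentName>")
    if segment not in segment_names:
        raise ValueError(f"Unknown segment {segment!r} in header {header!r}")
    return submission_id, segment
```

With the CCHF segments `{"L", "M", "S"}`, the header `sample_1_L` would be split into `sample_1` and `L`.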
As preprocessing now assigns segments, it returns a map from the
segment (or subtype) to the fastaId in the processedData. The map is
called `sequenceNameToFastaId`. This allows us to surface the segment
assignment on the edit page.
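To illustrate the shape of such a map, here is a small Python sketch. The function name, the `(fastaId, segment)` input pairs and the error on duplicate assignments are assumptions made for this example; only the key name `sequenceNameToFastaId` comes from this PR.

```python
def build_sequence_name_to_fasta_id(assignments: list[tuple[str, str]]) -> dict[str, str]:
    """Build a segment -> fastaId map from (fastaId, segment) pairs.

    Hypothetical helper; rejecting two sequences assigned to the same
    segment is an assumption, not confirmed pipeline behaviour.
    """
    mapping: dict[str, str] = {}
    for fasta_id, segment in assignments:
        if segment in mapping:
            raise ValueError(
                f"Segment {segment!r} assigned to both {mapping[segment]!r} and {fasta_id!r}"
            )
        mapping[segment] = fasta_id
    return mapping


# Example payload fragment, shaped like the map described above.
processed_data_fragment = {
    "sequenceNameToFastaId": build_sequence_name_to_fasta_id(
        [("seqA", "L"), ("seqB", "M"), ("seqC", "S")]
    )
}
```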
### Nextclade Preprocessing pipeline config changes
Instead of having a dictionary for the nextclade datasets and servers we
make `nucleotideSequences` a dictionary where each item includes all
information required to run nextclade. I.e. we change from:
```
nextclade_dataset_name:
  L: nextstrain/cchfv/linked/L
  M: nextstrain/cchfv/linked/M
  S: nextstrain/cchfv/linked/S
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
genes: [RdRp, GPC, NP]
```
to:
```
nextclade_sequence_and_datasets:
  - name: L
    nextclade_dataset_name: nextstrain/cchfv/linked/L
    nextclade_dataset_tag: <optional - was previously incorrectly placed on an organism level>
    nextclade_dataset_server: <optional, overwrites nextclade_dataset_server for this seq>
    accepted_sort_matches: <optional, used for classify_with_nextclade_sort and require_nextclade_sort_match, if not given nextclade_dataset_name and name are used>
    gene_prefix: <optional, prefix to add to genes produced by nextclade run, e.g. nextclade labels genes as `AV1` but we expect `EV1_AV1`, here `EV1` would be the prefix>
    genes: [RdRp]
  - name: M
    nextclade_dataset_name: nextstrain/cchfv/linked/M
    genes: [GPC]
  - name: S
    nextclade_dataset_name: nextstrain/cchfv/linked/S
    genes: [NP]
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
segment_classification_method: <optional, default for multi-segmented viruses is align - if you assign segments in ingest for grouping use the same option here as you use there, e.g. "minimizer" or "align">
minimizer_url: <optional, url_to_minimizer_index_used_by_nextclade_sort>
```
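For anyone updating an existing config by hand, the restructuring above could be scripted roughly as below. This is a sketch under assumptions: `migrate_nextclade_config` is a hypothetical helper, not part of Loculus, and because the old format lists genes once for the whole organism, the gene-to-segment assignment has to be supplied explicitly via the (also hypothetical) `genes_by_segment` argument.

```python
def migrate_nextclade_config(old: dict, genes_by_segment: dict[str, list[str]]) -> dict:
    """Convert the old per-organism layout into the per-sequence list layout.

    Hypothetical migration helper for illustration only.
    """
    # Keep every top-level key whose structure does not change.
    new = {k: v for k, v in old.items() if k not in ("nextclade_dataset_name", "genes")}
    # One list item per segment, carrying its own dataset name and genes.
    new["nextclade_sequence_and_datasets"] = [
        {
            "name": segment,
            "nextclade_dataset_name": dataset,
            "genes": genes_by_segment.get(segment, []),
        }
        for segment, dataset in old["nextclade_dataset_name"].items()
    ]
    return new


old_config = {
    "nextclade_dataset_name": {
        "L": "nextstrain/cchfv/linked/L",
        "M": "nextstrain/cchfv/linked/M",
        "S": "nextstrain/cchfv/linked/S",
    },
    # Placeholder URL for the example, not a real dataset server.
    "nextclade_dataset_server": "https://example.org/data_output",
    "genes": ["RdRp", "GPC", "NP"],
}
new_config = migrate_nextclade_config(
    old_config, {"L": ["RdRp"], "M": ["GPC"], "S": ["NP"]}
)
```

Optional per-sequence keys such as `nextclade_dataset_tag` or `gene_prefix` are not derivable from the old layout and would still need to be added manually.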
### Ingest Pipeline Config changes
`minimizer_index` is renamed to `minimizer_url` for consistency. The
option can be used in both ingest and preprocessing, and both should be
set to the same value.
### Optional additional Config changes
Limit the number of sequences the backend will accept per submission by
using the following setting (should be added for multi-segmented
organisms):
```
submissionDataTypes: &defaultSubmissionDataTypes
  consensusSequences: true
  maxSequencesPerEntry: 1
```
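The kind of check this setting implies could look like the following Python sketch. The function name and the convention that `None` means "no limit" are assumptions for illustration; the actual validation lives in the Kotlin backend and may differ.

```python
from typing import Optional


def check_sequence_count(
    submission_id: str,
    fasta_ids: list[str],
    max_sequences_per_entry: Optional[int],
) -> None:
    """Raise if an entry groups more sequences than the configured limit.

    Hypothetical helper; `None` is assumed to mean that no limit is configured.
    """
    if max_sequences_per_entry is None:
        return
    if len(fasta_ids) > max_sequences_per_entry:
        raise ValueError(
            f"Entry {submission_id!r} has {len(fasta_ids)} sequences, "
            f"but at most {max_sequences_per_entry} are allowed"
        )
```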
### Testing
You can use pathoplexus/example_data#16 and
pathoplexus/dev_example_data#2 for testing.
### PR Checklist
- [x] Update values.schema.json and other READMEs
- [x] add fastaId to commonMetadata (ensure it is downloaded in
templates): #5561
- [x] Fix how genes are returned (will cause a config update):
#5563
- [x] Improve prepro code (less duplication and more tests):
#5554
- [x] ingest EVs as single segmented to ensure search works:
#5511
- [x] keep tests for alignment NONE case
- [x] Create a minimizer for tests using:
https://github.com/loculus-project/nextclade-sort-minimizer-creator
- [x] Any manual testing that has been done is documented: EVs from the
test folder were submitted with the same fastaHeader as the
submissionId -> this succeeded. Additionally, the submission of CCHF with
a fastaID column in the metadata was tested (also in the folder above),
as was the revision of a segment
- [x] Have preprocessing send back a segment: fastaHeader mapping
- ~add integration testing for full EV submission user journey~ -> will
be done in a later PR
- [x] improve CCHF minimizer (some segments are again not assigned)
- [x] discuss if the originalData dictionary should be migrated
(persistent DB has segmentName as key, now we have fastaHeader as key)
-> decided against
- [x] update PPX docs with new multi-segment submission format -> test
PR here: pathoplexus/pathoplexus#759
- [x] update example data for demo
🚀 Preview: https://edit-page-anya.loculus.org
---------
Co-authored-by: Cornelius Roemer <cornelius.roemer@gmail.com>
Co-authored-by: Fabian Engelniederhammer <92720311+fengelniederhammer@users.noreply.github.com>
Co-authored-by: Theo Sanderson <theo@sndrsn.co.uk>
1 parent afac5f3 commit d3c43c0
113 files changed
Lines changed: 3193 additions & 1166 deletions
File tree
- backend
- docs/db
- src
- main
- kotlin/org/loculus/backend
- api
- config
- controller
- model
- service/submission
- dbtables
- utils
- resources/db/migration
- test
- kotlin/org/loculus/backend
- controller/submission
- service
- utils
- resources
- docs/src/content/docs
- for-administrators
- for-users
- reference
- ingest
- scripts
- tests/expected_output_cchf
- integration-tests/tests
- fixtures
- pages
- specs
- cli
- features
- search
- test-data
- test-helpers
- kubernetes/loculus
- templates
- preprocessing
- dummy
- nextclade
- src/loculus_preprocessing
- tests
- ebola-dataset
- ebola-sudan
- main
- minimizer
- ebola-zaire/main
- ebola-multipath-dataset
- ebola-sudan
- ebola-zaire
- minimizer
- website
- src
- components
- Edit
- ReviewPage
- Submission
- FileUpload
- pages/[organism]
- metadata-overview
- submission/edit/[accession]
- types