
v0.31.0: S3 staging + Peptide types + sequence-derived rescore features#60

Merged
jspaezp merged 32 commits into main from feat/s3-staging on May 5, 2026

Conversation


jspaezp (Collaborator) commented Apr 22, 2026

Summary

Three concurrent workstreams landing as v0.31.0:

S3 staging (tims_stage crate)

  • New tims_stage crate: URI resolution, open_reader, upload_file, staging backend, tar walker, prefix materializer, sweep.
  • timsseek_cli and speclib_build_cli accept S3 URIs for inputs and outputs.
  • Layer-0 I/O primitives promoted out of timscentroid; TimedStep moved down a layer.
  • Perf: parallel idx load, streaming tar payload fetch (see load_bench example).
  • aws is now a default cargo feature.
  • MinIO smoke test + S3 usage notes.

Peptide / ParsedSequence types

  • New Peptide + ParsedSequence; parse_sequence bridges rustyms.
  • normalize_to_proforma handles DIA-NN / short-form strings (shared).
  • Threaded through speclib, scoring, parquet writer (ScoringFields carries Peptide, not String).
  • DIA-NN loader uses modified_peptide for Peptide.raw.
  • All-or-none parse gate; parsable_sequences flag emitted into parquet metadata.
  • DigestSlice -> ProteinSlice rename; dead decoy helpers / NonReversedDecoy removed.
  • 22 sequence-derived rescore features (gated on parse success).

Feature stats + tracing

  • Per-fold feature means + gain importance sidecar.
  • init_tracing split; tracing to file by default, per-file timing, real output URI.
  • Span events emitted as jsonl sibling; scripts/plot_spans.py for gantt plots.
  • ~ expansion in local URIs before filesystem probes.

Release / build

  • Workspace 0.30.0 -> 0.31.0 (Cargo + pyproject).
  • Release binary 90MB -> 25MB: rustyms gnome.dat stub via patch-crate (jspaezp fork), strip = "symbols", arrow default features dropped.
  • timsquery_pyo3 extension-module behind a cargo feature (fixes downstream linkage).
  • DATA_FLOW.md updated.

Test plan

  • cargo build (default + --no-default-features) green on HEAD b71e54a
  • MinIO smoke test for S3 paths
  • End-to-end run on real dataset before tag
  • Tag v0.31.0 after merge

jspaezp added 30 commits April 21, 2026 11:51
Add StorageProvider::open (non-creating), head, exists, get_bytes,
range_get (with short-read detection), list_capped, and get_to_file.
Route write_bytes through object_store::buffered::BufWriter so single-
PUT vs multipart upload is automatic. Add SerializationError::
PrefixCapExceeded variant for capped listing errors. Foundation for
the upcoming tims_stage crate.

Drive-by: import ProgressFinish in speclib_build_cli/pipeline.rs
(missing since ad3c0c3 introduced the usage) to unblock the
workspace build.

task fmt run; includes whitespace-only updates to
timsseek_cli/processing.rs.
Relocate TimedStep + LABEL_WIDTH from timsseek::scoring::timings into
timscentroid so the upcoming tims_stage crate can reuse it without a
layer inversion. timsseek's scoring::timings keeps its scoring-specific
report types (ScoreTimings, PrescoreTimings, PipelineReport, etc.) and
re-exports TimedStep from timscentroid so existing caller paths
(timsseek::TimedStep, timsseek::scoring::TimedStep,
timsseek::scoring::timings::TimedStep) all continue to work.
New workspace crate housing Layers 1 (resolution) and 2 (staging
policy) from the S3 staging design. Empty module stubs + full
StageError enum + redact_uri helper (strips query strings so
presigned URLs don't leak into error messages). Depends on
timscentroid for Layer-0 I/O primitives; no direct tokio/aws deps.
Feature flags aws/gcp/azure pass through to timscentroid.
Fill the scaffolded modules:

- uri: parse_uri_shape (LocKind + NameKind), sidecar_of, split_uri
  (unified splitter used by every remote call site), canonical_uri
  (trailing-slash-stripped, case-sensitive for S3, canonicalize for
  local).
- resolve: Resolved enum (LocalIdx/RemoteIdx/LocalDotD/Stageable) +
  SourceSpec (S3Tar/S3Prefix/LocalTar) + resolve() decision tree.
  Sidecar probe HEADs `<sidecar_uri>/metadata.json` because the
  `.idx` is a directory, not a blob. Uses StorageProvider::open so
  resolution never materializes local directories as a side effect.
- open: open_reader(uri) returns Box<dyn Read + Send>. Remote URIs
  are buffered whole into Cursor<Bytes> to keep rayon-safety.
- upload: upload_file(local, dest_uri). Local -> local is direct;
  local -> remote routes through StorageProvider::new (BufWriter).

Errors wrap transport failures with redact_uri() so presigned-URL
query strings never leak into user-visible messages.
…weep

Layer 2 landed:

- StagingBackend trait + StagedDotD RAII handle + PerRunTempdir impl
  (namespaced timsseek-staging- tempdir prefix).
- run_on_staged free-function wrapper so callers can scope tempdirs
  via a closure; StagingBackend stays dyn-object-safe.
- TarReader trait with S3 (range-GET + 64 KiB prefetch) and local
  (File+seek+read_exact) impls. ustar walker is typeflag-permissive:
  regular files recorded, pax x/g skipped, GNU L/K hard-error. Early
  exit once both required basenames located.
- S3-prefix materializer: list_capped capped at max_prefix_keys,
  required-basename preflight, copy via get_to_file with per-file
  progress bar.
- Startup sweep in PerRunTempdir::new reclaims stale
  timsseek-staging-<id> tempdirs older than stale_sweep_age_hours.
  Honors .lock sentinel files written at materialization time so
  long-running concurrent stages aren't swept.

Tests: 31 lib (error + uri + resolve + backend + walker) + 2
local_tar_staging + 2 prefix_staging + 4 sweep (with filetime dev-dep
for mtime backdating).
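
The typeflag policy of the walker described above (regular files recorded, pax x/g skipped, GNU L/K hard-error) can be sketched as a single dispatch. This is an illustrative sketch, not the crate's code: `EntryAction` and `classify_typeflag` are invented names.

```rust
// Illustrative sketch of the typeflag-permissive ustar walk policy.
// EntryAction and classify_typeflag are invented names for this example.
#[derive(Debug, PartialEq)]
enum EntryAction {
    Record, // regular file: record its offset + length
    Skip,   // harmless entry types, walked past
    Error,  // unsupported entries that abort the walk
}

fn classify_typeflag(flag: u8) -> EntryAction {
    match flag {
        b'0' | 0 => EntryAction::Record,   // ustar regular file (or old-style NUL flag)
        b'x' | b'g' => EntryAction::Skip,  // pax per-file / global extended headers
        b'L' | b'K' => EntryAction::Error, // GNU longname / longlink: hard error
        _ => EntryAction::Skip,            // permissive default for other entry types
    }
}
```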
…ld_cli

Phase 10: timsquery gains `load_index(uri, backend, save_sidecar,
centroid_cfg)` — the eager-path orchestrator. Dispatches on
tims_stage::resolve: LocalIdx/RemoteIdx go straight through
IndexedTimstofPeaks::load_from_storage; LocalDotD builds directly;
Stageable stages via the backend, builds, optionally writes the
.idx sidecar via save_to_storage. Viewer's lazy path is untouched.

Phase 11: timsseek_cli config fields renamed with serde + clap
aliases preserving back-compat:
  dotd_files    -> raw_inputs       (Vec<String>)
  speclib.path  -> speclib.uri      (String)
  output.directory -> output.uri    (String)
New [staging] TOML block (tempdir_root, max_prefix_keys,
save_sidecar, stale_sweep_age_hours). Main loop constructs one
PerRunTempdir; each sample calls load_index(&raw_uri, &backend,
save_sidecar, ...). get_frag_range now reads isolation geometry
from the already-built index (index.fragmented_range()), so S3
inputs no longer need the .d post-load.

OutputSink handles local vs remote outputs — writes into a
tempdir and uploads per-sample via tims_stage::upload_file when
the URI is s3://... Also routes config_used.json and run_report.json
through finalize_run at the end of a batch.

Speclib loading uses a small speclib_from_uri helper: local URIs
open directly; remote URIs stream via tims_stage::open_reader
into a named tempfile that preserves the extension (so format
sniffing still works).

Phase 13: speclib_build_cli accepts URIs for --fasta,
--peptide-list, and --output. FASTA downloads to a tempfile
(ProteinSequenceCollection::from_fasta_file is path-only);
peptide list wraps open_reader in BufReader directly; output
goes to a local tempfile then upload_file when --output is
remote.

Tests pass across the workspace (excluding the pyo3 bindings that
already fail locally for unrelated FFI-link reasons).
Env-gated MinIO smoke test behind the aws feature (no-ops when
MINIO_TEST_ENDPOINT is unset). README gets a compact S3-inputs
section; docs/development.md documents the [staging] block + env
vars + MinIO invocation.
Remove StagedDotD::borrowed, run_on_staged, and StageError::NotFound
— none had a caller. The staging API is dyn-safe via StagingBackend
alone; add ergonomic wrappers again if a concrete need shows up.

Note on the final review's other blocking item (missing SIGINT/SIGTERM
handler wired to ACTIVE_STAGE): this was intentionally dropped during
the plan v2 rework in favor of the startup sweep. The spec still
describes the handler; it needs updating to match the shipped design.
Move `extension-module` from an unconditional dep feature to an opt-in
crate-level feature. Plain `cargo build -p timsquery_pyo3` (and
`cargo test --workspace`) now link libpython the normal way — set
PYO3_PYTHON to the uv venv python, e.g.
`/Users/.../timsbuktoolkit/.venv/bin/python3`. Maturin keeps using the
extension-module feature via pyproject's [tool.maturin] features list,
so wheel builds are unchanged.

Before this change, the unconditional `pyo3/extension-module`
suppressed libpython linking unconditionally, which worked for maturin
but left plain cargo builds with unresolved `Py_None` / `PyFloat_*`
symbols at link time.
…example

Three wins measured against a 1.1 GB .idx and a 4 GB .d.tar on S3:

1. timscentroid: drop the serial cloud-load heuristic. The comment
   claimed serial was better for S3 (avoid connection overhead); benched
   otherwise — parallel reads saturate the link at ~19 MB/s vs ~7 MB/s
   serial. load_from_storage goes 154s -> 58s.

2. tims_stage::tar: add TarReader::copy_range_to_file that streams one
   range-GET per payload into the local file. The previous loop issued
   a separate 4 MiB range-GET per chunk (500+ HTTP round-trips for a
   2 GB analysis.tdf_bin). Tar stage goes 544s -> 120s and now beats
   S3 prefix staging (167s) since it avoids the list round-trip.

3. timscentroid: new StorageProvider::range_get_to_file — single
   bounded GET whose response body streams to disk, no in-memory
   buffering of the full range. Used by the new TarReader impl.

Also: aws/gcp/azure passthrough features on timsseek_cli and
speclib_build_cli so release builds can link the AWS SDK via
`cargo build --release --features aws -p timsseek_cli`.

Adds rust/timsseek_cli/examples/load_bench.rs — reads LOAD_BENCH_URI,
resolves, stages/loads, emits per-phase timings via tracing so each
resolve() branch can be timed in isolation.

Removes load_from_storage_serial (dead).
…m_uri bug

Three review agents flagged overlapping issues. Consolidate:

- New crate::common with REQUIRED_DOTD_FILES, make_bar, transport_err.
  Removes byte-identical make_bar + two REQUIRED constants + ~8
  copies of the Transport error wrap across tar.rs and prefix.rs.
- Single pub tims_stage::is_remote_uri. Removes four copies of the
  three-scheme check (timsseek_cli main.rs, speclib_build_cli
  pipeline.rs, tims_stage open.rs, tims_stage upload.rs) and the
  inline form in timsquery index_serde.rs and resolve.rs.
- Downgrade leaky pubs in tims_stage (UriShape, LocKind, NameKind,
  parse_uri_shape, redact_uri, TarReader, S3TarReader, LocalTarReader)
  to pub(crate). They're internal plumbing; nothing outside the crate
  uses them.
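
The consolidated three-scheme check might look like the following minimal sketch; the real `tims_stage::is_remote_uri` may differ in detail, and the `gs://` / `az://` scheme spellings are an assumption here (the PR only names the aws/gcp/azure features).

```rust
/// Minimal sketch of the single shared remote-URI check described above.
/// Scheme list beyond s3:// is assumed, not confirmed by the PR.
pub fn is_remote_uri(uri: &str) -> bool {
    ["s3://", "gs://", "az://"].iter().any(|s| uri.starts_with(s))
}
```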

Correctness fix: sample_name_from_uri previously used short-circuit
`.or_else` which stripped only one suffix. `sample.d.tar` -> strip
`.d` fails -> strip `.tar` succeeds -> returns `sample.d` (with the
`.d`). `sample.d/` -> strip `.d` succeeds -> returns `sample`.
Result: the same dataset under tar vs prefix inputs produced
DIFFERENT output subdir names (`sample.d/` vs `sample/`), visible
in the PR3 smoke logs. Fix: loop-strip known suffixes; all four
inputs now collapse to `sample`. Add 4 unit tests covering each
shape.
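
The loop-strip fix can be sketched as follows (names are illustrative, not the crate's exact API): keep removing known dataset suffixes until none apply, so `sample.d.tar`, `sample.d/`, `sample.d`, and `sample.tar` all collapse to `sample`.

```rust
/// Illustrative sketch of the loop-strip suffix fix described above.
fn sample_name_from_uri(uri: &str) -> String {
    // Take the last path segment, ignoring a trailing slash on prefix-style URIs.
    let seg = uri.trim_end_matches('/').rsplit('/').next().unwrap_or(uri);
    let mut name = seg;
    loop {
        let before = name;
        for suffix in [".tar", ".d"] {
            if let Some(stripped) = name.strip_suffix(suffix) {
                name = stripped;
            }
        }
        if name == before {
            break; // nothing left to strip
        }
    }
    name.to_string()
}
```

Unlike a short-circuit `.or_else` chain, the loop keeps stripping until the name is stable, so tar and prefix inputs for the same dataset produce the same output subdirectory.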

Efficiency fix: speclib_from_uri for remote URIs previously did
open_reader (full Bytes buffer in RAM) -> io::copy -> tempfile ->
Speclib::from_file (re-read from disk). Replace with
StorageProvider::get_to_file (one bounded streaming GET directly to
tempfile). Removes the in-memory buffer; same disk footprint.
…erwrite)

- timsquery: replace NoMS2DataError placeholder with LazyMaterializationUnsupported,
  CacheDisabled, InvalidCacheUrl; drop unused bincode/zstd deps.
- timscentroid: fix ErrorKind::Other deprecation (io::Error::other).
- tims_stage: add PayloadTooLarge variant + DEFAULT_IN_MEMORY_CAP/DEFAULT_UPLOAD_CAP
  size guards on open_reader/upload_file; new download.rs streaming helper; uri_exists
  HEAD probe; MinIO smoke test now #[ignore]'d + self-seeds fixture.
- timsseek_cli: proactive overwrite check probes every artifact (local Path::exists
  / remote HEAD) up-front before analysis, aborts with collision list; calib_lib
  PathBuf -> URI String; unify remote speclib load via download_to_file; ARTIFACT-LIST
  comments at writer sites keep list in sync with validate_inputs.
- timsseek: RunStatus::{Completed,Aborted} + abort_reason on RunReport so consumers
  can tell a partial run from a finished one.
…> 25MB)

Feature defaults:
- timsseek_cli and speclib_build_cli ship with [aws] by default so released
  binaries can read/write s3:// URIs out of the box. GCS/Azure stay opt-in.

Binary trim (timsseek 89.9 MB -> 25.1 MB, speclib_build 87.8 -> 14.2 MB):
- [profile.release] strip = "symbols" (~3-4 MB on macOS, more on Linux).
- arrow / rustyms default-features = false: drops csv/ipc/json from arrow
  (already unused) and mzdata/zeno/swash/ndarray from the rustyms tree.
- patch-crate mechanism replaces rustyms 0.11.0's 60 MB gnome.dat glycan
  ontology (unused here) with a panic pointing to the issue tracker. Diff
  lives at patches/rustyms+0.11.0.patch (1.7 KB). Workspace [patch.crates-io]
  points at target/patch/rustyms-0.11.0/ which `task setup` populates.
- `task setup` target installs patch-crate and applies the patch on fresh
  clones. CI runs the same step in rust.yml and release.yml before build.

Misc:
- Bump actions/checkout@v4 -> v6 in release.yml to match rust.yml.
- Gitignore scratch files: run_remote.bash, stage.bash, here.toml.
- Commit python/timsquery_pyo3/uv.lock for reproducible Python builds.
… patch

Use the jspaezp/cargo-patch-crate fork in `task setup` and CI instead of the
crates.io crate. Two wins:

1. Fork drops the cargo-as-library dep (`cargo = "0.96"`) that pulled
   libsqlite3-sys and conflicted with timsrust's rusqlite (both declared
   `links = "sqlite3"`). Install time drops from ~1-2 min (sqlite C compile)
   to ~15s.

2. Fork commit 13777f1 handles the bootstrap paradox — upstream's
   cargo_metadata subprocess fails on fresh checkouts because the workspace's
   [patch.crates-io] points at target/patch/rustyms-0.11.0/ which doesn't
   exist yet. The fork falls back to direct Cargo.toml parsing when metadata
   resolution fails.

Install command specifies `patch-crate` package because the fork added a
second binary crate (broken-patch-fixture) for regression testing; without
the name, `cargo install --git` errors with "multiple packages with
binaries found".

Also strip the binary `gnome.dat` deletion from the patch file. `git apply`
rejected "Binary files differ" without --binary delta, which broke
`task setup`. Only the ontology.rs text patch is needed: removing the
include_bytes!("databases/gnome.dat") call prevents the 60MB file from
being baked into the binary regardless of whether it lingers on disk in
target/patch/.

Verified end-to-end:
- cargo uninstall patch-crate && rm -rf target/patch/
- task setup  -> installs fork (13s), applies patch
- cargo build --release --bin timsseek  -> 25.1 MB
- Extract init_tracing() from run(), dropping run() from ~340 to ~260 lines.
  Instrumentation-feature flush guard is held via a TracingHandle struct
  whose lifetime spans the caller scope — Box<dyn Any> erases the
  tracing-profile guard type so the outer scope doesn't need the feature
  gate.
- Add deny_unknown_fields to every SpeclibBuildConfig sub-struct so typos
  in a speclib-build TOML error out instead of being silently ignored —
  matches timsseek_cli's config drift discipline.
- docs/development.md: first-time-setup section pointing at `task setup`
  so fresh clones know to run it before `cargo build`.
- timsseek_cli Cargo.toml: comment updated to reflect that build.rs
  automation remains impossible — the jspaezp/cargo-patch-crate fork fixes
  the tool's own bootstrap paradox but cargo still reads [patch.crates-io]
  before any build script runs.
`Path::exists()` and `File::open()` don't recognise `~`; paths pulled from
config files or unquoted CLI args kept the literal `~/` prefix, so
validate_inputs rejected real files with "Speclib file does not exist"
even though they were on disk.

Add `tims_stage::expand_local_uri(&str) -> String` which replaces `~/` /
bare `~` with `$HOME` for local paths, passes remote URIs through
unchanged. Apply at the top of validate_inputs to canonicalise speclib,
calib_lib, raw_inputs, and output_uri once — every downstream consumer
(existence probes, staging, uploads) sees the expanded form.

Unit test covers tilde-prefix, bare tilde, absolute path, relative path,
and remote URI pass-through.
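
A minimal sketch of that behavior, assuming a simplified remote-scheme list (the real `tims_stage::expand_local_uri` may differ in detail):

```rust
use std::env;

/// Sketch of expand_local_uri: replace a leading `~` with $HOME for local
/// paths; remote URIs pass through unchanged. Scheme list is assumed here.
fn expand_local_uri(uri: &str) -> String {
    if ["s3://", "gs://", "az://"].iter().any(|s| uri.starts_with(s)) {
        return uri.to_string(); // remote URIs are never tilde-expanded
    }
    let home = env::var("HOME").unwrap_or_default();
    if uri == "~" {
        home
    } else if let Some(rest) = uri.strip_prefix("~/") {
        format!("{}/{}", home, rest)
    } else {
        uri.to_string() // absolute and relative paths untouched
    }
}
```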
…output URI

Three user-visible CLI fixes.

1. Tracing routing. Before: vanilla `timsseek` invocations with no
   --log-path and no local output dir dumped all tracing spans to the
   terminal. Now: tracing always goes to a file unless `--log-path -`
   explicitly opts in to stderr. Resolution order:
     --log-path -                 -> stderr
     --log-path <p>               -> <p>
     local output.uri set         -> <output>/timsseek-<ts>.log
     remote/no output             -> ./timsseek-<ts>.log in CWD
   Timestamp (%Y%m%dT%H%M%S, local time via chrono::Local) prevents
   clobbering repeat runs in the same directory. chrono added as a
   direct dep on timsseek_cli (default-features = false, ["clock"]);
   already in the dep tree transitively via aws.

2. Per-file wall-clock footer. The batch loop now prints
     === [N/M] sample ===
     ... phase output ...
     Output: <dest>
     === [N/M] sample done in <duration> ===
   so users see total time per input even when multiple files are
   processed. Failure paths (upload abort, I/O error) print a matching
   failed footer.

3. Output URI. `run_pipeline` used to `println!("Output: ...")` with the
   local working path. For remote outputs, that path was a tempdir that
   got cleaned up immediately after upload — users saw `/tmp/...` for
   files actually landing at s3://... Now OutputSink exposes
   `dest_uri_for_sample` (final destination, not working path); main.rs
   per-file footer prints that URI. processing.rs drops the misleading
   Output: line. finalize_sample was also unified on the helper, pulling
   the `format!("{}/{}/{}", ...)` inline duplicate out.
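
The log-path resolution order in fix 1 can be sketched as a single match. `LogTarget` and `resolve_log_target` are invented names for this example, and the remote check is simplified to `s3://` only:

```rust
/// Illustrative sketch of the tracing-destination resolution order above.
#[derive(Debug, PartialEq)]
enum LogTarget {
    Stderr,
    File(String),
}

fn resolve_log_target(log_path: Option<&str>, output_uri: Option<&str>, ts: &str) -> LogTarget {
    match log_path {
        Some("-") => LogTarget::Stderr, // explicit opt-in to terminal output
        Some(p) => LogTarget::File(p.to_string()),
        None => match output_uri {
            // Local output dir: put the log next to the results.
            Some(out) if !out.starts_with("s3://") => {
                LogTarget::File(format!("{}/timsseek-{}.log", out, ts))
            }
            // Remote or missing output: fall back to the current directory.
            _ => LogTarget::File(format!("./timsseek-{}.log", ts)),
        },
    }
}
```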
…ript

Why: want to visualize where wall time is spent across phases, per file, and
per finalize stage. Existing text log has span context but no enter/close
events and no durations, so reconstructing a Gantt needed guessing.

Changes:
- enable tracing-subscriber "json" feature
- init_tracing: alongside the text log, open <log>.spans.jsonl as a JSON fmt
  layer with FmtSpan::NEW|CLOSE so every span emits an open + a close event
  carrying time.busy / time.idle; skipped when --log-path -
- add info_span!("file", name=<basename>) wrapping process_single_file so
  multi-file runs render as distinct lanes
- wrap run-report write + sink.finalize_run in TimedStep("Finalize run") so
  the tail of a run appears as its own span
- scripts/plot_spans.py: parse spans.jsonl, match NEW/CLOSE per
  (thread, name), draw Gantt by (thread, depth) lane, label with
  duration + busy%. --min-ms filters out sub-noise spans. Optional
  --wandb RUN_NAME uploads the PNG as a wandb.Image.

Rayon workers still show as roots (tracing current-span is thread-local).
Worker parent attach deferred until a specific hot path needs it.
Switch QueryItemToScore.digest and PeptideMetadata.digest from ProteinSlice
to Peptide (raw Arc<str> + Option<ParsedSequence> + decoy fields). Add
apply_parse_gate() that zeros parsed on every entry if any entry fails to
parse, sets Speclib.meta accordingly. Gate runs after all decoy generation
at both loader entry points (from_file, from_file_with_format). Add
test_parse_gate_off_on_poisoned_row and test_parse_gate_on_when_all_parsed.
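
The all-or-none gate can be sketched with simplified stand-in types (the real entries carry full `Peptide` / `ParsedSequence` data):

```rust
/// Sketch of the parse gate: if any entry failed to parse, clear `parsed`
/// on every entry so downstream feature vectors stay the same length.
#[allow(dead_code)]
struct Entry {
    raw: String,
    parsed: Option<Vec<char>>, // stand-in for Option<ParsedSequence>
}

fn apply_parse_gate(entries: &mut [Entry]) -> bool {
    let all_parsed = entries.iter().all(|e| e.parsed.is_some());
    if !all_parsed {
        for e in entries.iter_mut() {
            e.parsed = None; // one poisoned row zeros the whole speclib
        }
    }
    all_parsed // recorded as the parsable_sequences flag in Speclib.meta
}
```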
…equence

Replace `ScoringFields.sequence: String` with `peptide: Peptide`.
Builder setter clones the digest directly (no string conversion).
Add `Serialize` impl on `Peptide` (serializes as raw string) to satisfy
the `#[derive(Serialize)]` on `ScoringFields`. Parquet column name
kept as "sequence" for downstream compatibility; accessor swapped to
`r.scoring.peptide.as_str()`. processing.rs sort/dedup sites use
`.as_str().cmp()` / `==` comparisons.
Refactor CompetedCandidate::as_feature into paired named_features() ->
Vec<(f64, &'static str)> as a single source of truth for values and
names. Expose feature_names() for introspection. Append 22 dims
(peptide_length + 20 AA counts + peptide_n_mods) when
peptide.parsed.is_some(); the gate is speclib-wide so vector length is
stable within a fit. Add AA_COUNT_NAMES const aligned with
CANONICAL_AA_LETTERS output order. Four new feature_tests confirm
lengths match, gate-on/off presence, and delta is exactly 22.
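
The paired-name pattern can be illustrated with a reduced stand-in (the real `CompetedCandidate` carries far more features; names here are invented): one method yields `(value, name)` pairs, so names and values can never drift apart.

```rust
/// Reduced sketch of the single-source-of-truth feature pairing above.
struct Candidate {
    score: f64,
    peptide_length: Option<usize>, // None when the speclib-wide parse gate is off
}

impl Candidate {
    fn named_features(&self) -> Vec<(f64, &'static str)> {
        let mut out = vec![(self.score, "score")];
        // Sequence-derived features appear only when parsing succeeded
        // speclib-wide, so vector length is stable within one fit.
        if let Some(len) = self.peptide_length {
            out.push((len as f64, "peptide_length"));
        }
        out
    }

    fn feature_names(&self) -> Vec<&'static str> {
        self.named_features().into_iter().map(|(_, n)| n).collect()
    }
}
```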
Thread parsable_sequences: bool through ResultParquetWriter::new, write
it as a KeyValue entry in WriterProperties; pass speclib.meta.parsable_sequences
from execute_pipeline. Adds full read-back unit test via SerializedFileReader.
jspaezp added 2 commits April 23, 2026 14:09
After rescoring, compute per-fold feature means and Forust Gain importance
via CrossValidatedScorer::feature_stats(). rescore() now returns
(Vec<FinalResult>, RescoreFeatureStats). The CLI writes
results.feature_stats.json next to results.parquet by default; suppress
with --no-feature-stats.
@jspaezp changed the title from "Feat/s3 staging" to "v0.31.0: S3 staging + Peptide types + sequence-derived rescore features" on May 5, 2026
@jspaezp merged commit af901bd into main on May 5, 2026
3 checks passed