
v0.31.0: S3 staging + Peptide types + sequence-derived rescore features#60

Merged
jspaezp merged 32 commits into main from feat/s3-staging on May 5, 2026

Conversation


jspaezp (Collaborator) commented Apr 22, 2026

Summary

Three concurrent workstreams landing as v0.31.0:

S3 staging (tims_stage crate)

  • New tims_stage crate: URI resolution, open_reader, upload_file, staging backend, tar walker, prefix materializer, sweep.
  • timsseek_cli and speclib_build_cli accept S3 URIs for inputs and outputs.
  • Layer-0 I/O primitives promoted out of timscentroid; TimedStep moved down a layer.
  • Perf: parallel idx load, streaming tar payload fetch (see load_bench example).
  • aws is now a default cargo feature.
  • MinIO smoke test + S3 usage notes.

Peptide / ParsedSequence types

  • New Peptide + ParsedSequence; parse_sequence bridges rustyms.
  • normalize_to_proforma handles DIA-NN / short-form strings (shared).
  • Threaded through speclib, scoring, parquet writer (ScoringFields carries Peptide, not String).
  • DIA-NN loader uses modified_peptide for Peptide.raw.
  • All-or-none parse gate; parsable_sequences flag emitted into parquet metadata.
  • DigestSlice -> ProteinSlice rename; dead decoy helpers / NonReversedDecoy removed.
  • 22 sequence-derived rescore features (gated on parse success).

Feature stats + tracing

  • Per-fold feature means + gain importance sidecar.
  • init_tracing split; tracing to file by default, per-file timing, real output URI.
  • Span events emitted as jsonl sibling; scripts/plot_spans.py for gantt plots.
  • ~ expansion in local URIs before filesystem probes.

Release / build

  • Workspace 0.30.0 -> 0.31.0 (Cargo + pyproject).
  • Release binary 90MB -> 25MB: rustyms gnome.dat stub via patch-crate (jspaezp fork), strip = "symbols", arrow default features dropped.
  • timsquery_pyo3 extension-module behind a cargo feature (fixes downstream linkage).
  • DATA_FLOW.md updated.

Test plan

  • cargo build (default + --no-default-features) green on HEAD b71e54a
  • MinIO smoke test for S3 paths
  • End-to-end run on real dataset before tag
  • Tag v0.31.0 after merge

jspaezp added 30 commits April 21, 2026 11:51
Add StorageProvider::open (non-creating), head, exists, get_bytes,
range_get (with short-read detection), list_capped, and get_to_file.
Route write_bytes through object_store::buffered::BufWriter so single-
PUT vs multipart upload is automatic. Add SerializationError::
PrefixCapExceeded variant for capped listing errors. Foundation for
the upcoming tims_stage crate.

Drive-by: import ProgressFinish in speclib_build_cli/pipeline.rs
(missing since ad3c0c3 introduced the usage) to unblock the
workspace build.

task fmt run; includes whitespace-only updates to
timsseek_cli/processing.rs.
Relocate TimedStep + LABEL_WIDTH from timsseek::scoring::timings into
timscentroid so the upcoming tims_stage crate can reuse it without a
layer inversion. timsseek's scoring::timings keeps its scoring-specific
report types (ScoreTimings, PrescoreTimings, PipelineReport, etc.) and
re-exports TimedStep from timscentroid so existing caller paths
(timsseek::TimedStep, timsseek::scoring::TimedStep,
timsseek::scoring::timings::TimedStep) all continue to work.
New workspace crate housing Layers 1 (resolution) and 2 (staging
policy) from the S3 staging design. Empty module stubs + full
StageError enum + redact_uri helper (strips query strings so
presigned URLs don't leak into error messages). Depends on
timscentroid for Layer-0 I/O primitives; no direct tokio/aws deps.
Feature flags aws/gcp/azure pass through to timscentroid.
Fill the scaffolded modules:

- uri: parse_uri_shape (LocKind + NameKind), sidecar_of, split_uri
  (unified splitter used by every remote call site), canonical_uri
  (trailing-slash-stripped, case-sensitive for S3, canonicalize for
  local).
- resolve: Resolved enum (LocalIdx/RemoteIdx/LocalDotD/Stageable) +
  SourceSpec (S3Tar/S3Prefix/LocalTar) + resolve() decision tree.
  Sidecar probe HEADs `<sidecar_uri>/metadata.json` because the
  `.idx` is a directory, not a blob. Uses StorageProvider::open so
  resolution never materializes local directories as a side effect.
- open: open_reader(uri) returns Box<dyn Read + Send>. Remote URIs
  are buffered whole into Cursor<Bytes> to keep rayon-safety.
- upload: upload_file(local, dest_uri). Local -> local is direct;
  local -> remote routes through StorageProvider::new (BufWriter).

Errors wrap transport failures with redact_uri() so presigned-URL
query strings never leak into user-visible messages.
…weep

Layer 2 landed:

- StagingBackend trait + StagedDotD RAII handle + PerRunTempdir impl
  (namespaced timsseek-staging- tempdir prefix).
- run_on_staged free-function wrapper so callers can scope tempdirs
  via a closure; StagingBackend stays dyn-object-safe.
- TarReader trait with S3 (range-GET + 64 KiB prefetch) and local
  (File+seek+read_exact) impls. ustar walker is typeflag-permissive:
  regular files recorded, pax x/g skipped, GNU L/K hard-error. Early
  exit once both required basenames located.
- S3-prefix materializer: list_capped capped at max_prefix_keys,
  required-basename preflight, copy via get_to_file with per-file
  progress bar.
- Startup sweep in PerRunTempdir::new reclaims stale
  timsseek-staging-<id> tempdirs older than stale_sweep_age_hours.
  Honors .lock sentinel files written at materialization time so
  long-running concurrent stages aren't swept.

Tests: 31 lib (error + uri + resolve + backend + walker) + 2
local_tar_staging + 2 prefix_staging + 4 sweep (with filetime dev-dep
for mtime backdating).
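
The typeflag policy of the walker described above (regular files recorded, pax x/g skipped, GNU L/K hard-error) can be sketched as a single dispatch. This is an illustrative sketch, not the crate's code: `EntryAction` and `classify_typeflag` are invented names.

```rust
// Illustrative sketch of the typeflag-permissive ustar walk policy.
// EntryAction and classify_typeflag are invented names for this example.
#[derive(Debug, PartialEq)]
enum EntryAction {
    Record, // regular file: record its offset + length
    Skip,   // harmless entry types, walked past
    Error,  // unsupported entries that abort the walk
}

fn classify_typeflag(flag: u8) -> EntryAction {
    match flag {
        b'0' | 0 => EntryAction::Record,   // ustar regular file (or old-style NUL flag)
        b'x' | b'g' => EntryAction::Skip,  // pax per-file / global extended headers
        b'L' | b'K' => EntryAction::Error, // GNU longname / longlink: hard error
        _ => EntryAction::Skip,            // permissive default for other entry types
    }
}
```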
…ld_cli

Phase 10: timsquery gains `load_index(uri, backend, save_sidecar,
centroid_cfg)` — the eager-path orchestrator. Dispatches on
tims_stage::resolve: LocalIdx/RemoteIdx go straight through
IndexedTimstofPeaks::load_from_storage; LocalDotD builds directly;
Stageable stages via the backend, builds, optionally writes the
.idx sidecar via save_to_storage. Viewer's lazy path is untouched.

Phase 11: timsseek_cli config fields renamed with serde + clap
aliases preserving back-compat:
  dotd_files    -> raw_inputs       (Vec<String>)
  speclib.path  -> speclib.uri      (String)
  output.directory -> output.uri    (String)
New [staging] TOML block (tempdir_root, max_prefix_keys,
save_sidecar, stale_sweep_age_hours). Main loop constructs one
PerRunTempdir; each sample calls load_index(&raw_uri, &backend,
save_sidecar, ...). get_frag_range now reads isolation geometry
from the already-built index (index.fragmented_range()), so S3
inputs no longer need the .d post-load.

OutputSink handles local vs remote outputs — writes into a
tempdir and uploads per-sample via tims_stage::upload_file when
the URI is s3://... Also routes config_used.json and run_report.json
through finalize_run at the end of a batch.

Speclib loading uses a small speclib_from_uri helper: local URIs
open directly; remote URIs stream via tims_stage::open_reader
into a named tempfile that preserves the extension (so format
sniffing still works).

Phase 13: speclib_build_cli accepts URIs for --fasta,
--peptide-list, and --output. FASTA downloads to a tempfile
(ProteinSequenceCollection::from_fasta_file is path-only);
peptide list wraps open_reader in BufReader directly; output
goes to a local tempfile then upload_file when --output is
remote.

Tests pass across the workspace (excluding the pyo3 bindings that
already fail locally for unrelated FFI-link reasons).
Env-gated MinIO smoke test behind the aws feature (no-ops when
MINIO_TEST_ENDPOINT is unset). README gets a compact S3-inputs
section; docs/development.md documents the [staging] block + env
vars + MinIO invocation.
Remove StagedDotD::borrowed, run_on_staged, and StageError::NotFound
— none had a caller. The staging API is dyn-safe via StagingBackend
alone; add ergonomic wrappers again if a concrete need shows up.

Note on the final review's other blocking item (missing SIGINT/SIGTERM
handler wired to ACTIVE_STAGE): this was intentionally dropped during
the plan v2 rework in favor of the startup sweep. The spec still
describes the handler; it needs updating to match the shipped design.
Move `extension-module` from an unconditional dep feature to an opt-in
crate-level feature. Plain `cargo build -p timsquery_pyo3` (and
`cargo test --workspace`) now link libpython the normal way — set
PYO3_PYTHON to the uv venv python, e.g.
`/Users/.../timsbuktoolkit/.venv/bin/python3`. Maturin keeps using the
extension-module feature via pyproject's [tool.maturin] features list,
so wheel builds are unchanged.

Before this change, the unconditional `pyo3/extension-module`
suppressed libpython linking unconditionally, which worked for maturin
but left plain cargo builds with unresolved `Py_None` / `PyFloat_*`
symbols at link time.
…example

Three wins measured against a 1.1 GB .idx and a 4 GB .d.tar on S3:

1. timscentroid: drop the serial cloud-load heuristic. The comment
   claimed serial was better for S3 (avoid connection overhead); benched
   otherwise — parallel reads saturate the link at ~19 MB/s vs ~7 MB/s
   serial. load_from_storage goes 154s -> 58s.

2. tims_stage::tar: add TarReader::copy_range_to_file that streams one
   range-GET per payload into the local file. The previous loop issued
   a separate 4 MiB range-GET per chunk (500+ HTTP round-trips for a
   2 GB analysis.tdf_bin). Tar stage goes 544s -> 120s and now beats
   S3 prefix staging (167s) since it avoids the list round-trip.

3. timscentroid: new StorageProvider::range_get_to_file — single
   bounded GET whose response body streams to disk, no in-memory
   buffering of the full range. Used by the new TarReader impl.

Also: aws/gcp/azure passthrough features on timsseek_cli and
speclib_build_cli so release builds can link the AWS SDK via
`cargo build --release --features aws -p timsseek_cli`.

Adds rust/timsseek_cli/examples/load_bench.rs — reads LOAD_BENCH_URI,
resolves, stages/loads, emits per-phase timings via tracing so each
resolve() branch can be timed in isolation.

Removes load_from_storage_serial (dead).
…m_uri bug

Three review agents flagged overlapping issues. Consolidate:

- New crate::common with REQUIRED_DOTD_FILES, make_bar, transport_err.
  Removes byte-identical make_bar + two REQUIRED constants + ~8
  copies of the Transport error wrap across tar.rs and prefix.rs.
- Single pub tims_stage::is_remote_uri. Removes four copies of the
  three-scheme check (timsseek_cli main.rs, speclib_build_cli
  pipeline.rs, tims_stage open.rs, tims_stage upload.rs) and the
  inline form in timsquery index_serde.rs and resolve.rs.
- Downgrade leaky pubs in tims_stage (UriShape, LocKind, NameKind,
  parse_uri_shape, redact_uri, TarReader, S3TarReader, LocalTarReader)
  to pub(crate). They're internal plumbing; nothing outside the crate
  uses them.
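
The consolidated three-scheme check might look like the following minimal sketch; the real `tims_stage::is_remote_uri` may differ in detail, and the `gs://` / `az://` scheme spellings are an assumption here (the PR only names the aws/gcp/azure features).

```rust
/// Minimal sketch of the single shared remote-URI check described above.
/// Scheme list beyond s3:// is assumed, not confirmed by the PR.
pub fn is_remote_uri(uri: &str) -> bool {
    ["s3://", "gs://", "az://"].iter().any(|s| uri.starts_with(s))
}
```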

Correctness fix: sample_name_from_uri previously used short-circuit
`.or_else` which stripped only one suffix. `sample.d.tar` -> strip
`.d` fails -> strip `.tar` succeeds -> returns `sample.d` (with the
`.d`). `sample.d/` -> strip `.d` succeeds -> returns `sample`.
Result: the same dataset under tar vs prefix inputs produced
DIFFERENT output subdir names (`sample.d/` vs `sample/`), visible
in the PR3 smoke logs. Fix: loop-strip known suffixes; all four
inputs now collapse to `sample`. Add 4 unit tests covering each
shape.
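
The loop-strip fix can be sketched as follows (names are illustrative, not the crate's exact API): keep removing known dataset suffixes until none apply, so `sample.d.tar`, `sample.d/`, `sample.d`, and `sample.tar` all collapse to `sample`.

```rust
/// Illustrative sketch of the loop-strip suffix fix described above.
fn sample_name_from_uri(uri: &str) -> String {
    // Take the last path segment, ignoring a trailing slash on prefix-style URIs.
    let seg = uri.trim_end_matches('/').rsplit('/').next().unwrap_or(uri);
    let mut name = seg;
    loop {
        let before = name;
        for suffix in [".tar", ".d"] {
            if let Some(stripped) = name.strip_suffix(suffix) {
                name = stripped;
            }
        }
        if name == before {
            break; // nothing left to strip
        }
    }
    name.to_string()
}
```

Unlike a short-circuit `.or_else` chain, the loop keeps stripping until the name is stable, so tar and prefix inputs for the same dataset produce the same output subdirectory.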

Efficiency fix: speclib_from_uri for remote URIs previously did
open_reader (full Bytes buffer in RAM) -> io::copy -> tempfile ->
Speclib::from_file (re-read from disk). Replace with
StorageProvider::get_to_file (one bounded streaming GET directly to
tempfile). Removes the in-memory buffer; same disk footprint.
…erwrite)

- timsquery: replace NoMS2DataError placeholder with LazyMaterializationUnsupported,
  CacheDisabled, InvalidCacheUrl; drop unused bincode/zstd deps.
- timscentroid: fix ErrorKind::Other deprecation (io::Error::other).
- tims_stage: add PayloadTooLarge variant + DEFAULT_IN_MEMORY_CAP/DEFAULT_UPLOAD_CAP
  size guards on open_reader/upload_file; new download.rs streaming helper; uri_exists
  HEAD probe; MinIO smoke test now #[ignore]'d + self-seeds fixture.
- timsseek_cli: proactive overwrite check probes every artifact (local Path::exists
  / remote HEAD) up-front before analysis, aborts with collision list; calib_lib
  PathBuf -> URI String; unify remote speclib load via download_to_file; ARTIFACT-LIST
  comments at writer sites keep list in sync with validate_inputs.
- timsseek: RunStatus::{Completed,Aborted} + abort_reason on RunReport so consumers
  can tell a partial run from a finished one.
…> 25MB)

Feature defaults:
- timsseek_cli and speclib_build_cli ship with [aws] by default so released
  binaries can read/write s3:// URIs out of the box. GCS/Azure stay opt-in.

Binary trim (timsseek 89.9 MB -> 25.1 MB, speclib_build 87.8 -> 14.2 MB):
- [profile.release] strip = "symbols" (~3-4 MB on macOS, more on Linux).
- arrow / rustyms default-features = false: drops csv/ipc/json from arrow
  (already unused) and mzdata/zeno/swash/ndarray from the rustyms tree.
- patch-crate mechanism replaces rustyms 0.11.0's 60 MB gnome.dat glycan
  ontology (unused here) with a panic pointing to the issue tracker. Diff
  lives at patches/rustyms+0.11.0.patch (1.7 KB). Workspace [patch.crates-io]
  points at target/patch/rustyms-0.11.0/ which `task setup` populates.
- `task setup` target installs patch-crate and applies the patch on fresh
  clones. CI runs the same step in rust.yml and release.yml before build.

Misc:
- Bump actions/checkout@v4 -> v6 in release.yml to match rust.yml.
- Gitignore scratch files: run_remote.bash, stage.bash, here.toml.
- Commit python/timsquery_pyo3/uv.lock for reproducible Python builds.
… patch

Use the jspaezp/cargo-patch-crate fork in `task setup` and CI instead of the
crates.io crate. Two wins:

1. Fork drops the cargo-as-library dep (`cargo = "0.96"`) that pulled
   libsqlite3-sys and conflicted with timsrust's rusqlite (both declared
   `links = "sqlite3"`). Install time drops from ~1-2 min (sqlite C compile)
   to ~15s.

2. Fork commit 13777f1 handles the bootstrap paradox — upstream's
   cargo_metadata subprocess fails on fresh checkouts because the workspace's
   [patch.crates-io] points at target/patch/rustyms-0.11.0/ which doesn't
   exist yet. The fork falls back to direct Cargo.toml parsing when metadata
   resolution fails.

Install command specifies `patch-crate` package because the fork added a
second binary crate (broken-patch-fixture) for regression testing; without
the name, `cargo install --git` errors with "multiple packages with
binaries found".

Also strip the binary `gnome.dat` deletion from the patch file. `git apply`
rejected "Binary files differ" without --binary delta, which broke
`task setup`. Only the ontology.rs text patch is needed: removing the
include_bytes!("databases/gnome.dat") call prevents the 60MB file from
being baked into the binary regardless of whether it lingers on disk in
target/patch/.

Verified end-to-end:
- cargo uninstall patch-crate && rm -rf target/patch/
- task setup  -> installs fork (13s), applies patch
- cargo build --release --bin timsseek  -> 25.1 MB
- Extract init_tracing() from run(), dropping run() from ~340 to ~260 lines.
  Instrumentation-feature flush guard is held via a TracingHandle struct
  whose lifetime spans the caller scope — Box<dyn Any> erases the
  tracing-profile guard type so the outer scope doesn't need the feature
  gate.
- Add deny_unknown_fields to every SpeclibBuildConfig sub-struct so typos
  in a speclib-build TOML error out instead of being silently ignored —
  matches timsseek_cli's config drift discipline.
- docs/development.md: first-time-setup section pointing at `task setup`
  so fresh clones know to run it before `cargo build`.
- timsseek_cli Cargo.toml: comment updated to reflect that build.rs
  automation remains impossible — the jspaezp/cargo-patch-crate fork fixes
  the tool's own bootstrap paradox but cargo still reads [patch.crates-io]
  before any build script runs.
`Path::exists()` and `File::open()` don't recognise `~`; paths pulled from
config files or unquoted CLI args kept the literal `~/` prefix, so
validate_inputs rejected real files with "Speclib file does not exist"
even though they were on disk.

Add `tims_stage::expand_local_uri(&str) -> String` which replaces `~/` /
bare `~` with `$HOME` for local paths, passes remote URIs through
unchanged. Apply at the top of validate_inputs to canonicalise speclib,
calib_lib, raw_inputs, and output_uri once — every downstream consumer
(existence probes, staging, uploads) sees the expanded form.

Unit test covers tilde-prefix, bare tilde, absolute path, relative path,
and remote URI pass-through.
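
A minimal sketch of that behavior, assuming a simplified remote-scheme list (the real `tims_stage::expand_local_uri` may differ in detail):

```rust
use std::env;

/// Sketch of expand_local_uri: replace a leading `~` with $HOME for local
/// paths; remote URIs pass through unchanged. Scheme list is assumed here.
fn expand_local_uri(uri: &str) -> String {
    if ["s3://", "gs://", "az://"].iter().any(|s| uri.starts_with(s)) {
        return uri.to_string(); // remote URIs are never tilde-expanded
    }
    let home = env::var("HOME").unwrap_or_default();
    if uri == "~" {
        home
    } else if let Some(rest) = uri.strip_prefix("~/") {
        format!("{}/{}", home, rest)
    } else {
        uri.to_string() // absolute and relative paths untouched
    }
}
```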
…output URI

Three user-visible CLI fixes.

1. Tracing routing. Before: vanilla `timsseek` invocations with no
   --log-path and no local output dir dumped all tracing spans to the
   terminal. Now: tracing always goes to a file unless `--log-path -`
   explicitly opts in to stderr. Resolution order:
     --log-path -                 -> stderr
     --log-path <p>               -> <p>
     local output.uri set         -> <output>/timsseek-<ts>.log
     remote/no output             -> ./timsseek-<ts>.log in CWD
   Timestamp (%Y%m%dT%H%M%S, local time via chrono::Local) prevents
   clobbering repeat runs in the same directory. chrono added as a
   direct dep on timsseek_cli (default-features = false, ["clock"]);
   already in the dep tree transitively via aws.

2. Per-file wall-clock footer. The batch loop now prints
     === [N/M] sample ===
     ... phase output ...
     Output: <dest>
     === [N/M] sample done in <duration> ===
   so users see total time per input even when multiple files are
   processed. Failure paths (upload abort, I/O error) print a matching
   failed footer.

3. Output URI. `run_pipeline` used to `println!("Output: ...")` with the
   local working path. For remote outputs, that path was a tempdir that
   got cleaned up immediately after upload — users saw `/tmp/...` for
   files actually landing at s3://... Now OutputSink exposes
   `dest_uri_for_sample` (final destination, not working path); main.rs
   per-file footer prints that URI. processing.rs drops the misleading
   Output: line. finalize_sample was also unified on the helper, pulling
   the `format!("{}/{}/{}", ...)` inline duplicate out.
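
The log-path resolution order in fix 1 can be sketched as a single match. `LogTarget` and `resolve_log_target` are invented names for this example, and the remote check is simplified to `s3://` only:

```rust
/// Illustrative sketch of the tracing-destination resolution order above.
#[derive(Debug, PartialEq)]
enum LogTarget {
    Stderr,
    File(String),
}

fn resolve_log_target(log_path: Option<&str>, output_uri: Option<&str>, ts: &str) -> LogTarget {
    match log_path {
        Some("-") => LogTarget::Stderr, // explicit opt-in to terminal output
        Some(p) => LogTarget::File(p.to_string()),
        None => match output_uri {
            // Local output dir: put the log next to the results.
            Some(out) if !out.starts_with("s3://") => {
                LogTarget::File(format!("{}/timsseek-{}.log", out, ts))
            }
            // Remote or missing output: fall back to the current directory.
            _ => LogTarget::File(format!("./timsseek-{}.log", ts)),
        },
    }
}
```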
…ript

Why: want to visualize where wall time is spent across phases, per file, and
per finalize stage. Existing text log has span context but no enter/close
events and no durations, so reconstructing a Gantt needed guessing.

Changes:
- enable tracing-subscriber "json" feature
- init_tracing: alongside the text log, open <log>.spans.jsonl as a JSON fmt
  layer with FmtSpan::NEW|CLOSE so every span emits an open + a close event
  carrying time.busy / time.idle; skipped when --log-path -
- add info_span!("file", name=<basename>) wrapping process_single_file so
  multi-file runs render as distinct lanes
- wrap run-report write + sink.finalize_run in TimedStep("Finalize run") so
  the tail of a run appears as its own span
- scripts/plot_spans.py: parse spans.jsonl, match NEW/CLOSE per
  (thread, name), draw Gantt by (thread, depth) lane, label with
  duration + busy%. --min-ms filters out sub-noise spans. Optional
  --wandb RUN_NAME uploads the PNG as a wandb.Image.

Rayon workers still show as roots (tracing current-span is thread-local).
Worker parent attach deferred until a specific hot path needs it.
Switch QueryItemToScore.digest and PeptideMetadata.digest from ProteinSlice
to Peptide (raw Arc<str> + Option<ParsedSequence> + decoy fields). Add
apply_parse_gate() that zeros parsed on every entry if any entry fails to
parse, sets Speclib.meta accordingly. Gate runs after all decoy generation
at both loader entry points (from_file, from_file_with_format). Add
test_parse_gate_off_on_poisoned_row and test_parse_gate_on_when_all_parsed.
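
The all-or-none gate can be sketched with simplified stand-in types (the real entries carry full `Peptide` / `ParsedSequence` data):

```rust
/// Sketch of the parse gate: if any entry failed to parse, clear `parsed`
/// on every entry so downstream feature vectors stay the same length.
#[allow(dead_code)]
struct Entry {
    raw: String,
    parsed: Option<Vec<char>>, // stand-in for Option<ParsedSequence>
}

fn apply_parse_gate(entries: &mut [Entry]) -> bool {
    let all_parsed = entries.iter().all(|e| e.parsed.is_some());
    if !all_parsed {
        for e in entries.iter_mut() {
            e.parsed = None; // one poisoned row zeros the whole speclib
        }
    }
    all_parsed // recorded as the parsable_sequences flag in Speclib.meta
}
```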
…equence

Replace `ScoringFields.sequence: String` with `peptide: Peptide`.
Builder setter clones the digest directly (no string conversion).
Add `Serialize` impl on `Peptide` (serializes as raw string) to satisfy
the `#[derive(Serialize)]` on `ScoringFields`. Parquet column name
kept as "sequence" for downstream compatibility; accessor swapped to
`r.scoring.peptide.as_str()`. processing.rs sort/dedup sites use
`.as_str().cmp()` / `==` comparisons.
Refactor CompetedCandidate::as_feature into paired named_features() ->
Vec<(f64, &'static str)> as a single source of truth for values and
names. Expose feature_names() for introspection. Append 22 dims
(peptide_length + 20 AA counts + peptide_n_mods) when
peptide.parsed.is_some(); the gate is speclib-wide so vector length is
stable within a fit. Add AA_COUNT_NAMES const aligned with
CANONICAL_AA_LETTERS output order. Four new feature_tests confirm
lengths match, gate-on/off presence, and delta is exactly 22.
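
The paired-name pattern can be illustrated with a reduced stand-in (the real `CompetedCandidate` carries far more features; names here are invented): one method yields `(value, name)` pairs, so names and values can never drift apart.

```rust
/// Reduced sketch of the single-source-of-truth feature pairing above.
struct Candidate {
    score: f64,
    peptide_length: Option<usize>, // None when the speclib-wide parse gate is off
}

impl Candidate {
    fn named_features(&self) -> Vec<(f64, &'static str)> {
        let mut out = vec![(self.score, "score")];
        // Sequence-derived features appear only when parsing succeeded
        // speclib-wide, so vector length is stable within one fit.
        if let Some(len) = self.peptide_length {
            out.push((len as f64, "peptide_length"));
        }
        out
    }

    fn feature_names(&self) -> Vec<&'static str> {
        self.named_features().into_iter().map(|(_, n)| n).collect()
    }
}
```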
Thread parsable_sequences: bool through ResultParquetWriter::new, write
it as a KeyValue entry in WriterProperties; pass speclib.meta.parsable_sequences
from execute_pipeline. Adds full read-back unit test via SerializedFileReader.
jspaezp added 2 commits April 23, 2026 14:09
After rescoring, compute per-fold feature means and Forust Gain importance
via CrossValidatedScorer::feature_stats(). rescore() now returns
(Vec<FinalResult>, RescoreFeatureStats). The CLI writes
results.feature_stats.json next to results.parquet by default; suppress
with --no-feature-stats.
@jspaezp changed the title from "Feat/s3 staging" to "v0.31.0: S3 staging + Peptide types + sequence-derived rescore features" on May 5, 2026
@jspaezp merged commit af901bd into main on May 5, 2026
3 checks passed