v0.31.0: S3 staging + Peptide types + sequence-derived rescore features#60
Merged
Conversation
Add StorageProvider::open (non-creating), head, exists, get_bytes, range_get (with short-read detection), list_capped, and get_to_file. Route write_bytes through object_store::buffered::BufWriter so single-PUT vs multipart upload is automatic. Add SerializationError::PrefixCapExceeded variant for capped listing errors. Foundation for the upcoming tims_stage crate.
Drive-by: import ProgressFinish in speclib_build_cli/pipeline.rs (missing since ad3c0c3 introduced the usage) to unblock the workspace build. `task fmt` run; includes whitespace-only updates to timsseek_cli/processing.rs.
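Short-read detection boils down to refusing to return fewer bytes than the caller asked for. A minimal sketch of the idea (helper name hypothetical; the real range_get works against object_store responses, not std::io):

```rust
use std::io::Read;

// Hypothetical helper illustrating short-read detection: a ranged read that
// yields fewer bytes than requested is an error, not a silent truncation.
fn read_exact_range<R: Read>(mut src: R, len: usize) -> std::io::Result<Vec<u8>> {
    let mut buf = vec![0u8; len];
    // read_exact fails with ErrorKind::UnexpectedEof on a short read
    src.read_exact(&mut buf)?;
    Ok(buf)
}
```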
Relocate TimedStep + LABEL_WIDTH from timsseek::scoring::timings into timscentroid so the upcoming tims_stage crate can reuse it without a layer inversion. timsseek's scoring::timings keeps its scoring-specific report types (ScoreTimings, PrescoreTimings, PipelineReport, etc.) and re-exports TimedStep from timscentroid so existing caller paths (timsseek::TimedStep, timsseek::scoring::TimedStep, timsseek::scoring::timings::TimedStep) all continue to work.
New workspace crate housing Layers 1 (resolution) and 2 (staging policy) from the S3 staging design. Empty module stubs + full StageError enum + redact_uri helper (strips query strings so presigned URLs don't leak into error messages). Depends on timscentroid for Layer-0 I/O primitives; no direct tokio/aws deps. Feature flags aws/gcp/azure pass through to timscentroid.
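The redact_uri idea can be sketched in a few lines — drop the query string so a presigned S3 URL's signature never reaches an error message (the crate's real helper may differ in detail):

```rust
// Strip everything after '?' so presigned-URL query strings
// (e.g. X-Amz-Signature) don't leak into user-visible errors.
fn redact_uri(uri: &str) -> String {
    match uri.split_once('?') {
        Some((base, _query)) => base.to_string(),
        None => uri.to_string(),
    }
}
```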
Fill the scaffolded modules:
- uri: parse_uri_shape (LocKind + NameKind), sidecar_of, split_uri (unified splitter used by every remote call site), canonical_uri (trailing-slash-stripped, case-sensitive for S3, canonicalize for local).
- resolve: Resolved enum (LocalIdx/RemoteIdx/LocalDotD/Stageable) + SourceSpec (S3Tar/S3Prefix/LocalTar) + resolve() decision tree. Sidecar probe HEADs `<sidecar_uri>/metadata.json` because the `.idx` is a directory, not a blob. Uses StorageProvider::open so resolution never materializes local directories as a side effect.
- open: open_reader(uri) returns Box<dyn Read + Send>. Remote URIs are buffered whole into Cursor<Bytes> to keep rayon-safety.
- upload: upload_file(local, dest_uri). Local -> local is direct; local -> remote routes through StorageProvider::new (BufWriter). Errors wrap transport failures with redact_uri() so presigned-URL query strings never leak into user-visible messages.
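A hypothetical sketch of what a unified splitter like split_uri does — peel the scheme, then separate bucket from key, with the trailing slash stripped as canonical_uri would (the exact signature and return type are assumptions):

```rust
// Split "scheme://bucket/key/..." into (scheme, bucket, key).
// Local paths have no scheme and return None.
fn split_uri(uri: &str) -> Option<(&str, &str, &str)> {
    let (scheme, rest) = uri.split_once("://")?;
    match rest.split_once('/') {
        Some((bucket, key)) => Some((scheme, bucket, key.trim_end_matches('/'))),
        None => Some((scheme, rest, "")),
    }
}
```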
…weep Layer 2 landed:
- StagingBackend trait + StagedDotD RAII handle + PerRunTempdir impl (namespaced timsseek-staging- tempdir prefix).
- run_on_staged free-function wrapper so callers can scope tempdirs via a closure; StagingBackend stays dyn-object-safe.
- TarReader trait with S3 (range-GET + 64 KiB prefetch) and local (File+seek+read_exact) impls. ustar walker is typeflag-permissive: regular files recorded, pax x/g skipped, GNU L/K hard-error. Early exit once both required basenames located.
- S3-prefix materializer: list_capped capped at max_prefix_keys, required-basename preflight, copy via get_to_file with per-file progress bar.
- Startup sweep in PerRunTempdir::new reclaims stale timsseek-staging-<id> tempdirs older than stale_sweep_age_hours. Honors .lock sentinel files written at materialization time so long-running concurrent stages aren't swept.
Tests: 31 lib (error + uri + resolve + backend + walker) + 2 local_tar_staging + 2 prefix_staging + 4 sweep (with filetime dev-dep for mtime backdating).
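The ustar walker only needs a couple of fields per 512-byte header block; the size field is 12 bytes of NUL/space-padded octal at offset 124. A standalone sketch of that parse (the real walker also handles names, typeflags, padding, and prefetch):

```rust
// Parse the octal size field (offsets 124..136) of a ustar header block.
fn ustar_entry_size(header: &[u8; 512]) -> Option<u64> {
    let field = std::str::from_utf8(&header[124..136]).ok()?;
    // The field is NUL- and/or space-terminated octal digits.
    let digits = field.trim_matches(|c: char| c == '\0' || c == ' ');
    u64::from_str_radix(digits, 8).ok()
}
```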
…ld_cli
Phase 10: timsquery gains `load_index(uri, backend, save_sidecar, centroid_cfg)` — the eager-path orchestrator. Dispatches on tims_stage::resolve: LocalIdx/RemoteIdx go straight through IndexedTimstofPeaks::load_from_storage; LocalDotD builds directly; Stageable stages via the backend, builds, and optionally writes the .idx sidecar via save_to_storage. Viewer's lazy path is untouched.
Phase 11: timsseek_cli config fields renamed with serde + clap aliases preserving back-compat:
- dotd_files -> raw_inputs (Vec<String>)
- speclib.path -> speclib.uri (String)
- output.directory -> output.uri (String)
New [staging] TOML block (tempdir_root, max_prefix_keys, save_sidecar, stale_sweep_age_hours). Main loop constructs one PerRunTempdir; each sample calls load_index(&raw_uri, &backend, save_sidecar, ...). get_frag_range now reads isolation geometry from the already-built index (index.fragmented_range()), so S3 inputs no longer need the .d post-load. OutputSink handles local vs remote outputs — writes into a tempdir and uploads per-sample via tims_stage::upload_file when the URI is s3://... Also routes config_used.json and run_report.json through finalize_run at the end of a batch. Speclib loading uses a small speclib_from_uri helper: local URIs open directly; remote URIs stream via tims_stage::open_reader into a named tempfile that preserves the extension (so format sniffing still works).
Phase 13: speclib_build_cli accepts URIs for --fasta, --peptide-list, and --output. FASTA downloads to a tempfile (ProteinSequenceCollection::from_fasta_file is path-only); the peptide list wraps open_reader in BufReader directly; output goes to a local tempfile, then upload_file when --output is remote.
Tests pass across the workspace (excluding the pyo3 bindings that already fail locally for unrelated FFI-link reasons).
Env-gated MinIO smoke test behind the aws feature (no-ops when MINIO_TEST_ENDPOINT is unset). README gets a compact S3-inputs section; docs/development.md documents the [staging] block + env vars + MinIO invocation.
Remove StagedDotD::borrowed, run_on_staged, and StageError::NotFound — none had a caller. The staging API is dyn-safe via StagingBackend alone; add ergonomic wrappers again if a concrete need shows up. Note on the final review's other blocking item (missing SIGINT/SIGTERM handler wired to ACTIVE_STAGE): this was intentionally dropped during the plan v2 rework in favor of the startup sweep. The spec still describes the handler; it needs updating to match the shipped design.
Move `extension-module` from an unconditional dep feature to an opt-in crate-level feature. Plain `cargo build -p timsquery_pyo3` (and `cargo test --workspace`) now links libpython the normal way — set PYO3_PYTHON to the uv venv python, e.g. `/Users/.../timsbuktoolkit/.venv/bin/python3`. Maturin keeps using the extension-module feature via pyproject's [tool.maturin] features list, so wheel builds are unchanged. Before this change, the unconditional `pyo3/extension-module` suppressed libpython linking unconditionally, which worked for maturin but left plain cargo builds with unresolved `Py_None` / `PyFloat_*` symbols at link time.
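The shape of the change, sketched (pyo3 version illustrative, not the workspace's actual pin):

```toml
# timsquery_pyo3/Cargo.toml — extension-module is now an opt-in crate feature
[features]
extension-module = ["pyo3/extension-module"]

[dependencies]
pyo3 = { version = "0.22" }  # no unconditional extension-module here

# pyproject.toml keeps wheel builds unchanged:
# [tool.maturin]
# features = ["extension-module"]
```

Plain `cargo build` links libpython normally; maturin opts into the feature for wheels.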
…example
Three wins measured against a 1.1 GB .idx and a 4 GB .d.tar on S3:
1. timscentroid: drop the serial cloud-load heuristic. The comment claimed serial was better for S3 (avoid connection overhead); benched otherwise — parallel reads saturate the uplink at ~19 MB/s vs ~7 MB/s serial. load_from_storage goes 154s -> 58s.
2. tims_stage::tar: add TarReader::copy_range_to_file, which streams one range-GET per payload into the local file. The previous loop issued a separate 4 MiB range-GET per chunk (500+ HTTP round-trips for a 2 GB analysis.tdf_bin). Tar stage goes 544s -> 120s and now beats S3 prefix staging (167s) since it avoids the list round-trip.
3. timscentroid: new StorageProvider::range_get_to_file — single bounded GET whose response body streams to disk, no in-memory buffering of the full range. Used by the new TarReader impl.
Also: aws/gcp/azure passthrough features on timsseek_cli and speclib_build_cli so release builds can link the AWS SDK via `cargo build --release --features aws -p timsseek_cli`. Adds rust/timsseek_cli/examples/load_bench.rs — reads LOAD_BENCH_URI, resolves, stages/loads, emits per-phase timings via tracing so each resolve() branch can be timed in isolation. Removes load_from_storage_serial (dead).
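The round-trip count in win 2 is just payload size over chunk size, which is where the "500+ HTTP round-trips" figure comes from:

```rust
// Why per-chunk range-GETs were slow: request count scales with payload size.
// A coalesced copy_range_to_file-style read needs exactly one ranged GET.
fn range_get_round_trips(payload_bytes: u64, chunk_bytes: u64) -> u64 {
    payload_bytes.div_ceil(chunk_bytes)
}
```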
…m_uri bug
Three review agents flagged overlapping issues. Consolidate:
- New crate::common with REQUIRED_DOTD_FILES, make_bar, transport_err. Removes byte-identical make_bar + two REQUIRED constants + ~8 copies of the Transport error wrap across tar.rs and prefix.rs.
- Single pub tims_stage::is_remote_uri. Removes four copies of the three-scheme check (timsseek_cli main.rs, speclib_build_cli pipeline.rs, tims_stage open.rs, tims_stage upload.rs) and the inline form in timsquery index_serde.rs and resolve.rs.
- Downgrade leaky pubs in tims_stage (UriShape, LocKind, NameKind, parse_uri_shape, redact_uri, TarReader, S3TarReader, LocalTarReader) to pub(crate). They're internal plumbing; nothing outside the crate uses them.
Correctness fix: sample_name_from_uri previously used a short-circuit `.or_else` that stripped only one suffix. `sample.d.tar` -> strip `.d` fails -> strip `.tar` succeeds -> returns `sample.d` (with the `.d`). `sample.d/` -> strip `.d` succeeds -> returns `sample`. Result: the same dataset under tar vs prefix inputs produced DIFFERENT output subdir names (`sample.d/` vs `sample/`), visible in the PR3 smoke logs. Fix: loop-strip known suffixes; all four inputs now collapse to `sample`. Add 4 unit tests covering each shape.
Efficiency fix: speclib_from_uri for remote URIs previously did open_reader (full Bytes buffer in RAM) -> io::copy -> tempfile -> Speclib::from_file (re-read from disk). Replace with StorageProvider::get_to_file (one bounded streaming GET directly to tempfile). Removes the in-memory buffer; same disk footprint.
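The loop-strip fix can be sketched like this (suffix set inferred from the bug description; the crate's real helper may strip more):

```rust
// Loop-strip known suffixes so "sample.d.tar", "sample.d/", and "sample"
// all collapse to the same name. The old short-circuit `.or_else` stripped
// at most one suffix, leaving "sample.d" for tar inputs.
fn sample_name_from_uri(uri: &str) -> String {
    let base = uri.trim_end_matches('/').rsplit('/').next().unwrap_or(uri);
    let mut name = base;
    loop {
        let mut stripped = false;
        for suffix in [".tar", ".d"] {
            if let Some(rest) = name.strip_suffix(suffix) {
                name = rest;
                stripped = true;
            }
        }
        if !stripped {
            break;
        }
    }
    name.to_string()
}
```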
…erwrite)
- timsquery: replace NoMS2DataError placeholder with LazyMaterializationUnsupported,
CacheDisabled, InvalidCacheUrl; drop unused bincode/zstd deps.
- timscentroid: fix ErrorKind::Other deprecation (io::Error::other).
- tims_stage: add PayloadTooLarge variant + DEFAULT_IN_MEMORY_CAP/DEFAULT_UPLOAD_CAP
size guards on open_reader/upload_file; new download.rs streaming helper; uri_exists
HEAD probe; MinIO smoke test now #[ignore]'d + self-seeds fixture.
- timsseek_cli: proactive overwrite check probes every artifact (local Path::exists
/ remote HEAD) up-front before analysis, aborts with collision list; calib_lib
PathBuf -> URI String; unify remote speclib load via download_to_file; ARTIFACT-LIST
comments at writer sites keep list in sync with validate_inputs.
- timsseek: RunStatus::{Completed,Aborted} + abort_reason on RunReport so consumers
can tell a partial run from a finished one.
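A minimal sketch of the report-status addition (field shapes assumed, serde derives omitted):

```rust
// Consumers can distinguish a partial run from a finished one.
#[derive(Debug, PartialEq)]
enum RunStatus {
    Completed,
    Aborted,
}

struct RunReport {
    status: RunStatus,
    // Populated only when status == Aborted, e.g. the overwrite-collision list.
    abort_reason: Option<String>,
}
```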
…> 25MB)
Feature defaults:
- timsseek_cli and speclib_build_cli ship with [aws] by default so released binaries can read/write s3:// URIs out of the box. GCS/Azure stay opt-in.
Binary trim (timsseek 89.9 MB -> 25.1 MB, speclib_build 87.8 -> 14.2 MB):
- [profile.release] strip = "symbols" (~3-4 MB on macOS, more on Linux).
- arrow / rustyms default-features = false: drops csv/ipc/json from arrow (already unused) and mzdata/zeno/swash/ndarray from the rustyms tree.
- patch-crate mechanism replaces rustyms 0.11.0's 60 MB gnome.dat glycan ontology (unused here) with a panic pointing to the issue tracker. Diff lives at patches/rustyms+0.11.0.patch (1.7 KB). Workspace [patch.crates-io] points at target/patch/rustyms-0.11.0/, which `task setup` populates.
- `task setup` target installs patch-crate and applies the patch on fresh clones. CI runs the same step in rust.yml and release.yml before build.
Misc:
- Bump actions/checkout@v4 -> v6 in release.yml to match rust.yml.
- Gitignore scratch files: run_remote.bash, stage.bash, here.toml.
- Commit python/timsquery_pyo3/uv.lock for reproducible Python builds.
… patch
Use the jspaezp/cargo-patch-crate fork in `task setup` and CI instead of the
crates.io crate. Two wins:
1. Fork drops the cargo-as-library dep (`cargo = "0.96"`) that pulled
libsqlite3-sys and conflicted with timsrust's rusqlite (both declared
`links = "sqlite3"`). Install time drops from ~1-2 min (sqlite C compile)
to ~15s.
2. Fork commit 13777f1 handles the bootstrap paradox — upstream's
cargo_metadata subprocess fails on fresh checkouts because the workspace's
[patch.crates-io] points at target/patch/rustyms-0.11.0/ which doesn't
exist yet. The fork falls back to direct Cargo.toml parsing when metadata
resolution fails.
Install command specifies `patch-crate` package because the fork added a
second binary crate (broken-patch-fixture) for regression testing; without
the name, `cargo install --git` errors with "multiple packages with
binaries found".
Also strip the binary `gnome.dat` deletion from the patch file. `git apply`
rejected "Binary files differ" without --binary delta, which broke
`task setup`. Only the ontology.rs text patch is needed: removing the
include_bytes!("databases/gnome.dat") call prevents the 60MB file from
being baked into the binary regardless of whether it lingers on disk in
target/patch/.
Verified end-to-end:
- cargo uninstall patch-crate && rm -rf target/patch/
- task setup -> installs fork (13s), applies patch
- cargo build --release --bin timsseek -> 25.1 MB
- Extract init_tracing() from run(), dropping run() from ~340 to ~260 lines. The instrumentation-feature flush guard is held via a TracingHandle struct whose lifetime spans the caller scope — Box<dyn Any> erases the tracing-profile guard type so the outer scope doesn't need the feature gate.
- Add deny_unknown_fields to every SpeclibBuildConfig sub-struct so typos in a speclib-build TOML error out instead of being silently ignored — matches timsseek_cli's config drift discipline.
- docs/development.md: first-time-setup section pointing at `task setup` so fresh clones know to run it before `cargo build`.
- timsseek_cli Cargo.toml: comment updated to reflect that build.rs automation remains impossible — the jspaezp/cargo-patch-crate fork fixes the tool's own bootstrap paradox, but cargo still reads [patch.crates-io] before any build script runs.
`Path::exists()` and `File::open()` don't recognise `~`; paths pulled from config files or unquoted CLI args kept the literal `~/` prefix, so validate_inputs rejected real files with "Speclib file does not exist" even though they were on disk.
Add `tims_stage::expand_local_uri(&str) -> String`, which replaces `~/` / bare `~` with `$HOME` for local paths and passes remote URIs through unchanged. Apply at the top of validate_inputs to canonicalise speclib, calib_lib, raw_inputs, and output_uri once — every downstream consumer (existence probes, staging, uploads) sees the expanded form.
Unit test covers tilde-prefix, bare tilde, absolute path, relative path, and remote URI pass-through.
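A sketch of the expansion logic. The real helper reads `$HOME` itself; `home` is a parameter here so the sketch stays testable:

```rust
// Expand "~/..." and bare "~" for local paths; remote URIs pass through.
fn expand_local_uri(uri: &str, home: &str) -> String {
    if uri.contains("://") {
        return uri.to_string(); // remote schemes (s3://, gs://, az://) untouched
    }
    if uri == "~" {
        return home.to_string();
    }
    if let Some(rest) = uri.strip_prefix("~/") {
        return format!("{home}/{rest}");
    }
    uri.to_string() // absolute and relative paths unchanged
}
```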
…output URI
Three user-visible CLI fixes.
1. Tracing routing. Before: vanilla `timsseek` invocations with no
--log-path and no local output dir dumped all tracing spans to the
terminal. Now: tracing always goes to a file unless `--log-path -`
explicitly opts in to stderr. Resolution order:
--log-path - -> stderr
--log-path <p> -> <p>
local output.uri set -> <output>/timsseek-<ts>.log
remote/no output -> ./timsseek-<ts>.log in CWD
Timestamp (%Y%m%dT%H%M%S, local time via chrono::Local) prevents
clobbering repeat runs in the same directory. chrono added as a
direct dep on timsseek_cli (default-features = false, ["clock"]);
already in the dep tree transitively via aws.
2. Per-file wall-clock footer. The batch loop now prints
=== [N/M] sample ===
... phase output ...
Output: <dest>
=== [N/M] sample done in <duration> ===
so users see total time per input even when multiple files are
processed. Failure paths (upload abort, I/O error) print a matching
failed footer.
3. Output URI. `run_pipeline` used to `println!("Output: ...")` with the
local working path. For remote outputs, that path was a tempdir that
got cleaned up immediately after upload — users saw `/tmp/...` for
files actually landing at s3://... Now OutputSink exposes
`dest_uri_for_sample` (final destination, not working path); main.rs
per-file footer prints that URI. processing.rs drops the misleading
Output: line. finalize_sample was also unified on the helper, pulling
the `format!("{}/{}/{}", ...)` inline duplicate out.
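Item 1's resolution order can be sketched as a pure function (name and signature hypothetical; the real code wires this into init_tracing):

```rust
// Log-destination resolution: explicit --log-path wins ("-" means stderr),
// then a local output dir, then the current working directory.
fn resolve_log_target(
    log_path: Option<&str>,
    local_output_dir: Option<&str>,
    ts: &str, // e.g. chrono::Local formatted as %Y%m%dT%H%M%S
) -> String {
    match (log_path, local_output_dir) {
        (Some("-"), _) => "<stderr>".to_string(),
        (Some(p), _) => p.to_string(),
        (None, Some(dir)) => format!("{dir}/timsseek-{ts}.log"),
        (None, None) => format!("./timsseek-{ts}.log"),
    }
}
```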
…ript
Why: want to visualize where wall time is spent across phases, per file, and
per finalize stage. Existing text log has span context but no enter/close
events and no durations, so reconstructing a Gantt needed guessing.
Changes:
- enable tracing-subscriber "json" feature
- init_tracing: alongside the text log, open <log>.spans.jsonl as a JSON fmt
layer with FmtSpan::NEW|CLOSE so every span emits an open + a close event
carrying time.busy / time.idle; skipped when --log-path -
- add info_span!("file", name=<basename>) wrapping process_single_file so
multi-file runs render as distinct lanes
- wrap run-report write + sink.finalize_run in TimedStep("Finalize run") so
the tail of a run appears as its own span
- scripts/plot_spans.py: parse spans.jsonl, match NEW/CLOSE per
(thread, name), draw Gantt by (thread, depth) lane, label with
duration + busy%. --min-ms filters out sub-noise spans. Optional
--wandb RUN_NAME uploads the PNG as a wandb.Image.
Rayon workers still show as roots (tracing current-span is thread-local).
Worker parent attach deferred until a specific hot path needs it.
Switch QueryItemToScore.digest and PeptideMetadata.digest from ProteinSlice to Peptide (raw Arc<str> + Option<ParsedSequence> + decoy fields). Add apply_parse_gate(), which zeros `parsed` on every entry if any entry fails to parse and sets Speclib.meta accordingly. The gate runs after all decoy generation at both loader entry points (from_file, from_file_with_format). Add test_parse_gate_off_on_poisoned_row and test_parse_gate_on_when_all_parsed.
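The gate's all-or-nothing behavior, sketched with a stand-in entry type (the real field is an Option<ParsedSequence> on Peptide):

```rust
// Stand-in for a speclib entry carrying an optionally-parsed sequence.
struct Entry {
    parsed: Option<String>,
}

// If any entry failed to parse, zero out every parsed sequence so
// downstream feature vectors stay uniform length across the speclib.
// The return value becomes Speclib.meta's parsable_sequences flag.
fn apply_parse_gate(entries: &mut [Entry]) -> bool {
    let all_parsed = entries.iter().all(|e| e.parsed.is_some());
    if !all_parsed {
        for e in entries.iter_mut() {
            e.parsed = None;
        }
    }
    all_parsed
}
```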
…equence Replace `ScoringFields.sequence: String` with `peptide: Peptide`. Builder setter clones the digest directly (no string conversion). Add `Serialize` impl on `Peptide` (serializes as raw string) to satisfy the `#[derive(Serialize)]` on `ScoringFields`. Parquet column name kept as "sequence" for downstream compatibility; accessor swapped to `r.scoring.peptide.as_str()`. processing.rs sort/dedup sites use `.as_str().cmp()` / `==` comparisons.
Refactor CompetedCandidate::as_feature into paired named_features() -> Vec<(f64, &'static str)> as a single source of truth for values and names. Expose feature_names() for introspection. Append 22 dims (peptide_length + 20 AA counts + peptide_n_mods) when peptide.parsed.is_some(); the gate is speclib-wide so vector length is stable within a fit. Add AA_COUNT_NAMES const aligned with CANONICAL_AA_LETTERS output order. Four new feature_tests confirm lengths match, gate-on/off presence, and delta is exactly 22.
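The 22 appended dims, sketched. The alphabetical one-letter-code order here is an assumption — the real order follows CANONICAL_AA_LETTERS, with names from AA_COUNT_NAMES:

```rust
// Assumed ordering for illustration; the real code uses CANONICAL_AA_LETTERS.
const CANONICAL_AA: [char; 20] = [
    'A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L',
    'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y',
];

// peptide_length + 20 residue counts + peptide_n_mods = 22 dims.
fn sequence_features(seq: &str, n_mods: usize) -> Vec<f64> {
    let mut feats = Vec::with_capacity(22);
    feats.push(seq.len() as f64); // peptide_length
    for aa in CANONICAL_AA {
        feats.push(seq.chars().filter(|c| *c == aa).count() as f64);
    }
    feats.push(n_mods as f64); // peptide_n_mods
    feats
}
```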
Thread parsable_sequences: bool through ResultParquetWriter::new, write it as a KeyValue entry in WriterProperties; pass speclib.meta.parsable_sequences from execute_pipeline. Adds full read-back unit test via SerializedFileReader.
After rescoring, compute per-fold feature means and Forust Gain importance via CrossValidatedScorer::feature_stats(). rescore() now returns (Vec<FinalResult>, RescoreFeatureStats). The CLI writes results.feature_stats.json next to results.parquet by default; suppress with --no-feature-stats.
Summary
Three concurrent threads landing as v0.31.0:

S3 staging (tims_stage crate)
- `tims_stage` crate: URI resolution, `open_reader`, `upload_file`, staging backend, tar walker, prefix materializer, sweep.
- `timsseek_cli` and `speclib_build_cli` accept S3 URIs for inputs and outputs.
- `TimedStep` moved down a layer into `timscentroid`.
- S3 load-performance tuning (`load_bench` example).
- `aws` is now a default cargo feature.

Peptide / ParsedSequence types
- `Peptide` + `ParsedSequence`; `parse_sequence` bridges rustyms. `normalize_to_proforma` handles DIA-NN / short-form strings (shared).
- `ScoringFields` carries `Peptide`, not `String`.
- `modified_peptide` for `Peptide.raw`.
- `parsable_sequences` flag emitted into parquet metadata.
- `DigestSlice` -> `ProteinSlice` rename; dead decoy helpers / `NonReversedDecoy` removed.

Feature stats + tracing
- `init_tracing` split; tracing to file by default, per-file timing, real output URI.
- `scripts/plot_spans.py` for Gantt plots.
- `~` expansion in local URIs before filesystem probes.

Release / build
- `0.30.0` -> `0.31.0` (Cargo + pyproject).
- `gnome.dat` stub via patch-crate (jspaezp fork), `strip = "symbols"`, `arrow` default features dropped.
- `timsquery_pyo3` extension-module behind a cargo feature (fixes downstream linkage).

Test plan
- `cargo build` (default + `--no-default-features`) green on HEAD `b71e54a`.
- `v0.31.0` after merge.