feat(client): resumable single-node (non-merkle) uploads (per-wave cached proofs)#88
grumbach wants to merge 5 commits into
Conversation
… from cached per-wave proofs)

Mirrors PR WithAutonomi#84's merkle-resume design for the regular payment path: persist each wave's per-chunk PaymentProof bytes to disk after batch_pay confirms, before the wave's PUT phase. If the upload dies mid-file (network flake, slow close-K, client crash, Ctrl-C), the next attempt loads the cached proofs and skips quote + pay for any chunk whose address matches the current encryption.

- Storage layout: <data_dir>/payments/single/<ts>_<file_hash> via the new ant-core::data::client::cached_single module. The subdirectory keeps single-node and merkle caches from colliding on filename.
- Receipts expire after 24 h to match QUOTE_MAX_AGE_SECS in ant-node.
- Threaded a resume_key: Option<&str> through batch_upload_chunks_with_events and upload_waves_single, so callers that don't have a stable file identity (the direct batch API at the public surface) opt out by passing None.
- Wave-level append rather than a full rewrite per chunk, so the cost of caching scales with chunk count, not chunk count squared.
- Failure-mode tolerance: every save/load/delete is wrapped in a try_* that logs but never bubbles up; a busted cache never blocks a real upload.

Asked by Nic on the PR WithAutonomi#84 follow-up.
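The resume flow described above boils down to a partition step at the top of each wave: chunks whose address already has a cached proof skip quote + pay, the rest go through the normal payment path. A minimal sketch, with stand-in types (`String` addresses for XorName, opaque `Vec<u8>` proof bytes) rather than the crate's real ones:

```rust
use std::collections::HashMap;

/// Split a wave's chunk addresses into "already paid" (covered by a cached
/// proof from an earlier attempt) and "needs payment" (new chunk, or the
/// encryption changed the address). Hypothetical shape, not the real API.
fn partition_wave<'a>(
    wave: &'a [String],
    cached: &'a HashMap<String, Vec<u8>>,
) -> (Vec<(&'a String, &'a Vec<u8>)>, Vec<&'a String>) {
    let mut paid = Vec::new();
    let mut needs_payment = Vec::new();
    for addr in wave {
        match cached.get(addr) {
            // Proof on disk from an earlier attempt: skip quote + pay.
            Some(proof) => paid.push((addr, proof)),
            // No cached proof: this chunk goes through batch_pay.
            None => needs_payment.push(addr),
        }
    }
    (paid, needs_payment)
}
```

Because the lookup is by address, a re-encryption that changes any chunk's address naturally falls into the "needs payment" bucket, which matches the "address matches the current encryption" condition above.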
Adversarial review of WithAutonomi#88 found a dozen money-loss paths in the single-node resume cache. This is the consolidated fix.

Atomicity & concurrency
- write_receipt_atomic: <path>.tmp + BufWriter::into_inner check + flush + sync_all + rename + parent-dir fsync. Replaces the prior truncate-then-write that lost paid waves on crash or concurrent CLI.
- ReceiptLock: per-key fs2 exclusive lock on a .lock sidecar guards append/drop/delete. Two concurrent ant-file-upload invocations on the same path now serialize at the receipt boundary instead of last-writer-wins on the proof set.
- recover_orphaned_tmps: under the lock, recover the newest readable <ts>_<key>.tmp left by a crash between sync_all and rename. Unlink older .tmp siblings (their content is a subset by the load-extend-write invariant) and any corrupt .tmp.
- dedupe_canonical_receipts: pick the newest canonical by ts-prefix and unlink older siblings. Prevents the non-deterministic resolution that first-match iteration would yield.

Stale-proof handling (no remote text trust)
- prune_locally_expired_proofs in batch.rs decodes each cached PaymentProof and drops entries whose quote.timestamp is past the storer-side QUOTE_MAX_AGE_SECS budget minus a 5-minute safety margin. Replaces the prior substring match on storer error text, which a Byzantine peer could spoof to force double-payment.
- drop_proofs_for_file now takes &[(addr, expected_bytes)] and does compare-and-swap on the on-disk bytes. A concurrent re-pay's fresh proof is never clobbered by a stale prune list computed earlier.

Schema, cost, and key stability
- SingleNodeReceipt gains a version: u8 field. read_receipt rejects versions above SCHEMA_VERSION = 1 as unreadable.
- storage_cost_atto now sums as ant_protocol::evm::Amount (U256) instead of u128, so very large uploads don't silently saturate.
- find_existing's unreadable arm unlinks the corrupt file instead of letting it occupy the directory for 24 h.
- file.rs canonicalizes the cache key (with a display-string fallback) so ./foo and /abs/foo hit the same receipt.
- batch.rs seeds total_storage/total_gas from the loaded receipt so the returned tally reflects the this-file total, not just the freshly paid amount.
- delete_for_file also unlinks matching .tmp residue so a future upload of the same path can't resurrect a deleted receipt.

Tests
- 24 new tests in cached_single + 5 in batch covering atomicity, lock exclusion (2/8 reproducible regression without the lock), tmp recovery + dedupe, CAS-on-bytes drop, schema-version rejection, cost-overflow safety, canonical dedupe, unreadable auto-unlink, and concurrent drop+append consistency.
- proof_is_safely_fresh tests for the local pre-flight expiry check.

Verification: 280 lib tests pass, clippy -D warnings clean, fmt clean, release build clean.
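The write_receipt_atomic pattern listed above (tmp file + fsync + rename + parent-dir fsync) can be sketched with only the standard library. This is an assumed shape, not the actual implementation; the real function additionally checks `BufWriter::into_inner` and takes the receipt lock:

```rust
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

/// Crash-safe replace-on-write: readers never observe a torn or empty file,
/// because the canonical path only ever flips between complete contents.
fn write_atomic(path: &Path, bytes: &[u8]) -> std::io::Result<()> {
    let tmp = path.with_extension("tmp");
    {
        let mut f = File::create(&tmp)?;
        f.write_all(bytes)?;
        f.flush()?;
        // Force the data to stable storage before the rename makes it visible.
        f.sync_all()?;
    }
    // Atomic on POSIX filesystems: old content -> new content, never partial.
    fs::rename(&tmp, path)?;
    // Persist the directory entry too, so a crash right after the rename
    // can't roll it back (best-effort; not all platforms allow dir opens).
    if let Some(parent) = path.parent() {
        if let Ok(dir) = File::open(parent) {
            let _ = dir.sync_all();
        }
    }
    Ok(())
}
```

A crash between `sync_all` and `rename` leaves a readable `.tmp` behind, which is exactly the orphan that recover_orphaned_tmps picks up.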
CI uses Rust 1.95, which promotes clippy::unnecessary_sort_by to deny; local was 1.94 and didn't fire. Rewrite the two descending-sort calls in recover_orphaned_tmps and dedupe_canonical_receipts to use sort_by_key + std::cmp::Reverse (the lint's suggested form).
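The lint-clean form looks like this; `ts` stands in for the timestamp prefixes the two functions sort on:

```rust
use std::cmp::Reverse;

/// Descending sort via sort_by_key + Reverse, the form the
/// clippy::unnecessary_sort_by lint suggests in place of
/// sort_by(|a, b| b.cmp(a)).
fn newest_first(ts: &mut Vec<u64>) {
    ts.sort_by_key(|&t| Reverse(t));
}
```

Both spellings sort identically; the lint prefers the key-based one because it can't accidentally invert only one side of the comparison.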
Code review

TL;DR: Strong, well-tested PR. The on-disk cache module is essentially production-grade: atomic writes, advisory locking, CAS-on-drop for TOCTOU, schema versioning, orphaned-tmp recovery, dedup of canonicals, U256 cumulative cost summing. Tests cover the actual failure modes (concurrent appends, torn …)

Correctness verification I ran …
Things I like
Questions / things to confirm before merge
Q1. Storer acceptance of a stale proof against a re-quoted close group.
Q2. Sync I/O on the tokio runtime.

Smaller things
What I didn't verify
Verdict
Approve with the question on storer acceptance answered. Two follow-ups worth filing:
…review

1. Receipt filename TTL was keyed on the FIRST wave's timestamp. A wave paid at T0+23h50m got dropped wholesale at T0+24h via cleanup_outdated even though the late wave's proof was only 10 minutes old. Fix: rotate the canonical filename to <now>_<key> on every successful append_wave, so the on-disk TTL tracks "time since most recent paid wave" instead of "time since first wave". The receipt survives as long as it keeps being used; stale individual proofs are still pruned by the per-quote.timestamp check in batch.rs.
2. proof_is_safely_fresh rejected ANY future-dated quote, but the ant-node verifier tolerates up to QUOTE_FUTURE_SKEW_TOLERANCE_SECS = 300 s of forward skew. A client clock 60 s slow would prune perfectly fresh proofs and force re-payment. Fix: mirror the 300 s tolerance in CACHED_PROOF_FUTURE_SKEW_TOLERANCE_SECS and thread max_future_skew through proof_is_safely_fresh.
3. file_hash_key used std::collections::hash_map::DefaultHasher, whose output is explicitly NOT stable across rustc releases. User pays on binary A, upgrades within 24 h, retries on binary B, cache miss, re-pay. Fix: BLAKE3 of the canonical path string, truncated to 128 bits. Adds blake3 = "1" to ant-core.
4. dedupe_canonical_receipts was content-blind: it unlinked older siblings purely by timestamp without merging their proofs. Pre-existing duplicates from buggier earlier binaries or manual file recovery could hold proofs only in the older sibling; blind unlink stranded those payments. Fix: union all readable siblings' proofs into the newest, sum costs, take min(first_pay_timestamp), atomically rewrite the winner, then unlink the rest.

3 new tests:
- file_hash_key_uses_stable_digest_across_invocations pins the expected BLAKE3 digest so a future regression to a non-stable hash fails loudly.
- append_wave_rotates_filename_so_late_waves_dont_age_out verifies the canonical filename's timestamp tracks the latest wave (not the first) and that proofs survive the rotation.
- duplicate_canonical_receipts_are_merged_then_older_unlinked replaces the prior lossy test; asserts the older sibling's proof is merged into the winner and costs are summed correctly.

Verification: 283 lib tests pass (was 280; +3), clippy -D warnings clean, fmt clean, release build clean.
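Fix 2 above (tolerating bounded forward clock skew) is a small pure predicate. A sketch under the constants named in the commit text; the real proof_is_safely_fresh decodes the quote and threads max_future_skew as a parameter, which is simplified away here:

```rust
// Budget constants taken from the surrounding text (assumed values).
const QUOTE_MAX_AGE_SECS: u64 = 24 * 60 * 60; // storer-side TTL
const SAFETY_MARGIN_SECS: u64 = 5 * 60; // local safety margin
const CACHED_PROOF_FUTURE_SKEW_TOLERANCE_SECS: u64 = 300; // verifier skew

/// A cached proof is "safely fresh" if its quote timestamp is neither
/// older than the storer budget minus the margin, nor further in the
/// future than the verifier's forward-skew tolerance.
fn proof_is_safely_fresh(quote_ts: u64, now: u64) -> bool {
    if quote_ts > now {
        // Future-dated quote: tolerate bounded forward skew, so a slightly
        // slow client clock doesn't prune fresh proofs and force re-payment.
        return quote_ts - now <= CACHED_PROOF_FUTURE_SKEW_TOLERANCE_SECS;
    }
    // Past-dated: must still be inside the storer's budget, minus margin.
    now - quote_ts < QUOTE_MAX_AGE_SECS - SAFETY_MARGIN_SECS
}
```

Rejecting at budget-minus-margin rather than at the budget itself means the client never presents a proof the storer is about to refuse anyway.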
Summary
Mirrors PR #84 (resumable merkle) for the regular (single-node, non-merkle) payment path. Persists each wave's per-chunk `PaymentProof` bytes to disk after `batch_pay` confirms, before the wave's PUT phase. If the upload dies mid-file (network flake, slow close-K, client crash, Ctrl-C), the next attempt loads the cached proofs and skips quote + pay for any chunk whose address matches the current encryption.

Asked by Nic on the PR #84 follow-up: "is this already in for regular payments?" The answer was no; this PR fixes that.
Why
Single-node uploads break the file into payment waves. Each wave is one EVM transaction that produces N per-chunk payment proofs. Before this change, those proofs lived only in process memory. A partial-upload failure meant every wave already paid for was unrecoverable — the user had to re-quote and re-pay every chunk on the next attempt.
Live merkle uploads on prod last week burned ~2.78 ANT on a single failed 730 MB attempt because the merkle path had the same disease; PR #84 fixed it. This PR closes the parallel hole for regular uploads, which is the path most files (<64 chunks) use.
Design
New `ant-core/src/data/client/cached_single` module:

- `try_append_wave(file_path, new_proofs, storage_cost, gas_cost)`: called once per successfully paid wave, before the PUT phase. Adds the wave's `(addr, proof_bytes)` entries to the on-disk receipt and updates cumulative cost figures. The whole receipt is rewritten on each append (bounded by chunk count).
- `try_load_for_file(file_path)`: called once at the top of the upload. Returns the cached receipt (if any).
- `try_delete_for_file(file_path)`: called after full-file success.
- `cleanup_outdated()`: opportunistic GC on every load.

Storage layout:
`<data_dir>/payments/single/<ts>_<file_hash>`. The `single/` subdirectory keeps this cache from colliding on filename with `cached_merkle` (`<data_dir>/payments/`).

Expiry: 24 h, matching `QUOTE_MAX_AGE_SECS` in `ant-node`. After that, storers reject the cached proof even if the file is otherwise resumable, so keeping the cache wouldn't help.

The wire-up:
- `batch_upload_chunks_with_events` gains a `resume_key: Option<&str>` parameter. `None` (used by direct callers without a stable file identity) opts out of caching.
- Chunks are split into "already paid" (`PaidChunk` built from cached proof + freshly-quoted peers) and "needs payment" (sent to `batch_pay`). Cached chunks bypass the EVM entirely.
- After `batch_pay`, the newly paid wave's proofs are appended to the on-disk receipt.
- `file.rs::upload_with_options` deletes the cache after full-file success on both the merkle-fallback path and the regular path.

Failure-mode tolerance
All public-facing API (the `try_*` variants) swallows IO and serialization errors with a `warn!` log. A busted cache directory must never prevent a real upload from running. At worst, the user re-pays.

Tests
- `file_hash_key` stability, expired/fresh filename detection, malformed filename safety, multi-wave roundtrip save → load → delete with cumulative cost summing.
- `cargo build --release --bin ant` clean.

What this does NOT do
- Does not skip the PUT for already-stored chunks: storers answer `AlreadyExists` cheaply, but the PUT request still goes over the wire. A future iteration could persist a `stored: HashSet<XorName>` alongside the receipt.
- Does not extend resume past the `QUOTE_MAX_AGE_SECS` window. Cache files past that are GC'd automatically.

Note on caller surface
`Client::batch_upload_chunks` keeps its existing signature (no `resume_key`): the public direct-batch API doesn't have a stable file identity to key on, so it stays opted out by default. File-level callers (`upload_with_options`) pass the file path string as the key.
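The caller-surface split above, together with the canonicalized cache key from the follow-up commit, can be sketched as two tiny helpers. Names are illustrative, not the crate's:

```rust
use std::path::Path;

/// Derive the resume key a file-level caller would pass: the canonicalized
/// path string, falling back to the display string if canonicalization
/// fails (e.g. the file was deleted between waves). This makes ./foo and
/// /abs/foo hit the same receipt.
fn resume_key_for(path: &Path) -> String {
    path.canonicalize()
        .map(|p| p.display().to_string())
        .unwrap_or_else(|_| path.display().to_string())
}

/// Direct batch callers pass None and the receipt cache is skipped entirely.
fn cache_enabled(resume_key: Option<&str>) -> bool {
    resume_key.is_some()
}
```

Keeping the key an `Option<&str>` rather than a boolean flag lets the same parameter carry both the opt-in decision and the identity to key the receipt on.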