Skip to content

Split chunk GET timeout from store timeout#78

Merged
jacderida merged 7 commits into
mainfrom
fix/stability-improvements
May 13, 2026
Merged

Split chunk GET timeout from store timeout#78
jacderida merged 7 commits into
mainfrom
fix/stability-improvements

Conversation

@mickvandijke
Copy link
Copy Markdown
Contributor

Summary

  • Add an independent chunk_get_timeout_secs client config value and hidden --chunk-get-timeout-secs CLI flag.
  • Keep store_timeout_secs scoped to chunk store/PUT operations while preserving default config behavior unless a CLI override is provided.
  • Reduce the non-Merkle chunk PUT store-response timeout from 30 seconds to 10 seconds.

Testing

  • Not run; PR opened from the existing committed branch state.

@mickvandijke mickvandijke force-pushed the fix/stability-improvements branch 3 times, most recently from 8bcdc8b to cdb0945 Compare May 10, 2026 20:35
Copy link
Copy Markdown
Contributor

@grumbach grumbach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving the split — independent chunk_get_timeout_secs is the right shape and removes the implicit coupling between PUT and GET timeouts.

One thing to flag: the non-merkle STORE_RESPONSE_TIMEOUT drop from 30s to 10s in this PR is fine for the non-merkle path (fail fast + retry, as the description says), but the merkle path uses the same store_timeout_secs config and ends up bound by it.

I've just opened #83 which bumps the default store_timeout_secs to 270s = storer-side CLOSENESS_LOOKUP_TIMEOUT (240s) + 30s padding. That's what made merkle uploads pass at scale on a 210-node 7-region testnet (0/22 -> 8/8). The two PRs touch the same constant — order matters.

Either:

  • merge #78 first; my #83 rebases on top and bumps the default further.
  • merge #83 first; #78 rebases on top, keeps its non-merkle hardcoded value, picks up my new default elsewhere.

Approving this one, will reconcile #83 once one of us has landed. This is #5 in the merge chain (transport -> core -> protocol -> node -> client).

grumbach added a commit to grumbach/ant-client that referenced this pull request May 12, 2026
…anged

Adversarial review of the previous bulk timeout bump (270s for everyone) flagged that the chunk GET path at chunk.rs:296 also reads
store_timeout_secs. Bumping the shared field to 270s silently changed GET behavior too, which was not the intent.

This commit:
- Introduces a dedicated DEFAULT_MERKLE_STORE_TIMEOUT_SECS = 270 const
- Adds merkle_store_timeout_secs: u64 to ClientConfig (default 270)
- Routes only the merkle PUT path (store_response_timeout_for_proof) to the new field
- Leaves DEFAULT_STORE_TIMEOUT_SECS at 10 (matches current main behavior); the chunk GET path keeps reading store_timeout_secs unchanged
- Updates doc comments to be honest about what each knob actually governs (store_timeout_secs now governs only the GET path and any direct readers, not non-merkle PUTs which use the STORE_RESPONSE_TIMEOUT const)
- Strengthens the regression test to pin the invariant that non-merkle proof tags ignore the merkle timeout value

Coordinates with Mick's PR WithAutonomi#78 which adds a dedicated chunk_get_timeout_secs field. After both land, the three timeout regions
(merkle PUT / non-merkle PUT / GET) will be cleanly separated.
grumbach added a commit to grumbach/ant-client that referenced this pull request May 12, 2026
… limit

foundryup curls api.github.com to resolve the nightly tag. Anonymous
calls are rate-limited at 60/hour shared per IP; macOS runners hit this
regularly and fail every E2E and Merkle E2E job with
`curl: (56) ... 403`.

Passing the workflow's GITHUB_TOKEN authenticates the call, raising the
cap to 1000/hour per token. Same fix Mick's PR WithAutonomi#78 will want.
jacderida pushed a commit that referenced this pull request May 13, 2026
…anged

Adversarial review of the previous bulk timeout bump (270s for everyone) flagged that the chunk GET path at chunk.rs:296 also reads
store_timeout_secs. Bumping the shared field to 270s silently changed GET behavior too, which was not the intent.

This commit:
- Introduces a dedicated DEFAULT_MERKLE_STORE_TIMEOUT_SECS = 270 const
- Adds merkle_store_timeout_secs: u64 to ClientConfig (default 270)
- Routes only the merkle PUT path (store_response_timeout_for_proof) to the new field
- Leaves DEFAULT_STORE_TIMEOUT_SECS at 10 (matches current main behavior); the chunk GET path keeps reading store_timeout_secs unchanged
- Updates doc comments to be honest about what each knob actually governs (store_timeout_secs now governs only the GET path and any direct readers, not non-merkle PUTs which use the STORE_RESPONSE_TIMEOUT const)
- Strengthens the regression test to pin the invariant that non-merkle proof tags ignore the merkle timeout value

Coordinates with Mick's PR #78 which adds a dedicated chunk_get_timeout_secs field. After both land, the three timeout regions
(merkle PUT / non-merkle PUT / GET) will be cleanly separated.
jacderida pushed a commit that referenced this pull request May 13, 2026
… limit

foundryup curls api.github.com to resolve the nightly tag. Anonymous
calls are rate-limited at 60/hour shared per IP; macOS runners hit this
regularly and fail every E2E and Merkle E2E job with
`curl: (56) ... 403`.

Passing the workflow's GITHUB_TOKEN authenticates the call, raising the
cap to 1000/hour per token. Same fix Mick's PR #78 will want.
mickvandijke and others added 6 commits May 13, 2026 01:28
PUTs and GETs have different payload directions and performance
profiles, so a single shared timeout was a poor fit. Adds
`chunk_get_timeout_secs` to `ClientConfig` (default 10s) and a
matching `--chunk-get-timeout-secs` CLI flag, while keeping
`store_timeout_secs` for the PUT path. Also bumps the non-merkle
store-response timeout from 5s to 10s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`chunk_put_with_proof` was deep-copying every chunk's content into a
fresh `Vec<u8>` via `Bytes::to_vec()` before stuffing it into a
`ChunkPutRequest`. Since `chunk_put_to_close_group` spawns one of these
per recipient (CLOSE_GROUP_MAJORITY ≈ 5) and the AIMD store controller
caps concurrent chunks at 64, peak in-flight chunk content reached
~64 × 5 × 4 MB = 2.5 GB just from these copies — enough to OOM-kill
`ant` on a 4 GB client VM partway through a 300 MB-or-larger upload.

Heaptrack against a clean-exit 20 MB upload (release `ant` on a 4 GB
VM) showed 285 MB peak in `alloc::raw_vec::RawVecInner::finish_grow`,
with 168 MB of that consumed via `ChunkMessage::encode` →
`chunk_put_to_close_group` — i.e. the per-recipient copy + encode loop.

ant-protocol now holds `ChunkPutRequest::content` as `bytes::Bytes`
(see jacderida/ant-protocol#1), so the caller
can pass the refcounted `Bytes` straight through. Each peer's spawned
task now shares the single 4 MB backing buffer instead of holding an
independent copy. Wire format is unchanged.

The patch.crates-io stanza pins ant-protocol to the perf branch commit
so the build resolves the Bytes-typed field. The pin should be removed
once a crates.io release containing the protocol change is published.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…anged

Adversarial review of the previous bulk timeout bump (270s for everyone) flagged that the chunk GET path at chunk.rs:296 also reads
store_timeout_secs. Bumping the shared field to 270s silently changed GET behavior too, which was not the intent.

This commit:
- Introduces a dedicated DEFAULT_MERKLE_STORE_TIMEOUT_SECS = 270 const
- Adds merkle_store_timeout_secs: u64 to ClientConfig (default 270)
- Routes only the merkle PUT path (store_response_timeout_for_proof) to the new field
- Leaves DEFAULT_STORE_TIMEOUT_SECS at 10 (matches current main behavior); the chunk GET path keeps reading store_timeout_secs unchanged
- Updates doc comments to be honest about what each knob actually governs (store_timeout_secs now governs only the GET path and any direct readers, not non-merkle PUTs which use the STORE_RESPONSE_TIMEOUT const)
- Strengthens the regression test to pin the invariant that non-merkle proof tags ignore the merkle timeout value

Coordinates with Mick's PR #78 which adds a dedicated chunk_get_timeout_secs field. After both land, the three timeout regions
(merkle PUT / non-merkle PUT / GET) will be cleanly separated.
When a merkle batch upload fails partway through (network flake, slow
close-K, client crash), the on-chain payment is lost but the proofs
needed to re-attempt the store are lost too — the user has to pay
again from scratch.

This change persists the MerkleBatchPaymentResult to disk
immediately after the on-chain payment confirms, then re-loads it on
the next upload of the same file path. The cache is keyed by a hash
of the source path; a successful upload deletes the cache, a partial
failure leaves it for the next attempt to pick up. Files older than
the on-chain payment expiration (7 days) are GC'd opportunistically.

The library handles save/load/delete transparently — no CLI flag and
no app-level change needed. If the cached receipt doesn't match the
current file content (file edited between attempts), the cache is
discarded and the user pays fresh.

Foundation laid by adding Serialize/Deserialize to
MerkleBatchPaymentResult and threading the on-chain payment
timestamp through.

The new module also handles its own failure modes defensively: any
IO/serialization error is logged but never bubbled up to break the
upload itself. Cache misses are silent.
@jacderida jacderida force-pushed the fix/stability-improvements branch from a92b3ae to f03b91b Compare May 13, 2026 00:51
`MerkleBatchPaymentResult.proofs` is keyed by `[u8; 32]`, which JSON
cannot represent as a map key — `serde_json::to_writer` errors with
"key must be a string", causing the roundtrip_save_load_delete unit
test to fail. Switch to rmp-serde (already a dep) for save/load.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jacderida jacderida merged commit 91d5f18 into main May 13, 2026
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants