perf(weave): defer output _save_nested_objects in client#6740

Draft
gtarpenning wants to merge 4 commits into master from gtarpenning/save-nested-defer-structural

Conversation


@gtarpenning gtarpenning commented Apr 30, 2026

TL;DR

Cuts per-op calling-thread CPU ~5× in async tracing (p50 stall 176ms → 38ms in heavy-GIL bench) by deferring output-side _save_nested_objects + map_to_refs + to_json from finish_call into the worker pool. A structural fix to _save_object_basic / _save_table keeps a naive defer from deadlocking on thread-pool starvation.

Builds on #6738 (the conservative subset). Linked: WB-33844.

How the work moves off the calling thread

master:
   ┌─ calling thread ──────────────────────────┐    ┌─ worker ──────────┐
   │ finish_call:                              │    │ send_end_call:    │
   │   _save_nested_objects(output)   <-- walk │    │   redact          │
   │   map_to_refs(output)            <-- walk │    │   to_json         │
   │   defer(send_end_call)  ──────────────────┼──► │   server.call_end │
   │   return                                  │    └───────────────────┘
   └───────────────────────────────────────────┘

this PR:
   ┌─ calling thread ──────────┐    ┌─ worker: schedule_send ─────────────┐
   │ finish_call:              │    │   _save_nested_objects(output)      │
   │   defer(schedule_send) ───┼──► │   map_to_refs(output)               │
   │   return                  │    │   pending = collect_pending_digests │
   └───────────────────────────┘    │   if pending:                       │
                                    │     then(pending, finalize_send) ───┼─► continuation
                                    │     # worker returns to pool        │   runs after
                                    │   else:                             │   digests resolve
                                    │     finalize_send(...)              │
                                    └─────────────────────────────────────┘
                                          │
                                          ▼ finalize_send: redact + to_json + server.call_end

What changes

  1. Output walks deferred in finish_call. _save_nested_objects, map_to_refs, redact, and to_json all move into the worker pool. call.output is still set synchronously, so user code reading it after finish_call returns is unchanged.
  2. Save sites gate on pending digests. _save_object_basic, _save_table, and the chunked-table paths now go through one _defer_after_pending_digests helper that pre-walks the value and uses then(pending, send) instead of defer(send). The worker registering the callback returns to the pool instead of blocking inside to_json -> digest.result().
  3. send_end_call split into Phase 1 (schedule_send — walks + collect_pending) and Phase 2 (finalize_send — redact + to_json + call_end). A barrier end_complete_future is tracked in the executor so _flush() waits for the whole pipeline; log_exception=True keeps the same "Task failed" log line existing resilience tests assert on.
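The gate-or-defer dispatch in (2) and the phase split in (3) can be sketched with stdlib futures. This is a minimal illustration, not the weave code: `then` stands in for `future_executor.then`, and `schedule_send`/`finalize_send` have placeholder bodies.

```python
import threading
from concurrent.futures import Future


def then(pending: list[Future], fn) -> Future:
    """Run fn() once every future in `pending` resolves, via callbacks.

    The registering thread returns immediately; no worker blocks in result().
    """
    out: Future = Future()
    remaining = [len(pending)]
    lock = threading.Lock()

    def on_done(_f: Future) -> None:
        with lock:
            remaining[0] -= 1
            last = remaining[0] == 0
        if last:
            try:
                out.set_result(fn())
            except Exception as e:
                out.set_exception(e)

    for f in pending:
        f.add_done_callback(on_done)
    return out


def finalize_send(output) -> str:
    # Phase 2 stand-in: redact + to_json + server.call_end would happen here.
    return f"call_end({output!r})"


def schedule_send(output, pending: list[Future]) -> Future:
    # Phase 1 stand-in: walks have run; gate on unresolved digest futures.
    if pending:
        # Register a continuation and return the worker to the pool.
        return then(pending, lambda: finalize_send(output))
    done: Future = Future()
    done.set_result(finalize_send(output))
    return done
```

The key property is that the thread calling `schedule_send` never blocks on a digest future; it either attaches a callback or finalizes immediately.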

Calling-thread contract (unchanged)

  • User postprocess_output and _on_finish_handler still raise on the calling thread.
  • call.output is set to the postprocessed value before finish_call returns.
  • call.summary is computed and set before finish_call returns.
  • Production worker-pool defaults unchanged.

Risks

  • Refs attach asynchronously after finish_call returns. Today (master) refs attach before finish_call returns. User code that synchronously inspects output.ref immediately after finish_call may briefly see no ref attached. Behavior change; not expected to affect any current pattern but called out.
  • Larger surface than #6738 (perf(weave): reduce per-op overhead in async tracing). Touches _save_object_basic, _save_table, both chunked-table paths, and finish_call.
  • _track_future is private. Used deliberately to register the barrier future; will need an update if the futures module evolves.

Bench

Heavy GIL pressure (1460 ops/step, 30 steps). Median per-step is the stable comparison; avg is dominated by occasional GC stalls under heavy GIL.

| Branch                | Median step | p50 stall | Max stall |
|-----------------------|------------:|----------:|----------:|
| Master                |       ~1.0s |     176ms |    1319ms |
| Internal-only (#6738) |       ~1.0s |     170ms |     649ms |
| This PR               |       ~0.4s |      38ms |      ~3s* |

* Max under heavy GIL is dominated by occasional consecutive bad steps (a 9.54s outlier appears once in the 30-step run, then the system recovers). Median/p50 are the meaningful numbers — calling-thread time per op is cut roughly 5×.
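The outlier sensitivity of the average is easy to check with the figures quoted above (illustrative step times, not the actual bench samples):

```python
import statistics

# 29 typical ~0.4s steps plus the single 9.54s outlier mentioned above
steps = [0.4] * 29 + [9.54]

mean = statistics.mean(steps)      # dragged up by the one bad step
median = statistics.median(steps)  # unaffected by it

print(round(mean, 3), median)  # the mean lands well above the median
```

One outlier in thirty steps moves the mean by roughly 0.3s while the median stays at the typical step time, which is why the median is the stable comparison here.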

Tests

tests/trace/test_save_nested_defer.py (16 tests):

  • 12 unit tests for _collect_pending_digests covering primitives, nested containers, ObjectRef/OpRef/TableRef (resolved + pending), TableRef with both digest and row_digests pending, ObjectRecord, cycles, and resolved-future short-circuit.
  • 4 integration tests against a paused-server fixture: single-op finish doesn't deadlock; 20 concurrent async finishes drain via the then-chain; a real call_end failure surfaces as "Task failed" via _track_future (regression guard for log_exception=True); finish_call returns before the deferred work runs.

Existing suites green:

  • Local: trace shard (839 passed), trace_calls_complete_only shard (839 passed).
  • Local: test_evaluation_performance 5× consecutive at default parallelism (the deadlock fixture; would hang without the structural gate).
  • Local: test_tracing_resilience (29/29).

Test plan

  • trace and trace_calls_complete_only shards green
  • Deadlock fixture (test_evaluation_performance) green at default parallelism, 5× consecutive
  • Bench under heavy GIL, 30 steps. p50 stall 38ms vs master 176ms.
  • CI sweep across trace, trace_calls_complete_only, integrations, Windows
  • Hand to MRE author David Corbitt for prod repro

Related

Deep dive: why naive defer deadlocks (and how the gate fixes it)

Part 1 alone — just deferring the output walks, no gate — is a one-line change but reproduces a real deadlock against test_evaluation_performance::paused_client:

_save_object_basic's send_obj_create calls to_json(parent_with_refs), which synchronously waits on each nested ref's digest_future.result(). Under high concurrency every main-pool worker can land in to_json at once, each blocked on a child whose obj_create is queued behind it in the same executor — classic thread-pool starvation. Master happens to escape by ~1 worker of timing headroom; the deferral consumes it.

Part 2 — this PR — restructures the dispatch so workers don't synchronously block on digest futures. At every save site that calls to_json(value_with_refs), we pre-walk value for unresolved digest futures via _collect_pending_digests. If any are pending, we register the save via future_executor.then(pending, send) instead of defer(send). The worker that wanted to schedule the save attaches a callback and returns to the pool; the continuation runs once digests resolve, by which point to_json is non-blocking. No worker holds a slot waiting for another worker.

_collect_pending_digests walks any value (primitives, dicts, lists, tuples, ObjectRecord, Refs) and collects unresolved digest / row-digest futures. Cycle-safe via seen set, skips done futures, treats Refs as leaves (does not recurse into their fields).
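A walker of that shape might look like the following sketch. `FakeRef` is a stand-in invented here; real weave Refs hold their digest futures differently, and the real helper handles ObjectRecord as well.

```python
from concurrent.futures import Future


class FakeRef:
    """Stand-in for a weave Ref carrying optional digest futures."""

    def __init__(self, digest_future=None, row_digests_future=None):
        self.digest_future = digest_future
        self.row_digests_future = row_digests_future


def collect_pending_digests(val, seen=None) -> list[Future]:
    """Return unresolved digest futures reachable from val.

    Cycle-safe via a seen set, skips already-done futures, and treats
    Refs as leaves (does not recurse into their fields).
    """
    if seen is None:
        seen = set()
    if id(val) in seen:
        return []
    seen.add(id(val))

    pending: list[Future] = []
    if isinstance(val, FakeRef):
        for f in (val.digest_future, val.row_digests_future):
            if isinstance(f, Future) and not f.done():
                pending.append(f)
        return pending  # leaf: do not recurse into the ref's fields
    if isinstance(val, dict):
        for v in val.values():
            pending.extend(collect_pending_digests(v, seen))
    elif isinstance(val, (list, tuple)):
        for v in val:
            pending.extend(collect_pending_digests(v, seen))
    return pending
```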


Two-part change. Together they cut per-op overhead on the calling thread
(asyncio loop in async workloads) without introducing the latent thread
pool starvation that an earlier draft (#6738 history) tripped on.

## Part 1 — defer the output tree walks into the worker

`_save_nested_objects(output)` and `map_to_refs(output)` were running on the
calling thread before `finish_call` returned. For ops with nested weave
Objects/Tables in the output, this dominates per-op CPU on the asyncio
loop. They are now deferred into the existing `send_end_call` future.
`call.output` is still set sync so user code reading it immediately after
`finish_call` returns still observes the postprocessed value (refs may
attach a moment later).

## Part 2 — structural digest gating in _save_object_basic / _save_table

Without this, deferring the walks reproduces a real deadlock:
`_save_object_basic` does `defer(send_obj_create)`, where `send_obj_create`
calls `to_json(parent_with_refs)`. `to_json` synchronously waits on each
nested ref's `digest_future.result()`. Under high concurrency every main-pool
worker can land in `to_json` at once, each blocked on a child whose
`obj_create` is queued *behind* it in the same executor — classic thread
pool starvation. Master happens to escape by ~1 worker of timing headroom;
the deferral consumes it. `test_evaluation_performance::paused_client`
makes this deterministic.
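The starvation pattern can be reproduced in miniature with a one-worker pool. This is a sketch, not the weave code: `parent_save` stands in for the worker blocking inside `to_json`, and `child_save` for the `obj_create` queued behind it.

```python
from concurrent.futures import Future, ThreadPoolExecutor, TimeoutError

pool = ThreadPoolExecutor(max_workers=1)
child_digest: Future = Future()


def parent_save():
    # Stands in for to_json() blocking on a nested ref's digest_future.
    return child_digest.result(timeout=0.5)


def child_save():
    # Would resolve the digest -- but it is queued behind parent_save in
    # the same one-worker pool, so it cannot run while parent_save waits.
    child_digest.set_result("abc123")


parent = pool.submit(parent_save)  # occupies the only worker slot
child = pool.submit(child_save)    # queued behind it: starvation
# parent times out; only then does the worker free up and run child_save
```

With more workers the race usually resolves, which is why master "happens to escape"; the gate removes the blocking wait entirely instead of relying on headroom.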

Fix: add `_collect_pending_digests(val)` which walks `val` and returns any
unresolved digest/row-digest futures held by nested Refs. Then in the save
sites:

- `_save_object_basic`: after `map_to_refs`, collect pending digests; if any,
  schedule `send_obj_create` via `future_executor.then(pending, ...)` instead
  of `defer(...)`. The worker registers a callback and returns to the pool;
  the continuation fires once digests resolve, by which point `to_json` is
  non-blocking. Same idea applied to `_save_table` (simple path),
  `_create_table_with_parallel_chunks` (per-chunk), and the first chunk of
  `_create_table_with_incremental_updates`.

- `finish_call`: split `send_end_call` into Phase 1 (`schedule_send`: walks
  output, builds `output_as_refs`, collects pending digests) and Phase 2
  (`finalize_send`: redact + `to_json` + `server.call_end`). If there are
  pending digests, Phase 2 is chained via `then(pending, finalize_send)`;
  the Phase 1 worker returns immediately. A barrier `end_complete_future`
  is tracked in the executor's active set (`_track_future`) so `_flush()`
  waits for the full pipeline, and `on_end_complete` fires exactly once
  on actual call_end completion.
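The barrier-tracking idea can be sketched as follows; `track_future` and `flush` are illustrative stand-ins for the executor's `_track_future` and the client's `_flush`, whose real internals may differ.

```python
import threading
from concurrent.futures import Future, wait

# Mimics the executor's active set of in-flight futures.
tracked: set = set()
tracked_lock = threading.Lock()


def track_future(f: Future) -> Future:
    """Register a future so flush() waits on it, and untrack it on completion."""
    with tracked_lock:
        tracked.add(f)

    def untrack(fut: Future) -> None:
        with tracked_lock:
            tracked.discard(fut)

    f.add_done_callback(untrack)
    return f


def flush() -> None:
    # Wait until every tracked future -- including continuations registered
    # after their parent worker already returned -- has completed.
    while True:
        with tracked_lock:
            current = [f for f in tracked if not f.done()]
        if not current:
            return
        wait(current)
```

Because `end_complete_future` is tracked before Phase 1 returns, `flush` cannot race past a pipeline whose Phase 2 continuation has not yet been scheduled.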

## Tests

`tests/trace/test_save_nested_defer.py`:
- 12 unit tests for `_collect_pending_digests` covering primitives,
  containers, ObjectRef/OpRef/TableRef (with and without resolved digests
  and row_digests), nested ObjectRecord, cycles, dict/list/tuple walks.
- 4 integration tests against a paused-server fixture: single op finish
  doesn't deadlock; 20 concurrent async finishes drain via the then-chain;
  call_end errors still log (regression guard for `log_exception=True`);
  `finish_call` returns before the deferred work runs.

Existing suites pass with no modification:
- Local: `trace` shard (839 passed, 0 failed).
- Local: `trace_calls_complete_only` shard (839 passed, 0 failed).
- Local: `test_evaluation_performance::test_evaluation_performance` 5x in
  a row at default parallelism (the deadlock fixture; would hang without
  the structural gate).
- Local: `test_evaluation_performance::test_evaluation_resilience` —
  exception logging path still produces 4 unique error log keys
  (file_create, obj_create, table_create, call_end), not 3.
- Local: `test_tracing_resilience.py` (29/29 passed).

## Bench (heavy GIL, 1460 ops/step, 30-step median per-step)

| Branch                          | Median step | p50 stall | Notes |
|---------------------------------|------------:|----------:|-------|
| Master                          |       ~1.0s |     176ms | baseline |
| Internal-only (no defer, #5/6/7)|       ~1.0s |     170ms | #6738 |
| **This PR (defer + gating)**    |       ~0.4s |      38ms | |

The headline win is `p50 stall = 38ms` — the calling thread spends
~5x less time in tracing per op. Step-time average is dominated by
occasional GC / mempool stalls in the 30-step long run; median is the
stable comparison.
Drop unused `noqa: E731` (replaced lambda with named def) and apply ruff
format to the new test file.
@codecov

codecov Bot commented Apr 30, 2026

Codecov Report

❌ Patch coverage is 90.27778% with 7 lines in your changes missing coverage. Please review.

| Files with missing lines    | Patch % | Lines                   |
|-----------------------------|--------:|-------------------------|
| weave/trace/weave_client.py |  90.27% | 5 Missing and 2 partials ⚠️ |


gtarpenning and others added 2 commits April 30, 2026 09:22
mypy 1.19.1 can't infer the lambda's signature in
`_create_table_with_parallel_chunks` because it's passed positionally to
`then(...)` whose `g` is a generic `Callable[[list[T]], U]`. Named def
with explicit annotations fixes the "Cannot infer type of lambda"
diagnostic.
- Extract _defer_after_pending_digests helper to remove the four-place
  duplication of the pending-digest dispatch in _save_object_basic,
  _save_table, and the chunked-table paths.
- Narrow `except BaseException` to `except Exception` in finalize_send
  and schedule_send so KeyboardInterrupt/SystemExit propagate normally.
- Hoist `import asyncio`/`import logging` to the top of the test file.
- Replace the smoke-test-only test_call_end_error_is_logged_once with a
  real regression guard: inject a call_end failure and assert the
  "Task failed" log lands; mark with disable_logging_error_check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gtarpenning gtarpenning changed the title feat(weave): defer output _save_nested_objects with structural digest gating perf(weave): defer output _save_nested_objects with structural digest gating Apr 30, 2026
@gtarpenning gtarpenning changed the title perf(weave): defer output _save_nested_objects with structural digest gating perf(weave): defer output _save_nested_objects Apr 30, 2026
@gtarpenning gtarpenning changed the title perf(weave): defer output _save_nested_objects perf(weave): defer output _save_nested_objects in client Apr 30, 2026