perf(weave): defer output `_save_nested_objects` in client (#6740)
Draft
gtarpenning wants to merge 4 commits into master from
Conversation
Two-part change. Together they cut per-op overhead on the calling thread (the asyncio loop in async workloads) without introducing the latent thread-pool starvation that an earlier draft (#6738 history) tripped on.

## Part 1 — defer the output tree walks into the worker

`_save_nested_objects(output)` and `map_to_refs(output)` were running on the calling thread before `finish_call` returned. For ops with nested weave Objects/Tables in the output, this dominates per-op CPU on the asyncio loop. Both are now deferred into the existing `send_end_call` future. `call.output` is still set synchronously, so user code reading it immediately after `finish_call` returns still observes the postprocessed value (refs may attach a moment later).

## Part 2 — structural digest gating in `_save_object_basic` / `_save_table`

Without this, deferring the walks reproduces a real deadlock: `_save_object_basic` does `defer(send_obj_create)`, where `send_obj_create` calls `to_json(parent_with_refs)`, and `to_json` synchronously waits on each nested ref's `digest_future.result()`. Under high concurrency every main-pool worker can land in `to_json` at once, each blocked on a child whose `obj_create` is queued *behind* it in the same executor — classic thread-pool starvation. Master happens to escape by ~1 worker of timing headroom; the deferral consumes it. `test_evaluation_performance::paused_client` makes this deterministic.

Fix: add `_collect_pending_digests(val)`, which walks `val` and returns any unresolved digest/row-digest futures held by nested Refs. Then, in the save sites:

- `_save_object_basic`: after `map_to_refs`, collect pending digests; if any, schedule `send_obj_create` via `future_executor.then(pending, ...)` instead of `defer(...)`. The worker registers a callback and returns to the pool; the continuation fires once digests resolve, by which point `to_json` is non-blocking.
  Same idea applied to `_save_table` (simple path), `_create_table_with_parallel_chunks` (per-chunk), and the first chunk of `_create_table_with_incremental_updates`.
- `finish_call`: split `send_end_call` into Phase 1 (`schedule_send`: walks output, builds `output_as_refs`, collects pending digests) and Phase 2 (`finalize_send`: redact + `to_json` + `server.call_end`). If there are pending digests, Phase 2 is chained via `then(pending, finalize_send)`; the Phase 1 worker returns immediately. A barrier `end_complete_future` is tracked in the executor's active set (`_track_future`) so `_flush()` waits for the full pipeline, and `on_end_complete` fires exactly once on actual call_end completion.

## Tests

`tests/trace/test_save_nested_defer.py`:

- 12 unit tests for `_collect_pending_digests` covering primitives, containers, ObjectRef/OpRef/TableRef (with and without resolved digests and row_digests), nested ObjectRecord, cycles, and dict/list/tuple walks.
- 4 integration tests against a paused-server fixture: a single op finish doesn't deadlock; 20 concurrent async finishes drain via the then-chain; call_end errors still log (regression guard for `log_exception=True`); `finish_call` returns before the deferred work runs.

Existing suites pass with no modification:

- Local: `trace` shard (839 passed, 0 failed).
- Local: `trace_calls_complete_only` shard (839 passed, 0 failed).
- Local: `test_evaluation_performance::test_evaluation_performance` 5x in a row at default parallelism (the deadlock fixture; would hang without the structural gate).
- Local: `test_evaluation_performance::test_evaluation_resilience` — the exception-logging path still produces 4 unique error log keys (file_create, obj_create, table_create, call_end), not 3.
- Local: `test_tracing_resilience.py` (29/29 passed).
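The structural walk described in Part 2 can be sketched as a standalone toy. `PendingRef` and `collect_pending_digests` below are illustrative stand-ins for weave's Ref types and `_collect_pending_digests`, not the actual implementation:

```python
# Toy version of the pending-digest walk (illustrative names, not weave code).
from concurrent.futures import Future
from typing import Any, Optional

class PendingRef:
    """Stand-in for a weave Ref that carries a digest future."""
    def __init__(self, digest_future: Future) -> None:
        self.digest_future = digest_future

def collect_pending_digests(val: Any, seen: Optional[set] = None) -> list:
    """Return every unresolved digest future reachable from `val`.

    Cycle-safe via the `seen` id-set; refs are treated as leaves and
    already-resolved futures are skipped.
    """
    if seen is None:
        seen = set()
    if id(val) in seen:
        return []
    seen.add(id(val))
    if isinstance(val, PendingRef):
        return [] if val.digest_future.done() else [val.digest_future]
    if isinstance(val, dict):
        return [f for v in val.values() for f in collect_pending_digests(v, seen)]
    if isinstance(val, (list, tuple)):
        return [f for v in val for f in collect_pending_digests(v, seen)]
    return []  # primitives contribute nothing

# One resolved ref, one pending ref, nested in containers:
resolved: Future = Future()
resolved.set_result("abc123")
pending: Future = Future()
value = {"a": [PendingRef(resolved)], "b": (PendingRef(pending), 1)}
print(len(collect_pending_digests(value)))  # 1 -> only the unresolved future
```

The done-future short-circuit is what keeps the fast path fast: a value whose digests have all resolved yields an empty list and can be dispatched with a plain `defer`.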
## Bench (heavy GIL, 1460 ops/step, 30-step median per-step)

| Branch                           | Median step | p50 stall | Notes    |
|----------------------------------|------------:|----------:|----------|
| Master                           | ~1.0s       | 176ms     | baseline |
| Internal-only (no defer, #5/6/7) | ~1.0s       | 170ms     | #6738    |
| **This PR (defer + gating)**     | ~0.4s       | 38ms      |          |

The headline win is `p50 stall = 38ms` — the calling thread spends ~5x less time in tracing per op. The step-time average is dominated by occasional GC / mempool stalls in the 30-step run; the median is the stable comparison.
Drop unused `noqa: E731` (replaced lambda with named def) and apply ruff format to the new test file.
mypy 1.19.1 can't infer the lambda's signature in `_create_table_with_parallel_chunks` because it's passed positionally to `then(...)` whose `g` is a generic `Callable[[list[T]], U]`. Named def with explicit annotations fixes the "Cannot infer type of lambda" diagnostic.
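The pattern can be illustrated with a toy generic `then`; this is a simplified stand-in for weave's future executor, not its real signature:

```python
# Toy reproduction of the inference issue: mypy cannot infer a lambda passed
# positionally to a generic callback parameter, but a named def with explicit
# annotations type-checks cleanly. `then` here is illustrative only.
from concurrent.futures import Future
from typing import Callable, TypeVar

T = TypeVar("T")
U = TypeVar("U")

def then(futures: list[Future], g: Callable[[list[T]], U]) -> U:
    # Toy semantics: resolve synchronously and invoke the continuation.
    return g([f.result() for f in futures])

# mypy reports "Cannot infer type of lambda" for then(futs, lambda rs: len(rs)).
# A named def with explicit annotations resolves the diagnostic:
def count_results(results: list[str]) -> int:
    return len(results)

f: Future = Future()
f.set_result("digest")
print(then([f], count_results))  # 1
```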
- Extract a `_defer_after_pending_digests` helper to remove the four-place duplication of the pending-digest dispatch in `_save_object_basic`, `_save_table`, and the chunked-table paths.
- Narrow `except BaseException` to `except Exception` in `finalize_send` and `schedule_send` so KeyboardInterrupt/SystemExit propagate normally.
- Hoist `import asyncio` / `import logging` to the top of the test file.
- Replace the smoke-test-only `test_call_end_error_is_logged_once` with a real regression guard: inject a call_end failure and assert the "Task failed" log lands; mark with `disable_logging_error_check`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
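The pending-digest dispatch that the helper centralizes can be sketched like this; all names (`dispatch_send`, `defer`, `then`) are illustrative stand-ins, not weave's executor API:

```python
# Sketch of the then-vs-defer dispatch rule applied at every save site:
# if the value holds unresolved digest futures, chain the send as a
# continuation instead of occupying a worker that would block in to_json.
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable

executor = ThreadPoolExecutor(max_workers=2)

def defer(fn: Callable[[], None]) -> None:
    """Fast path: schedule the send on a worker immediately."""
    executor.submit(fn)

def then(pending: list, fn: Callable[[], None]) -> None:
    """Run `fn` on a worker only once every future in `pending` resolves.

    The caller registers callbacks and returns; no worker slot is held.
    """
    remaining = [len(pending)]
    lock = threading.Lock()
    def on_done(_f: Future) -> None:
        with lock:
            remaining[0] -= 1
            fire = remaining[0] == 0
        if fire:
            executor.submit(fn)
    for f in pending:
        f.add_done_callback(on_done)

def dispatch_send(pending: list, send: Callable[[], None]) -> None:
    if pending:
        then(pending, send)  # gated: continuation fires after digests resolve
    else:
        defer(send)          # nothing pending: schedule directly

# Demo: the send only runs after the simulated digest future resolves.
ran = threading.Event()
digest: Future = Future()
dispatch_send([digest], ran.set)
assert not ran.wait(0.1)  # still gated on the digest
digest.set_result("abc123")
assert ran.wait(2.0)      # continuation fired on a worker
print("send ran after digest resolved")
```

Because the gate is checked before scheduling, a worker never enters the send while a digest it needs is still owed by another queued task.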
TL;DR
Cuts per-op calling-thread CPU ~5× in async tracing (p50 stall 176ms → 38ms in the heavy-GIL bench) by deferring the output-side `_save_nested_objects` + `map_to_refs` + `to_json` from `finish_call` into the worker pool. A structural fix to `_save_object_basic` / `_save_table` keeps a naive defer from deadlocking on thread-pool starvation. Builds on #6738 (the conservative subset). Linked: WB-33844.
How the work moves off the calling thread
What changes
- `finish_call`: `_save_nested_objects`, `map_to_refs`, redact, and `to_json` move into the worker pool. `call.output` is still set synchronously, so user code reading it after `finish_call` returns is unchanged.
- `_save_object_basic`, `_save_table`, and the chunked-table paths now go through one `_defer_after_pending_digests` helper that pre-walks the value and uses `then(pending, send)` instead of `defer(send)`. The worker registering the callback returns to the pool instead of blocking inside `to_json -> digest.result()`.
- `send_end_call` is split into Phase 1 (`schedule_send` — walks + collect_pending) and Phase 2 (`finalize_send` — redact + `to_json` + `call_end`). A barrier `end_complete_future` is tracked in the executor so `_flush()` waits for the whole pipeline; `log_exception=True` keeps the same "Task failed" log line existing resilience tests assert on.

Calling-thread contract (unchanged)
- `postprocess_output` and `_on_finish_handler` still raise on the calling thread.
- `call.output` is set to the postprocessed value before `finish_call` returns.
- `call.summary` is computed and set before `finish_call` returns.

Risks
- Output refs now attach to `call.output` after `finish_call` returns. Today (master) refs attach before `finish_call` returns. User code that synchronously inspects `output.ref` immediately after `finish_call` may briefly see no ref attached. Behavior change; not expected to affect any current pattern, but called out.
- Touches `_save_object_basic`, `_save_table`, both chunked-table paths, and `finish_call`.
- `_track_future` is private. Used deliberately to register the barrier future; will need an update if the futures module evolves.

Bench
Heavy GIL pressure (1460 ops/step, 30 steps). Median per-step is the stable comparison; avg is dominated by occasional GC stalls under heavy GIL.
* Max under heavy GIL is dominated by occasional consecutive bad steps (a 9.54s outlier appears once in the 30-step run, then the system recovers). Median/p50 are the meaningful numbers — calling-thread time per op is cut roughly 5×.
Tests
`tests/trace/test_save_nested_defer.py` (16 tests):

- `_collect_pending_digests` covering primitives, nested containers, ObjectRef/OpRef/TableRef (resolved + pending), TableRef with both digest and row_digests pending, ObjectRecord, cycles, and the resolved-future short-circuit.
- A `call_end` failure surfaces as "Task failed" via `_track_future` (regression guard for `log_exception=True`); `finish_call` returns before the deferred work runs.

Existing suites green:
- `trace` shard (839 passed), `trace_calls_complete_only` shard (839 passed).
- `test_evaluation_performance` 5× consecutive at default parallelism (the deadlock fixture; would hang without the structural gate).
- `test_tracing_resilience` (29/29).

Test plan
- `trace` and `trace_calls_complete_only` shards green
- Deadlock fixture (`test_evaluation_performance`) green at default parallelism, 5× consecutive

Related
- Bench script: `~/Desktop/local_testing/weavetest/qa-scripts/bench_async_overhead.py` + `bench_history.jsonl`.

Deep dive: why naive defer deadlocks (and how the gate fixes it)
Part 1 alone — just deferring the output walks, no gate — is a one-line change, but it reproduces a real deadlock against `test_evaluation_performance::paused_client`: `_save_object_basic`'s `send_obj_create` calls `to_json(parent_with_refs)`, which synchronously waits on each nested ref's `digest_future.result()`. Under high concurrency every main-pool worker can land in `to_json` at once, each blocked on a child whose `obj_create` is queued behind it in the same executor — classic thread-pool starvation. Master happens to escape by ~1 worker of timing headroom; the deferral consumes it.

Part 2 — this PR — restructures the dispatch so workers don't synchronously block on digest futures. At every save site that calls `to_json(value_with_refs)`, we pre-walk `value` for unresolved digest futures via `_collect_pending_digests`. If any are pending, we register the save via `future_executor.then(pending, send)` instead of `defer(send)`. The worker that wanted to schedule the save attaches a callback and returns to the pool; the continuation runs once digests resolve, by which point `to_json` is non-blocking. No worker holds a slot waiting for another worker.

`_collect_pending_digests` walks any value (primitives, dicts, lists, tuples, `ObjectRecord`, Refs) and collects unresolved digest / row-digest futures. It is cycle-safe via a `seen` set, skips done futures, and treats Refs as leaves (it does not recurse into their fields).
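The starvation mechanism is easy to reproduce in miniature. This toy (assumed shapes, not weave code) shrinks "every worker occupied" down to a one-worker pool: the parent blocks the only worker waiting on a digest whose producing task is queued behind it, so neither can make progress:

```python
# Minimal reproduction of thread-pool starvation: a worker blocks on a
# future whose producing task is queued behind it in the same pool.
from concurrent.futures import Future, ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutTimeout

pool = ThreadPoolExecutor(max_workers=1)  # "every worker occupied", shrunk to 1
child_digest: Future = Future()

def parent_to_json() -> str:
    # Blocks the only worker waiting on the child's digest...
    return child_digest.result(timeout=0.5)

def child_obj_create() -> None:
    child_digest.set_result("abc123")

parent = pool.submit(parent_to_json)
pool.submit(child_obj_create)  # ...which is queued behind it: starvation.
try:
    parent.result()
    print("completed")
except FutTimeout:
    print("starved: child obj_create never ran while parent held the worker")
pool.shutdown(wait=True)
```

With the gate, the parent would never be submitted while `child_digest` is pending; it would be attached as a continuation via `then`, leaving the worker free to run `child_obj_create` first.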