From 2ec9341aae6242d75cfe1aa65f0fe254eabfdfa0 Mon Sep 17 00:00:00 2001 From: Matthew A Johnson Date: Tue, 28 Apr 2026 23:22:31 +0100 Subject: [PATCH] verona-rt MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Verona-RT-style work-stealing scheduler, C source split into per-subsystem translation units, and a portable atomics / threading layer. **New Features** - **Work-stealing scheduler** — the single behavior queue has been replaced with a Verona-RT-inspired distributed scheduler. Each worker owns a Multi-Producer Multi-Consumer behavior queue (`boc_bq_*`, ported from `verona-rt/src/rt/sched/mpmcq.h`), pops work from its own queue first and steals from peers when empty. Idle workers park on a per-worker condition variable and are signalled directly by the producer / victim, eliminating the central wakeup broadcast. Per-worker statistics (steals, parks, fast/slow pops, dispatches) are exposed for benchmarking. A miniature sketch of the dispatch policy follows this list. - **Per-worker fairness tokens** — each worker advances a token node through its own queue so that long-running behaviors cannot monopolise dispatch slots. The token is also used to drive the cooperative shutdown handshake. - **`compat.h` / `compat.c` portability layer** — a single header now exposes uniform `BOCMutex`, `BOCCond`, `boc_atomic_*_explicit`, monotonic-time, and sleep primitives across MSVC, pthreads, and C11 `<stdatomic.h>`. The work-stealing scheduler depends on the typed-atomics API for ARM64-correct memory ordering on Windows. - **`xidata.h` cross-interpreter shim** — the `#if PY_VERSION_HEX` ladders for the `_PyXIData_*` / `_PyCrossInterpreterData_*` APIs that previously lived in both `_core.c` and `_math.c` have been centralised in one header covering CPython 3.12 through 3.15 (including free-threaded builds). - **`fanout_benchmark` example** — a fan-out / fan-in benchmark harness exercising scheduler throughput under heavy producer load.
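The dispatch policy is easiest to see in miniature. Below is a short Python sketch of the pop-own / steal / park cycle the C scheduler implements; everything here (`Worker`, `push`, `run`) is illustrative and is not the `boc_sched_*` API — the real implementation uses the lock-free `boc_bq_*` MPMC queue and typed atomics rather than a lock-protected deque.

```python
import collections
import random
import threading


class Worker:
    """Illustrative model of one scheduler worker (not the C API)."""

    def __init__(self, peers):
        self.queue = collections.deque()  # stand-in for the MPMC boc_bq_* queue
        self.lock = threading.Lock()
        self.cv = threading.Condition(self.lock)  # per-worker park/unpark
        self.peers = peers  # list of all workers, shared

    def push(self, behavior):
        """Producer side: enqueue and signal this worker directly."""
        with self.cv:
            self.queue.append(behavior)
            self.cv.notify()

    def _next(self):
        # Fast path: pop from our own queue first.
        with self.lock:
            if self.queue:
                return self.queue.popleft()
        # Slow path: steal from a randomly chosen peer.
        victim = random.choice(self.peers)
        if victim is not self:
            with victim.lock:
                if victim.queue:
                    return victim.queue.pop()
        return None

    def run(self, stopping: threading.Event):
        while not stopping.is_set():
            behavior = self._next()
            if behavior is not None:
                behavior()  # dispatch
            else:
                # Park; the timeout papers over the push/park race that the
                # real scheduler closes with its token handshake.
                with self.cv:
                    self.cv.wait(timeout=0.01)
```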
**Improvements** - **In-memory transpiled-module loading** — workers no longer write the transpiled module to a temporary directory and import it through `importlib.util.spec_from_file_location`. Instead, the transpiled source is embedded as a string literal in the worker bootstrap and `exec`'d into a fresh `types.ModuleType` registered in `sys.modules`. The source is also published to `linecache` so tracebacks still point at the transpiled lines. This removes the `export_dir` argument from `start()` (and the matching tempdir cleanup in `wait()`/`stop()`), eliminates a filesystem round-trip on every worker startup, and avoids leaving `.py` files behind on abnormal exit. Module names are validated as dotted Python identifiers at the boundary, and `__main__` is re-aliased to `__bocmain__` inside workers so a follow-up `start()` observes a clean `sys.modules`. A sketch of the loading scheme follows this list. - **Nested `@when` capture** — the transpiler now recurses into `@when`-decorated nested functions when computing the outer behavior's captures, so a behavior body can schedule child behaviors that close over the outer frame's free names without raising `NameError` at dispatch time. - **C extension split into subsystem TUs** — `_core.c` has been reduced from ~5,000 lines to ~3,500 by extracting `sched.{c,h}` (work-stealing scheduler), `noticeboard.{c,h}`, `terminator.{c,h}`, `tags.{c,h}` (message-queue tag table), `cown.h` (cown refcount helpers), and `compat.{c,h}` / `xidata.h` into separate translation units. Every public function now has a header declaration with Doxygen-style documentation. - **Direct dispatch on cown release** — `behavior_release_all` now hands a resolved successor directly to a worker via the work-stealing dispatch path (`boc_sched_dispatch`) instead of re-entering the central scheduler, removing one queue hop per cown handoff. - **Cooperative worker shutdown** — `boc_sched_worker_request_stop_all` and `boc_sched_unpause_all` provide a clean stop/drain protocol that interacts correctly with parked workers and the terminator.
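In outline, the in-memory loading path amounts to the following sketch. The function name `load_transpiled` and the `<bocpy:...>` pseudo-filename convention are illustrative, not the actual bootstrap; the real code additionally validates the dotted module name and handles the `__bocmain__` aliasing described above.

```python
import linecache
import sys
import types


def load_transpiled(name: str, source: str) -> types.ModuleType:
    """Exec transpiled source into a fresh module with no filesystem I/O."""
    module = types.ModuleType(name)
    filename = f"<bocpy:{name}>"  # pseudo-filename; nothing is written to disk
    module.__file__ = filename
    sys.modules[name] = module
    # Publish the source to linecache so tracebacks can still show the
    # transpiled lines even though there is no file to read them from.
    linecache.cache[filename] = (
        len(source), None, source.splitlines(keepends=True), filename)
    exec(compile(source, filename, "exec"), module.__dict__)
    return module
```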
**Internal Test Modules** - **`_internal_test_atomics`** — pytest-driven correctness tests for the `compat.h` typed-atomics API on every supported platform. - **`_internal_test_bq`** — torture tests for the MPMC behavior queue (`boc_bq_*`), covering segmented dequeue, FIFO fairness, and concurrent producer / consumer races. - **`_internal_test_wsq`** — tests for the work-stealing primitives (fast pop, slow pop, steal, park / unpark handshake). **Test Suite** - New scheduler test files — `test_scheduler_integration.py`, `test_scheduler_stats.py`, and `test_scheduler_steal.py` — exercise the distributed scheduler end to end, while `test_internal_mpmcq.py` and `test_internal_wsq.py` exercise the queue and work-stealing primitives individually. - `test_compat_atomics.py` — Python-level smoke tests for the portable atomics layer. - `test_stop_retry_composition.py` — covers `stop()` / `start()` / `wait()` retry composition across multiple runtime cycles. - `test_scheduling_stress.py` substantially expanded with new fan-out, work-stealing, and shutdown stress scenarios. - `test_boc.py` and `test_transpiler.py` extended with regression cases discovered during the scheduler rewrite. Signed-off-by: Matthew A Johnson --- .flake8 | 1 + .github/copilot-instructions.md | 14 + .github/skills/branch-review/SKILL.md | 125 +- .github/skills/multi-perspective-plan/SKILL.md | 201 +- .github/workflows/pr_gate.yml | 8 + CHANGELOG.md | 96 + CITATION.cff | 4 +- examples/benchmark.py | 2400 ++++++++-------- examples/fanout_benchmark.py | 884 ++++++ pyproject.toml | 2 +- setup.py | 61 +- sphinx/source/conf.py | 2 +- src/bocpy/__init__.pyi | 59 +- src/bocpy/_core.c | 2551 +++++++---------- src/bocpy/_core.pyi | 66 + src/bocpy/_internal_test.c | 73 + src/bocpy/_internal_test_atomics.c | 423 +++ src/bocpy/_internal_test_bq.c | 347 +++ src/bocpy/_internal_test_wsq.c | 346 +++ src/bocpy/_math.c | 103 +- src/bocpy/behaviors.py | 676 +++-- src/bocpy/compat.c | 103 + src/bocpy/compat.h | 935 ++++++ src/bocpy/cown.h | 40 + src/bocpy/noticeboard.c | 704 +++++ src/bocpy/noticeboard.h | 156 + src/bocpy/sched.c | 1383 +++++++++ src/bocpy/sched.h | 936 ++++++ src/bocpy/tags.c | 108 + src/bocpy/tags.h | 113 + src/bocpy/terminator.c | 120 + src/bocpy/terminator.h | 84 + src/bocpy/transpiler.py | 24 + src/bocpy/worker.py | 212 +- src/bocpy/xidata.h | 206 ++ test/test_boc.py | 444 ++- test/test_compat_atomics.py | 195 ++ test/test_internal_mpmcq.py | 196 ++ test/test_internal_wsq.py | 124 + test/test_matrix.py | 3 +- test/test_message_queue.py | 41 +- test/test_noticeboard.py | 3 +- test/test_scheduler_integration.py | 199 ++ test/test_scheduler_stats.py | 257 ++ test/test_scheduler_steal.py | 252 ++ test/test_scheduling_stress.py | 387 ++- test/test_stop_retry_composition.py | 175 ++ test/test_transpiler.py | 73 + 48 files changed, 12708 insertions(+), 3207 deletions(-) create mode 100644 examples/fanout_benchmark.py create mode 100644 src/bocpy/_core.pyi create mode 100644 src/bocpy/_internal_test.c create mode 100644 src/bocpy/_internal_test_atomics.c create mode 100644 src/bocpy/_internal_test_bq.c create mode 100644 src/bocpy/_internal_test_wsq.c create mode 100644 src/bocpy/compat.c create mode 100644 src/bocpy/compat.h create mode 100644 src/bocpy/cown.h create mode 100644 src/bocpy/noticeboard.c create mode 100644 src/bocpy/noticeboard.h create mode 100644 src/bocpy/sched.c create mode 100644 src/bocpy/sched.h create mode 100644 src/bocpy/tags.c create mode 100644 src/bocpy/tags.h create mode 100644 src/bocpy/terminator.c create mode 100644 src/bocpy/terminator.h create mode 100644 src/bocpy/xidata.h create mode 100644 test/test_compat_atomics.py create mode 100644 test/test_internal_mpmcq.py create mode 100644 test/test_internal_wsq.py create mode 100644 test/test_scheduler_integration.py create mode 100644 test/test_scheduler_stats.py create mode 100644 test/test_scheduler_steal.py create mode 100644 test/test_stop_retry_composition.py diff --git a/.flake8 b/.flake8 index 3b3ce69..880c19d 100644 --- a/.flake8 +++ b/.flake8 @@ -4,6 +4,7 @@ inline-quotes = double import-order-style = google docstring-convention = google max-line-length = 120 +application-import-names = bocpy,examples extend-ignore = E203, N812, N817 diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md index 3bbe3fa..0c9c9bd 100644 --- a/.github/copilot-instructions.md +++ b/.github/copilot-instructions.md @@ -162,6 +162,20 @@ pip install -e .[linting] # linting deps flake8 src/ test/ # lint check ``` +The private `bocpy._internal_test` C extension (used by +`test_internal_mpmcq.py`, `test_internal_wsq.py`, and +`test_compat_atomics.py`) is **not** built by default — it is gated off +in [setup.py](../setup.py) so it never ships in distributed wheels. To +run those test files locally, opt in at install time: + +```bash +BOCPY_BUILD_INTERNAL_TESTS=1 pip install -e .[test] +``` + +Without the env var, the affected tests skip cleanly via +`pytest.importorskip`. CI sets the variable at the workflow level in +`.github/workflows/pr_gate.yml`. + Never run `pip`, `pytest`, `python`, or any project command outside the activated venv. If you need to validate a fix against more than one Python version, re-install and re-run the suite in each relevant venv.
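The skip-gating mentioned in the instructions above is the standard `pytest.importorskip` pattern. As a sketch, each gated test module can begin like this — the variable name is arbitrary; only the `bocpy._internal_test` module path comes from the text above:

```python
import pytest

# Skips the entire module cleanly when the opt-in extension was not built
# (i.e. BOCPY_BUILD_INTERNAL_TESTS=1 was not set at install time).
internal = pytest.importorskip("bocpy._internal_test")
```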
diff --git a/.github/skills/branch-review/SKILL.md b/.github/skills/branch-review/SKILL.md index 4397d46..40e124c 100644 --- a/.github/skills/branch-review/SKILL.md +++ b/.github/skills/branch-review/SKILL.md @@ -1,6 +1,6 @@ --- name: branch-review -description: "Multi-perspective code review for a branch before merging. Use when: reviewing a branch, preparing a PR, pre-merge review, auditing a feature branch, or when /branch-review is invoked. Spawns three constructive reviewer subagents (correctness, security, usability), then runs an adversarial gap analysis to find what they missed, and synthesizes all findings into a unified review report." +description: "Multi-perspective code review for a branch before merging. Use when: reviewing a branch, preparing a PR, pre-merge review, auditing a feature branch, or when /branch-review is invoked. Spawns three constructive reviewer subagents (correctness, security, usability), then runs an adversarial gap analysis to find what they missed, and synthesizes all findings into a unified review report. All intermediate artifacts are persisted to .copilot/ so the process can be restarted from any step." argument-hint: "Branch name or merge target (e.g. 'main' or 'feature/foo -> main')" --- @@ -27,19 +27,68 @@ Findings use the same severity scale as the **review-loop** skill: | **medium** | Code smell, unclear logic, missing edge case, or maintainability concern. Recommended fix. | | **low** | Style nit, naming suggestion, minor improvement. Fix at discretion. | +## Persistence and Restart + +Every intermediate artifact produced by this skill is written to disk under +`.copilot/reviews/<slug>/`, where `<slug>` is a short kebab-case name derived +from the branch under review (e.g. `work-stealing-scheduler` for a branch +named `feature/work-stealing-scheduler`). This makes the process **fully +resumable**: if any step fails, is interrupted, or produces an unsatisfactory +result, you can re-run only the affected step using the on-disk artifacts +from prior steps as input. + +### Directory layout + +``` +.copilot/reviews/<slug>/ +├── 00-context.md # Step 2 output (shared context block) +├── 00-diff.patch # Step 1 raw diff +├── 00-changed-files.txt # Step 1 file list +├── 10-review-correctness-lens.md # Step 3 outputs (one per lens) +├── 10-review-security-lens.md +├── 10-review-usability-lens.md +├── 20-adversarial.md # Step 4 output +├── 30-synthesis.md # Step 5 output (deduped findings) +├── 40-report.md # Step 6 output (final unified report) +├── 50-fixes-iter1.md # Step 7 notes (per fix pass, optional) +├── 50-fixes-iter2.md +└── ... +``` + +Numeric prefixes preserve chronological order. The `<slug>` directory is +created at step 1 and reused for the whole run. If the same branch is +re-reviewed after fixes (step 8 loop-back), append a generation suffix +(e.g. `-r2/`) rather than overwriting the prior review. + +### Restart contract + +At the start of every step, **check whether the corresponding output file +already exists**. If it does: + +- Either reuse it (skip re-running the step), or +- Explicitly overwrite it (re-run the step from scratch). + +Ask the user which to do if the choice is non-obvious. Never silently discard +an existing artifact. + +When the user asks to "restart from step N", load all artifacts numbered +below N into context and re-run from step N onward. + ## Procedure ### 1. Gather the Diff -Determine the branch and its merge target (default: `main`). Collect the diff -using one of these methods, in order of preference: +Determine the branch and its merge target (default: `main`) and derive the +slug. Create `.copilot/reviews/<slug>/` if it does not already exist. + +Collect the diff using one of these methods, in order of preference: 1. `git diff <target>...<branch> -- . ':!*.lock'` — full diff against - the merge base. + the merge base. Save to `00-diff.patch`. 2. `get_changed_files` — if the working tree has uncommitted changes that are part of the review. -Also collect the list of changed files: +Also collect the list of changed files and save to `00-changed-files.txt`: ``` git diff --name-only <target>...<branch> @@ -50,14 +99,22 @@ diff and the surrounding context. ### 2. Build the Context Block -Assemble a context block that every reviewer will receive. It must include: +Assemble a context block that every reviewer will receive and write it to +`.copilot/reviews/<slug>/00-context.md`. This file must be self-contained: +any subagent reading it should have everything it needs without further file +lookups (beyond the diff/changed-files artifacts referenced by path). Include: -- **Diff** — the full unified diff. -- **Changed files** — full current content of each modified file. +- **Branch and merge target** — branch name, base, commit range.
+- **Diff** — the full unified diff (or a reference to `00-diff.patch` if + large, with key hunks inlined). +- **Changed files** — list from `00-changed-files.txt` plus full current + content of each modified file (or excerpts with line ranges if very large). - **Related tests** — content of test files that cover the changed code, if identifiable. - **Project conventions** — brief summary of relevant conventions from `copilot-instructions.md` (style, commenting, error handling, etc.). +- **Prior audits** — pointers to any prior review artifacts the user has + flagged as already-covered (so reviewers know what is in/out of scope). Keep the context block identical across all four reviewers to ensure a fair comparison. @@ -65,8 +122,8 @@ comparison. ### 3. Spawn Three Constructive Reviewer Lens Subagents Launch three subagents **in parallel**, each using a named lens agent operating -in **review mode**. Each receives the context block and must return findings in -the severity-tagged format defined above. +in **review mode**. Each receives the context block (by path) and must return +findings in the severity-tagged format defined above. | # | Agent | Focus | |---|-------|-------| @@ -76,8 +133,11 @@ the severity-tagged format defined above. Each subagent prompt must include: -- The shared context block +- A directive to read `.copilot/reviews/<slug>/00-context.md` as its context - An instruction to operate in **review mode** +- A directive to **write the resulting findings to** + `.copilot/reviews/<slug>/10-review-<lens>.md` and return a brief + confirmation plus the file path - These instructions: > Review the diff and changed files from the perspective described above. @@ -94,6 +154,9 @@ Each subagent prompt must include: > Do NOT fabricate issues. Only report genuine problems. > Order findings by severity (critical first). +After the subagents return, verify all three `10-review-*.md` files exist +before continuing. + ### 4. Adversarial Gap Analysis After the three constructive reviewers return, spawn a fresh `adversarial-lens` @@ -103,13 +166,15 @@ others missed. The adversarial subagent prompt must include: -- The shared context block -- The full list of findings from the three constructive reviewers +- A directive to read `.copilot/reviews/<slug>/00-context.md` and all three + `.copilot/reviews/<slug>/10-review-*.md` files +- A directive to write its findings to + `.copilot/reviews/<slug>/20-adversarial.md` - These instructions: - > You are the adversarial reviewer. The findings below were produced by three - > constructive reviewers (correctness, security, usability). Your job is to - > find what they missed. + > You are the adversarial reviewer. The findings in the `10-review-*.md` + > files were produced by three constructive reviewers (correctness, + > security, usability). Your job is to find what they missed. > > Focus on: > - Code sections covered by NO existing finding (overlooked areas) @@ -128,14 +193,16 @@ The adversarial subagent prompt must include: > > where SEVERITY is one of: critical, high, medium, low. > - > If the existing findings are comprehensive and you find no gaps, state - > explicitly: "No additional issues found." + > If the existing findings are comprehensive and you find no gaps, the + > file must contain exactly: "No additional issues found." > Do NOT duplicate issues already reported. Only report NEW problems. > Order findings by severity (critical first). ### 5.
Deduplicate and Synthesize -After all four reviewers (three constructive + adversarial) have returned: +Read all four reviewer outputs (`10-review-*.md` and `20-adversarial.md`) +and write a synthesized findings list to +`.copilot/reviews/<slug>/30-synthesis.md`: 1. **Merge duplicates.** If multiple reviewers flag the same issue, keep the most detailed version and note which perspectives flagged it (higher @@ -148,10 +215,13 @@ After all four reviewers (three constructive + adversarial) have returned: or construct a minimal reproduction. Mark any finding you cannot verify as **[unverified]**. +Each synthesized finding should retain its severity tag and a "Flagged by" +attribution listing the contributing lenses. + ### 6. Present the Report -Present a single unified review report to the user with these sections, in -order: +Assemble the final report at `.copilot/reviews/<slug>/40-report.md` and +present it to the user. The report must contain these sections, in order: 1. **Summary** — one-paragraph overview: number of findings by severity, overall assessment (e.g., "ready to merge with minor fixes" or "has blocking issues"). @@ -191,7 +261,14 @@ For each approved finding: 2. Confirm each fix briefly as it is applied. 3. Run relevant tests after each fix to verify no regressions. -If a fix is ambiguous or touches architecture, ask the user for guidance. +Record a short summary of the pass to +`.copilot/reviews/<slug>/50-fixes-iter<i>.md` (incrementing `i` for each +re-review pass) noting which findings were addressed, which were deferred, +and any test results. This makes it possible to resume mid-remediation if +the session is interrupted. + +If a fix is ambiguous or touches architecture, ask the user for guidance and +record the decision in the same `50-fixes-iter<i>.md` file. ### 8. Check Exit or Re-review @@ -200,7 +277,9 @@ After all approved fixes are applied: > All approved fixes have been applied and tests pass. Should I run another > review pass on the updated diff, or is the branch ready to merge? -- If the user wants another pass → go to **step 1** with the updated diff. +- If the user wants another pass → create a new generation directory + (e.g. `<slug>-r2/`) and go to **step 1** with the updated diff. The prior + review's artifacts remain on disk for reference. - If the user is satisfied → exit. ## Guidelines diff --git a/.github/skills/multi-perspective-plan/SKILL.md b/.github/skills/multi-perspective-plan/SKILL.md index e25eea0..74785be 100644 --- a/.github/skills/multi-perspective-plan/SKILL.md +++ b/.github/skills/multi-perspective-plan/SKILL.md @@ -1,6 +1,6 @@ --- name: multi-perspective-plan -description: "Multi-perspective planning with rebuttal rounds and adversarial review loop. Use when: planning complex changes, designing architecture, evaluating implementation strategies, drafting implementation plans, or when /plan is invoked. Spawns three planner subagents, runs rebuttals on disagreements, synthesizes their outputs, then iteratively hardens the plan through an adversarial review loop until it passes scrutiny." +description: "Multi-perspective planning with rebuttal rounds and adversarial review loop. Use when: planning complex changes, designing architecture, evaluating implementation strategies, drafting implementation plans, or when /plan is invoked. Spawns three planner subagents, runs rebuttals on disagreements, synthesizes their outputs, then iteratively hardens the plan through an adversarial review loop until it passes scrutiny.
All intermediate artifacts are persisted to .copilot/ so the process can be restarted from any step." argument-hint: "Describe the change or feature to plan" --- @@ -16,14 +16,70 @@ loop. - Evaluating architecture or design trade-offs - Any time you want a plan stress-tested before implementation +## Persistence and Restart + +Every intermediate artifact produced by this skill is written to disk under +`.copilot/plans/<slug>/`, where `<slug>` is a short kebab-case name derived +from the planning task (e.g. `work-stealing-scheduler`). This makes the +process **fully resumable**: if any step fails, is interrupted, or produces +an unsatisfactory result, you can re-run only the affected step using the +on-disk artifacts from prior steps as input. + +### Directory layout + +``` +.copilot/plans/<slug>/ +├── 00-context.md # Step 1 output +├── 10-plan-speed-lens.md # Step 2 outputs (one per lens) +├── 10-plan-usability-lens.md +├── 10-plan-conservative-lens.md +├── 20-analysis.md # Step 3 output +├── 30-rebuttal-<topic>-<lens>.md # Step 4 outputs (one per lens per topic) +├── 40-draft-plan.md # Step 5 output +├── 50-adversarial-iter1.md # Step 6a output, iteration 1 +├── 50-revisions-iter1.md # Step 6b notes for iteration 1 +├── 50-adversarial-iter2.md +├── 50-revisions-iter2.md +├── ... +└── 99-final-plan.md # Step 7 output +``` + +Numeric prefixes preserve chronological order. The `<slug>` directory is +created at step 1 and reused for the whole run. + +### Restart contract + +At the start of every step, **check whether the corresponding output file +already exists**. If it does: + +- Either reuse it (skip re-running the step), or +- Explicitly overwrite it (re-run the step from scratch). + +Ask the user which to do if the choice is non-obvious. Never silently discard +an existing artifact. + +When the user asks to "restart from step N", load all artifacts numbered +below N into context and re-run from step N onward. + ## Procedure ### 1. Gather Context Before spawning planners, collect enough context about the target code so each subagent can work from the same facts. Read the relevant source files and tests. -Summarize the current state in a brief context block that will be included in -every subagent prompt. + +Write the context block to `.copilot/plans/<slug>/00-context.md`. This file +must be self-contained: any subagent reading it should have everything it +needs without further file lookups. Include: + +- The planning task as stated by the user +- A summary of the current state of the relevant code +- Key file paths and line ranges that matter +- Any constraints or invariants the plan must respect +- Pointers to related artifacts (sketches, prior plans, benchmark JSONs) + +If a sketch document already exists (e.g. `.copilot/<slug>.md`), reference it +from `00-context.md` rather than duplicating its contents. ### 2. Spawn Three Planner Lens Subagents @@ -39,79 +95,88 @@ implementation plan (not just commentary). Each subagent prompt must include: -- The shared context block +- A directive to read `.copilot/plans/<slug>/00-context.md` as its context - An instruction to operate in **planning mode** - A request for a **numbered step-by-step plan** with rationale per step - A request for **risks and mitigations** specific to their perspective +- A directive to **write the resulting plan to** + `.copilot/plans/<slug>/10-plan-<lens>.md` and return a brief confirmation + plus the file path + +After the subagents return, verify all three files exist before continuing. ### 3. Review the Three Plans -After all three subagents return, review their outputs yourself.
Write a brief -analysis noting: +Read all three `10-plan-*.md` files. Write a brief analysis to +`.copilot/plans/<slug>/20-analysis.md` noting: - Points of agreement (high-confidence decisions) -- Points of disagreement (trade-offs to resolve) +- Points of disagreement (trade-offs to resolve), each labelled with a short + topic slug for use in step 4 filenames - Any gaps none of the planners addressed ### 4. Rebuttals (If Disagreements Exist) -If step 3 identified points of disagreement, run a rebuttal round. +If `20-analysis.md` lists any disagreements, run a rebuttal round. -For **each disagreement**, identify which lenses hold competing positions. Then -spawn those lenses **in parallel** as fresh subagents operating in **rebuttal -mode**. Each subagent receives: +For **each disagreement topic**, identify which lenses hold competing +positions. Spawn those lenses **in parallel** as fresh subagents operating +in **rebuttal mode**. Each subagent receives: -- The specific point of disagreement -- Its own original recommendation -- The competing recommendation(s) from the other lens(es) +- The path to `00-context.md` +- The specific point of disagreement (quoted from `20-analysis.md`) +- The path to its own original plan and the competing plan(s) - An instruction to argue concisely for why its approach is best and why the alternatives are inferior — one turn only +- A directive to write its rebuttal to + `.copilot/plans/<slug>/30-rebuttal-<topic>-<lens>.md` -Collect the rebuttals. If there are **no disagreements**, skip this step -entirely. +If there are **no disagreements**, skip this step. Record that fact in +`20-analysis.md` so a restarted run knows step 4 is intentionally empty. ### 5. Synthesize -Send all three original plans, **your analysis from step 3**, and **any -rebuttals from step 4** to a `synthesis-lens` subagent operating in **planning -mode**. +Spawn a `synthesis-lens` subagent operating in **planning mode**. Its prompt +must direct it to read: + +- `00-context.md` +- All three `10-plan-*.md` files +- `20-analysis.md` +- All `30-rebuttal-*.md` files (if any) -The subagent must produce a numbered step-by-step implementation sequence with -clear rationale. For each disagreement, it must pick one option and justify the -choice by engaging with the rebuttal arguments — not ignoring or averaging -them. Flag any unresolved risks. +The subagent must produce a numbered step-by-step implementation sequence +with clear rationale, written to `.copilot/plans/<slug>/40-draft-plan.md`. +For each disagreement, it must pick one option and justify the choice by +engaging with the rebuttal arguments — not ignoring or averaging them. Flag +any unresolved risks. -If the synthesis agent reports any **unresolved disagreements** (trade-offs it -could not resolve), **stop and present them to the user**. For each unresolved -item, show: +If the synthesis agent reports any **unresolved disagreements** (trade-offs +it could not resolve), **stop and present them to the user**. For each +unresolved item, show: - The competing options with their lens attribution - The key argument from each side's rebuttal - Why the choice matters -Wait for the user to decide before proceeding. Incorporate the user's decisions -into the plan. - -The output of this step is the **draft plan**. +Wait for the user to decide before proceeding. Incorporate the user's +decisions into `40-draft-plan.md` directly. ### 6. Adversarial Review Loop -Iteratively harden the draft plan by running adversarial reviews until the plan -passes scrutiny.
Each iteration proceeds as follows: +Iteratively harden the draft plan by running adversarial reviews until the +plan passes scrutiny. Each iteration `i` (starting at 1) proceeds as follows: #### 6a. Spawn Adversarial Reviewer -Launch a fresh `adversarial-lens` subagent operating in **planning mode** with -the following prompt structure: +Launch a fresh `adversarial-lens` subagent operating in **planning mode**. +Its prompt must direct it to read `00-context.md` and the current plan +(initially `40-draft-plan.md`, then the most recently revised version) and +to write its findings to `.copilot/plans/<slug>/50-adversarial-iter<i>.md` +using this structure: -> **Plan to review:** -> {include the full draft plan} -> -> **Codebase context:** -> {include the shared context block from step 1} +> **Plan reviewed:** <path> > -> **Instructions:** -> - For each issue found, report it in this exact format: +> For each issue found, report it in this exact format: > > **[SEVERITY] Short title** > - **Location:** plan step number @@ -120,42 +185,46 @@ the following prompt structure: > > where SEVERITY is one of: critical, high, medium, low. > -> - If the plan survives your scrutiny, state explicitly: "LGTM — no issues -> found." -> - Do NOT fabricate issues. Only report genuine problems. -> - Order findings by severity (critical first). +> If the plan survives scrutiny, the file must contain exactly: +> "LGTM — no issues found." +> +> Do NOT fabricate issues. Order findings by severity (critical first). + +#### 6b. Evaluate Findings and Revise -#### 6b. Evaluate Findings +Read `50-adversarial-iter<i>.md`: -After the adversarial reviewer returns: +- If it contains **"LGTM"**, the plan is final. Proceed to step 7. +- Otherwise, address the findings: + - **critical** / **high**: revise the plan to fix or mitigate. + - **medium**: revise if straightforward; otherwise document as a risk. + - **low**: note and move on. -- If the reviewer reports **"LGTM"** (no issues found), the plan is final. Proceed to step 7. -- If the reviewer reports findings, address them: - - For **critical** and **high** findings: revise the plan to fix or mitigate - each issue. Update the draft plan in-place. - - For **medium** findings: revise if the fix is straightforward; otherwise - add as a documented risk in the plan. - - For **low** findings: note and move on. +Update the draft plan **in place** at the same path it was loaded from. +Write a short note to `.copilot/plans/<slug>/50-revisions-iter<i>.md` +summarising which findings were addressed and how, and which were +deliberately deferred or rejected. #### 6c. Check for Stuck State -If after addressing findings you are **unsure how to proceed** — for example, -the adversarial reviewer raises a concern that conflicts with a core +If you are unsure how to proceed — e.g. a concern conflicts with a core requirement, or two mitigations are mutually exclusive — **stop and ask the -user** for guidance. Present the specific dilemma and the options you see. +user**. Present the dilemma and the options. Save the user's decision into +`50-revisions-iter<i>.md` so a restart can recover it. #### 6d. Repeat -Go back to step 6a with the revised plan. Use a fresh subagent each time (no -memory of previous passes). +Increment `i` and go back to step 6a with the revised plan. Use a fresh +subagent each time (no memory of previous passes). -**Bound:** If the loop has run **3 times** without reaching LGTM, present the -current plan to the user with all remaining unresolved findings and ask how to -proceed.
+**Bound:** If the loop has run **3 times** (`50-adversarial-iter3.md` +exists and is not LGTM) without reaching LGTM, present the current plan to +the user with all remaining unresolved findings and ask how to proceed. ### 7. Present -Present the final plan to the user for approval. Clearly attribute which ideas -came from which perspective where relevant. Note any risks that survived the -adversarial review as known trade-offs. +Copy the final plan to `.copilot/plans/<slug>/99-final-plan.md` and present +it to the user for approval. Clearly attribute which ideas came from which +perspective where relevant. Note any risks that survived the adversarial +review as known trade-offs, and reference the iteration files that +documented their resolution. diff --git a/.github/workflows/pr_gate.yml b/.github/workflows/pr_gate.yml index f3ffa65..aa54214 100644 --- a/.github/workflows/pr_gate.yml +++ b/.github/workflows/pr_gate.yml @@ -5,6 +5,14 @@ on: branches: ["main"] workflow_dispatch: +# Build the private `bocpy._internal_test` C extension across all CI test +# jobs. It is gated off by default in setup.py so it never ships in +# distributed wheels; CI opts in here so the corresponding pytest modules +# (test_internal_*, test_compat_atomics) actually run instead of being +# skipped. +env: + BOCPY_BUILD_INTERNAL_TESTS: "1" + jobs: linting: runs-on: ubuntu-latest diff --git a/CHANGELOG.md b/CHANGELOG.md index fa78877..1033a33 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,3 +1,99 @@ +## 2026-04-29 - Version 0.5.0 +Verona-RT-style work-stealing scheduler, C source split into per-subsystem +translation units, and a portable atomics / threading layer. + +**New Features** + +- **Work-stealing scheduler** — the single behavior queue has been + replaced with a Verona-RT-inspired distributed scheduler. Each + worker owns a Multi-Producer Multi-Consumer behavior queue + (`boc_bq_*`, ported from `verona-rt/src/rt/sched/mpmcq.h`), pops + work from its own queue first, and steals from peers when empty. + Idle workers park on a per-worker condition variable and are + signalled directly by the producer / victim, eliminating the + central wakeup broadcast. Per-worker statistics (steals, parks, + fast/slow pops, dispatches) are exposed for benchmarking. +- **Per-worker fairness tokens** — each worker advances a token node + through its own queue so that long-running behaviors cannot + monopolise dispatch slots. The token is also used to drive the + cooperative shutdown handshake. +- **`compat.h` / `compat.c` portability layer** — a single header now + exposes uniform `BOCMutex`, `BOCCond`, `boc_atomic_*_explicit`, + monotonic-time, and sleep primitives across MSVC, pthreads, and + C11 `<stdatomic.h>`. The work-stealing scheduler depends on the + typed-atomics API for ARM64-correct memory ordering on Windows. +- **`xidata.h` cross-interpreter shim** — the `#if PY_VERSION_HEX` + ladders for the `_PyXIData_*` / `_PyCrossInterpreterData_*` APIs + that previously lived in both `_core.c` and `_math.c` have been + centralised in one header covering CPython 3.12 through 3.15 + (including free-threaded builds). +- **`fanout_benchmark` example** — a fan-out / fan-in benchmark + harness exercising scheduler throughput under heavy producer + load. + +**Improvements** + +- **In-memory transpiled-module loading** — workers no longer write + the transpiled module to a temporary directory and import it + through `importlib.util.spec_from_file_location`.
Instead, the + transpiled source is embedded as a string literal in the worker + bootstrap and `exec`'d into a fresh `types.ModuleType` registered + in `sys.modules`. The source is also published to `linecache` so + tracebacks still point at the transpiled lines. This removes the + `export_dir` argument from `start()` (and the matching tempdir + cleanup in `wait()`/`stop()`), eliminates a filesystem round-trip + on every worker startup, and avoids leaving `.py` files behind on + abnormal exit. Module names are validated as dotted Python + identifiers at the boundary, and `__main__` is re-aliased to + `__bocmain__` inside workers so a follow-up `start()` observes a + clean `sys.modules`. +- **Nested `@when` capture** — the transpiler now recurses into + `@when`-decorated nested functions when computing the outer + behavior's captures, so a behavior body can schedule child + behaviors that close over the outer frame's free names without + raising `NameError` at dispatch time. +- **C extension split into subsystem TUs** — `_core.c` has been + reduced from ~5,000 lines to ~3,500 by extracting `sched.{c,h}` + (work-stealing scheduler), `noticeboard.{c,h}`, `terminator.{c,h}`, + `tags.{c,h}` (message-queue tag table), `cown.h` (cown refcount + helpers), and `compat.{c,h}` / `xidata.h` into separate + translation units. Every public function now has a header + declaration with Doxygen-style documentation. +- **Direct dispatch on cown release** — `behavior_release_all` now + hands a resolved successor directly to a worker via the + work-stealing dispatch path (`boc_sched_dispatch`) instead of + re-entering the central scheduler, removing one queue hop per + cown handoff. +- **Cooperative worker shutdown** — `boc_sched_worker_request_stop_all` + and `boc_sched_unpause_all` provide a clean stop/drain protocol + that interacts correctly with parked workers and the terminator. + +**Internal Test Modules** + +- **`_internal_test_atomics`** — pytest-driven correctness tests for + the `compat.h` typed-atomics API on every supported platform. +- **`_internal_test_bq`** — torture tests for the MPMC behavior + queue (`boc_bq_*`), covering segmented dequeue, FIFO fairness, + and concurrent producer / consumer races. +- **`_internal_test_wsq`** — tests for the work-stealing primitives + (fast pop, slow pop, steal, park / unpark handshake). + +**Test Suite** + +- New scheduler test files — `test_scheduler_integration.py`, + `test_scheduler_stats.py`, and `test_scheduler_steal.py` — + exercise the distributed scheduler end to end, while + `test_internal_mpmcq.py` and `test_internal_wsq.py` exercise the + queue and work-stealing primitives individually. +- `test_compat_atomics.py` — Python-level smoke tests for the + portable atomics layer. +- `test_stop_retry_composition.py` — covers `stop()` / `start()` / + `wait()` retry composition across multiple runtime cycles. +- `test_scheduling_stress.py` substantially expanded with new + fan-out, work-stealing, and shutdown stress scenarios. +- `test_boc.py` and `test_transpiler.py` extended with regression + cases discovered during the scheduler rewrite. + ## 2026-04-17 - Version 0.4.0 Noticeboard, distributed scheduler, and a relocated examples package.
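To make the nested-`@when` capture improvement in the changelog above concrete, here is the shape of code that previously raised `NameError` at dispatch and now works. This is a sketch against the public `bocpy` API (`Cown`, `when`); `fan_out`, `scale`, and the behavior names are illustrative, not part of the library.

```python
from bocpy import Cown, when


def fan_out(parent: Cown, child: Cown) -> None:
    scale = 3  # a free name in the scheduling frame

    @when(parent)
    def _outer(p):
        # `scale` is used only inside the nested behavior below. The
        # capture analysis now recurses into _inner, so `scale` becomes
        # part of _outer's capture set instead of being missed and
        # raising NameError when _inner is dispatched.
        @when(child)
        def _inner(c):
            c.value = c.value + scale
```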
diff --git a/CITATION.cff b/CITATION.cff index 2169b20..d331a6a 100644 --- a/CITATION.cff +++ b/CITATION.cff @@ -5,6 +5,6 @@ authors: given-names: "Matthew Alastair" orcid: "https://orcid.org/0000-0002-1019-8036" title: "bocpy" -version: 0.4.0 -date-released: 2026-04-17 +version: 0.5.0 +date-released: 2026-04-29 url: "https://github.com/microsoft/bocpy" \ No newline at end of file diff --git a/examples/benchmark.py b/examples/benchmark.py index 615ec64..a227ea2 100644 --- a/examples/benchmark.py +++ b/examples/benchmark.py @@ -1,1118 +1,1282 @@ -"""Chain-ring microbenchmark for the BOC runtime. - -This benchmark measures *BOC runtime scaling* (scheduler, 2PL, message -queue, sub-interpreter crossings, return-cown allocation) in isolation -from any application-specific serial work. It is **not** a measure of -how well your own application will scale: real applications carry -serial costs (data structure construction, scheduling logic, -result drainage) that this benchmark deliberately eliminates. - -A few load-bearing caveats baked into the design: - -* Each behavior allocates a fresh return ``Cown`` (the auto-generated - one returned by ``@when``). At thousands of behaviors per second - this is a real, version-dependent constant in every sample. -* ``ChainState`` crosses the interpreter boundary via XIData on every - reschedule; for tiny payloads, marshaling can rival the useful work. -* The ``group-size`` sweep varies acquired-set cardinality and CPU work - together (the inner loop multiplies every window slot into - ``window[0]``, ``iters * group_size`` matrix multiplies per - behavior). It is not an isolated 2PL-cost knob. -""" - -import argparse -import json -import os -import socket -import statistics -import subprocess -import sys -import time -from dataclasses import asdict, dataclass, field -from datetime import datetime -from typing import Optional - -from bocpy import (Cown, Matrix, noticeboard, notice_write, receive, send, - start, wait, when) - -# Sentinels for the parent/child JSON protocol. Uppercase so the -# transpiler keeps them as module-level constants in the worker export. -SENTINEL_BEGIN = "---BOCPY-BENCH-BEGIN---" -SENTINEL_END = "---BOCPY-BENCH-END---" -SCHEMA_VERSION = 1 - - -# --------------------------------------------------------------------------- -# Behavior code (chain workload) -# --------------------------------------------------------------------------- - - -class ChainState: - """Per-chain mutable state carried inside a ``Cown[ChainState]``. - - Holds ints only. The chain's ring of ``Cown[Matrix]`` lives in the - noticeboard under ``f"ring_{ring_id}"`` so it is materialized once - per worker (and cached for the lifetime of ``NB_VERSION``) instead - of being marshaled through XIData on every reschedule. - """ - - def __init__(self, chain_id: int, ring_id: int, head_idx: int, - iters: int, stride: int, ring_size: int): - """Initialize a chain state. - - :param chain_id: A unique id within the workload. - :param ring_id: Index of the ring this chain runs on. Must - correspond to a ``f"ring_{ring_id}"`` entry already - written to the noticeboard. - :param head_idx: Initial head position on the ring. - :param iters: Inner-loop matrix multiplications per window slot. - :param stride: Step between successive windows. - :param ring_size: Number of cowns on the ring. 
- """ - self.chain_id = chain_id - self.ring_id = ring_id - self.head_idx = head_idx - self.count = 0 - self.iters = iters - self.stride = stride - self.ring_size = ring_size - - -def next_window(cs: "ChainState", group_size: int) -> list: - """Compute the next sliding window of cowns for a chain. - - Reads the chain's ring from the noticeboard. Must be called from - inside a behavior so that ``noticeboard()`` returns the cached - snapshot for the current ``NB_VERSION``. - - :param cs: The chain state. - :param group_size: Number of adjacent cowns in the window. - :return: ``list[Cown[Matrix]]`` for the next acquired set. - """ - ring = noticeboard()[f"ring_{cs.ring_id}"] - return [ring[(cs.head_idx + i * cs.stride) % cs.ring_size] - for i in range(group_size)] - - -def schedule_step(state_cown: Cown, window_list: list, group_size: int) -> None: - """Schedule one chain step with the given window. - - The static ``@when`` decorator inside this helper is rewritten by - the transpiler into a ``whencall`` invocation, so this function - works correctly when called from a worker sub-interpreter (where - the Python ``when`` decorator is not wired up). - - :param state_cown: The chain's state cown. - :param window_list: Adjacent cowns to acquire for this step. - :param group_size: Window size, captured into the behavior. - """ - @when(state_cown, window_list) - def _step(state, window): - cs = state.value - # When ``cr_null`` is set, skip the matmul loop entirely. The - # behavior still acquires its window of cowns, mutates - # ``ChainState``, and reschedules itself — so the measured - # throughput reflects pure BOC runtime overhead (2PL, queue - # ops, sub-interpreter crossings, return-cown allocation) - # with the application work removed. - if not noticeboard().get("cr_null", False): - # The inner loop's first slot multiplies window[0] by itself. - # Intentional — it keeps the per-behavior multiply count - # exactly `iters * group_size`. - for _ in range(cs.iters): - for c in window: - window[0].value = window[0].value @ c.value - - cs.count += 1 - cs.head_idx = (cs.head_idx + cs.stride) % cs.ring_size - if not noticeboard().get("cr_stop", False): - # Pass the already-acquired `state` cown wrapper directly - # rather than the closure-captured `state_cown` to keep the - # capture set minimal. - schedule_step(state, next_window(cs, group_size), group_size) - - -# --------------------------------------------------------------------------- -# Configuration and result types (plain data only; no Cowns) -# --------------------------------------------------------------------------- - - -@dataclass -class BenchConfig: - """Plain-data benchmark configuration. - - Holds only ints / floats / strings / lists of the same so that an - instance can stay live in ``main()``'s frame across ``wait()`` - without ``stop_workers`` finding any bare Cowns to acquire. 
- """ - - workers: int = 1 - duration: float = 5.0 - warmup: float = 1.0 - iters: int = 2000 - group_size: int = 2 - stride: int = 1 - rings: Optional[int] = None - chains_per_ring: Optional[int] = None - ring_size: int = 128 - payload_rows: int = 16 - payload_cols: int = 16 - repeats: int = 1 - null_payload: bool = False - - -@dataclass -class RepeatResult: - """Plain-data result for a single repeat of one sweep point.""" - - repeat_index: int - completed_behaviors: int - elapsed_s: float - throughput: float - wall_clock_ns_start: int - - -@dataclass -class PointResult: - """Plain-data result for a single sweep point.""" - - inputs: dict - repeats: list = field(default_factory=list) - throughput_mean: Optional[float] = None - throughput_stdev: Optional[float] = None - throughput_min: Optional[float] = None - throughput_max: Optional[float] = None - error: Optional[dict] = None - - -# --------------------------------------------------------------------------- -# Sizing / validation helpers (parent-side, no BOC required) -# --------------------------------------------------------------------------- - - -def derive_sizes(cfg: BenchConfig) -> BenchConfig: - """Auto-size ``rings`` and ``chains_per_ring`` if not overridden. - - :param cfg: An input config (mutated and returned). - :return: The same config with ``rings`` / ``chains_per_ring`` set. - """ - if cfg.chains_per_ring is None: - cfg.chains_per_ring = max( - 1, cfg.ring_size // (cfg.group_size * cfg.stride * 2)) - if cfg.rings is None: - cfg.rings = max(cfg.workers * 4 // cfg.chains_per_ring, - cfg.workers * 2) - return cfg - - -def validate_config(cfg: BenchConfig) -> Optional[str]: - """Validate a fully-derived config; return an error string or None. - - Hard errors only. Soft warnings (``duration < 1.0``, oversubscribed - workers) are emitted by the caller rather than failing here. - - :param cfg: A config with ``rings`` and ``chains_per_ring`` set. - :return: An error message, or ``None`` if the config is valid. - """ - if cfg.group_size * cfg.stride * 2 > cfg.ring_size: - return (f"group_size*stride*2 ({cfg.group_size}*{cfg.stride}*2) " - f"> ring_size ({cfg.ring_size}); chains would collide") - if cfg.workers < 1: - return f"workers must be >= 1, got {cfg.workers}" - if cfg.iters < 1: - return f"iters must be >= 1, got {cfg.iters}" - if cfg.payload_rows < 1 or cfg.payload_cols < 1: - return "payload dimensions must be >= 1" - if cfg.duration <= 0 or cfg.warmup < 0: - return "duration must be > 0 and warmup must be >= 0" - return None - - -def emit_soft_warnings(cfg: BenchConfig, cpu_count: int) -> None: - """Print soft warnings for unusual configs to stderr. - - :param cfg: The fully-derived config. - :param cpu_count: Detected CPU count for oversubscription check. - """ - if cfg.duration < 1.0: - print(f"warning: duration={cfg.duration}s is short; results will " - "be noisy", file=sys.stderr) - if cfg.workers > cpu_count: - print(f"warning: workers={cfg.workers} exceeds cpu_count=" - f"{cpu_count}; oversubscribed", file=sys.stderr) - - -# --------------------------------------------------------------------------- -# Workload construction -# --------------------------------------------------------------------------- - - -def build_workload(cfg: BenchConfig): - """Build per-ring cowns and per-chain state cowns. - - Each ring is published to the noticeboard under ``f"ring_{r}"``. 
- Workers read it back via ``noticeboard()`` inside ``_step``; the - noticeboard's per-worker version-cache means the ring is - materialized once per worker per ``NB_VERSION`` instead of being - marshaled through XIData on every reschedule. - - :param cfg: A fully-derived config. - :return: A ``(rings, state_cowns)`` tuple. ``rings`` is - ``list[list[Cown[Matrix]]]``; ``state_cowns`` is - ``list[Cown[ChainState]]``. Both containers are invisible to - ``stop_workers`` (it does not recurse into containers). - """ - rings = [] - state_cowns = [] - chain_id = 0 - for r in range(cfg.rings): - ring = [Cown(Matrix.uniform(0.0, 1.0, - (cfg.payload_rows, cfg.payload_cols))) - for _ in range(cfg.ring_size)] - rings.append(ring) - notice_write(f"ring_{r}", ring) - # Spread chains evenly across the ring so adjacent chains' - # initial windows don't overlap. - spacing = max(1, cfg.ring_size // cfg.chains_per_ring) - for k in range(cfg.chains_per_ring): - head = (k * spacing) % cfg.ring_size - cs = ChainState(chain_id=chain_id, ring_id=r, head_idx=head, - iters=cfg.iters, stride=cfg.stride, - ring_size=cfg.ring_size) - state_cowns.append(Cown(cs)) - chain_id += 1 - return rings, state_cowns - - -# --------------------------------------------------------------------------- -# Snapshot helpers (used by the measurement flow) -# --------------------------------------------------------------------------- - - -def schedule_snap(state_cowns: list) -> None: - """Schedule the final snapshot + publish behaviors. - - See the module docstring for the snap ordering invariant. This - helper is structured so that the bare ``snap`` and ``_publish`` - return-cown locals fall out of scope at its return boundary, - satisfying the no-bare-Cowns-in-main rule before ``wait()`` runs. - - :param state_cowns: Every chain's state cown. - """ - @when(state_cowns) - def snap(states): - return sum(s.value.count for s in states) - - notice_write("cr_stop", True) - - @when(snap) - def _publish(s): - send("snap", s.value) - - -def emit_chain_snapshot(state_cown: Cown, tag: str) -> None: - """Send a chain's ``(count, head_idx)`` over the queue under ``tag``. - - Used by tests that need to inspect chain progress directly. The - helper lives in this module so the ``@when`` decorator runs through - the transpiler that registered ``schedule_step``. - - :param state_cown: The chain's state cown. - :param tag: The tag to ``send`` the snapshot under. - """ - @when(state_cown) - def _emit(s): - send(tag, (s.value.count, s.value.head_idx)) - - -# --------------------------------------------------------------------------- -# Single-point measurement (in-process; one BOC start/wait cycle) -# --------------------------------------------------------------------------- - - -def run_single_point_body(cfg: BenchConfig, repeat_index: int) -> RepeatResult: - """Run one measurement in a fresh BOC runtime; return plain data. - - :param cfg: The fully-derived config. - :param repeat_index: Index of this repeat for reporting. - :return: A ``RepeatResult`` with no Cown references. - """ - # Start the runtime first: ``build_workload`` writes rings to the - # noticeboard, and noticeboard writes require the runtime to be - # running. - start(worker_count=cfg.workers) - rings, state_cowns = build_workload(cfg) - # Publish the null-payload toggle so worker behaviors can read it - # from their per-behavior noticeboard snapshot. Written before the - # warmup sleep so the noticeboard thread has flushed it well - # before t_measure_start. 
- notice_write("cr_null", cfg.null_payload) - payload_bytes = cfg.payload_rows * cfg.payload_cols * 8 - total_bytes = cfg.rings * cfg.ring_size * payload_bytes - print(f"workload: rings={cfg.rings} ring_size={cfg.ring_size} " - f"chains={cfg.rings * cfg.chains_per_ring} " - f"payload={cfg.payload_rows}x{cfg.payload_cols} " - f"(~{total_bytes / 1024:.1f} KiB matrix data)", - file=sys.stderr) - - try: - # Kick off one chain per (ring, chain-slot) pair. Recompute the - # head positions exactly the way `build_workload` chose them: - # we cannot read `cs_cown.value` from the main thread because - # Cowns are released to the runtime on construction. - spacing = max(1, cfg.ring_size // cfg.chains_per_ring) - chain_idx = 0 - for r in range(cfg.rings): - for k in range(cfg.chains_per_ring): - cs_cown = state_cowns[chain_idx] - head = (k * spacing) % cfg.ring_size - window = [rings[r][(head + i * cfg.stride) % cfg.ring_size] - for i in range(cfg.group_size)] - schedule_step(cs_cown, window, cfg.group_size) - chain_idx += 1 - - time.sleep(cfg.warmup) - wall_clock_ns_start = time.time_ns() - t_measure_start = time.perf_counter() - time.sleep(cfg.duration) - - schedule_snap(state_cowns) - msg = receive(["snap"], 60.0 + cfg.duration) - t_snap_received = time.perf_counter() - if msg is None or msg[0] != "snap": - raise RuntimeError("snap behavior did not publish in time") - _, total = msg - elapsed_s = t_snap_received - t_measure_start - finally: - # Drop bare-Cown locals before wait(). - del rings - del state_cowns - wait() - - throughput = total / elapsed_s if elapsed_s > 0 else 0.0 - return RepeatResult(repeat_index=repeat_index, - completed_behaviors=int(total), - elapsed_s=elapsed_s, - throughput=throughput, - wall_clock_ns_start=wall_clock_ns_start) - - -# --------------------------------------------------------------------------- -# Subprocess orchestration -# --------------------------------------------------------------------------- - - -def cfg_to_argv(cfg: BenchConfig) -> list: - """Render a ``BenchConfig`` as CLI args for a child invocation. - - :param cfg: The config to serialize. - :return: A list of CLI arguments suitable for child invocation. - """ - args = [ - "--workers", str(cfg.workers), - "--duration", str(cfg.duration), - "--warmup", str(cfg.warmup), - "--iters", str(cfg.iters), - "--group-size", str(cfg.group_size), - "--stride", str(cfg.stride), - "--ring-size", str(cfg.ring_size), - "--payload-rows", str(cfg.payload_rows), - "--payload-cols", str(cfg.payload_cols), - "--repeats", "1", - "--sweep-axis", "none", - ] - if cfg.rings is not None: - args += ["--rings", str(cfg.rings)] - if cfg.chains_per_ring is not None: - args += ["--chains-per-ring", str(cfg.chains_per_ring)] - if cfg.null_payload: - args += ["--null-payload"] - return args - - -def run_in_subprocess(cfg: BenchConfig, repeat_index: int, - git_sha: Optional[str]) -> RepeatResult: - """Run one repeat in a fresh subprocess and return its result. - - On non-zero exit / timeout / missing sentinel, raises - ``RuntimeError`` with a stderr-tail diagnostic so the caller can - record an ``error`` entry on the point. - - :param cfg: A fully-derived config with ``repeats`` ignored. - :param repeat_index: Index into the parent's ``repeats[]`` list. - :param git_sha: Optional git sha to forward to the child. 
- """ - env = dict(os.environ) - if git_sha is not None: - env["BOCPY_BENCH_GIT_SHA"] = git_sha - - cmd = [sys.executable, "-m", "bocpy.examples.benchmark", - "--json-stdout"] + cfg_to_argv(cfg) - timeout = max(cfg.duration * 3 + 30, cfg.duration + cfg.warmup + 60) - try: - proc = subprocess.run(cmd, env=env, capture_output=True, - text=True, timeout=timeout, check=False) - except subprocess.TimeoutExpired as ex: - raise RuntimeError( - f"subprocess timed out after {timeout}s; " - f"stderr tail: {(ex.stderr or '')[-400:]!r}") - - if proc.returncode != 0: - raise RuntimeError( - f"subprocess exited {proc.returncode}; " - f"stderr tail: {proc.stderr[-400:]!r}") - - payload = _extract_sentinel_payload(proc.stdout) - if payload is None: - raise RuntimeError( - "child produced no sentinel-framed JSON; " - f"stderr tail: {proc.stderr[-400:]!r}") - - return RepeatResult( - repeat_index=repeat_index, - completed_behaviors=int(payload["completed_behaviors"]), - elapsed_s=float(payload["elapsed_s"]), - throughput=float(payload["throughput"]), - wall_clock_ns_start=int(payload["wall_clock_ns_start"])) - - -def _extract_sentinel_payload(stdout: str) -> Optional[dict]: - """Find and parse exactly one sentinel-framed JSON object. - - :param stdout: The captured child stdout. - :return: The parsed payload, or ``None`` if no valid frame. - """ - begin = stdout.find(SENTINEL_BEGIN) - end = stdout.find(SENTINEL_END) - if begin < 0 or end < 0 or end < begin: - return None - inner = stdout[begin + len(SENTINEL_BEGIN):end].strip() - try: - return json.loads(inner) - except json.JSONDecodeError: - return None - - -# --------------------------------------------------------------------------- -# Sweep orchestration (parent side) -# --------------------------------------------------------------------------- - - -def cfg_for_axis(base: BenchConfig, axis: str, value) -> BenchConfig: - """Clone ``base`` with one axis varied to ``value``. - - :param base: The base config. - :param axis: One of ``workers``, ``iters``, ``group-size``, - ``payload``, ``none``. - :param value: The axis value (an ``int`` for most axes; a - ``(rows, cols)`` tuple for ``payload``). - :return: A fresh ``BenchConfig`` with that axis applied. - """ - cfg = BenchConfig(**asdict(base)) - # Reset auto-sized fields so each point recomputes. - cfg.rings = base.rings - cfg.chains_per_ring = base.chains_per_ring - if axis == "workers": - cfg.workers = int(value) - cfg.rings = None - cfg.chains_per_ring = None - elif axis == "iters": - cfg.iters = int(value) - elif axis == "group-size": - cfg.group_size = int(value) - cfg.chains_per_ring = None - cfg.rings = None - elif axis == "payload": - cfg.payload_rows, cfg.payload_cols = value - elif axis == "none": - pass - else: - raise ValueError(f"unknown axis: {axis}") - return derive_sizes(cfg) - - -def summarize_repeats(reps: list) -> dict: - """Compute mean/stdev/min/max across repeats with the null-stdev rule. - - With fewer than 2 repeats, ``stdev`` / ``min`` / ``max`` are - emitted as JSON null rather than zero, to avoid false zero-height - error bars in downstream plots. - - :param reps: A list of ``RepeatResult``. - :return: A dict with mean, stdev, min, max. 
- """ - if not reps: - return {"mean": None, "stdev": None, "min": None, "max": None} - throughputs = [r.throughput for r in reps] - if len(throughputs) < 2: - return {"mean": throughputs[0], "stdev": None, - "min": None, "max": None} - return { - "mean": statistics.fmean(throughputs), - "stdev": statistics.stdev(throughputs), - "min": min(throughputs), - "max": max(throughputs), - } - - -def run_sweep(axis: str, values: list, base: BenchConfig, - git_sha: Optional[str], output_path: str, - metadata: dict) -> dict: - """Run a sweep, flushing JSON to disk after every point. - - :param axis: Sweep axis name. - :param values: Per-axis values in order. - :param base: Base configuration. - :param git_sha: Optional git sha to forward to children. - :param output_path: Destination JSON file. - :param metadata: Initial metadata dict (will be updated with - ``finished_at`` at end). - :return: The final results dict (also written to disk). - """ - points = [] - fixed = asdict(base) - fixed.pop("workers", None) if axis == "workers" else None - rendered_values = [list(v) if isinstance(v, tuple) else v for v in values] - sweep_meta = {"axis": axis, "values": rendered_values, "fixed": fixed} - - interrupted = False - for value in values: - cfg = cfg_for_axis(base, axis, value) - err = validate_config(cfg) - inputs = asdict(cfg) - if err is not None: - point = PointResult(inputs=inputs, - error={"message": err, "stderr_tail": ""}) - points.append(asdict(point)) - print(f"point {axis}={value}: validation error: {err}", - file=sys.stderr) - _flush_results(output_path, metadata, sweep_meta, points) - continue - - repeats: list = [] - try: - for r in range(base.repeats): - print(f"point {axis}={value} repeat {r + 1}/{base.repeats}: " - "spawning child...", file=sys.stderr) - try: - rep = run_in_subprocess(cfg, r, git_sha) - repeats.append(rep) - print(f" -> {rep.throughput:.1f} behaviors/s " - f"({rep.completed_behaviors} in " - f"{rep.elapsed_s:.2f}s)", file=sys.stderr) - except RuntimeError as ex: - point = PointResult( - inputs=inputs, - repeats=[asdict(r) for r in repeats], - error={"message": str(ex), "stderr_tail": ""}) - points.append(asdict(point)) - _flush_results(output_path, metadata, sweep_meta, points) - repeats = None # marker: already appended - break - except KeyboardInterrupt: - interrupted = True - metadata["interrupted"] = True - if repeats: - point = PointResult( - inputs=inputs, - repeats=[asdict(r) for r in repeats], - error={"message": "interrupted", "stderr_tail": ""}) - points.append(asdict(point)) - _flush_results(output_path, metadata, sweep_meta, points) - break - - if repeats is None: - continue - - summary = summarize_repeats(repeats) - point = PointResult( - inputs=inputs, - repeats=[asdict(r) for r in repeats], - throughput_mean=summary["mean"], - throughput_stdev=summary["stdev"], - throughput_min=summary["min"], - throughput_max=summary["max"]) - points.append(asdict(point)) - _flush_results(output_path, metadata, sweep_meta, points) - - metadata["finished_at"] = datetime.now().isoformat(timespec="seconds") - metadata["interrupted"] = interrupted or metadata.get("interrupted", False) - final = _flush_results(output_path, metadata, sweep_meta, points) - return final - - -def _flush_results(path: str, metadata: dict, sweep_meta: dict, - points: list) -> dict: - """Atomic write of the results JSON; falls back to in-place on Windows. - - :param path: Destination file path. - :param metadata: Top-level metadata dict. - :param sweep_meta: Sweep description dict. 
- :param points: List of point dicts. - :return: The full results document that was written. - """ - document = { - "schema_version": SCHEMA_VERSION, - "metadata": metadata, - "sweep": sweep_meta, - "points": points, - } - serialized = json.dumps(document, indent=2, default=_json_default) - os.makedirs(os.path.dirname(os.path.abspath(path)) or ".", exist_ok=True) - tmp = path + ".tmp" - with open(tmp, "w", encoding="utf-8") as f: - f.write(serialized) - delays = (0.05, 0.1, 0.2) - for attempt, delay in enumerate(delays): - try: - os.replace(tmp, path) - return document - except PermissionError: - if attempt == len(delays) - 1: - print(f"warning: atomic rename failed after {len(delays)} " - "attempts; falling back to in-place overwrite", - file=sys.stderr) - with open(path, "w", encoding="utf-8") as f: - f.write(serialized) - try: - os.unlink(tmp) - except OSError: - pass - return document - time.sleep(delay) - return document - - -def _json_default(obj): - """Coerce non-JSON-native objects (e.g. tuples) for serialization. - - :param obj: An object json.dumps could not serialize natively. - :return: A JSON-serializable representation. - """ - if isinstance(obj, (set, frozenset)): - return list(obj) - raise TypeError(f"object of type {type(obj).__name__} is not " - "JSON-serializable") - - -# --------------------------------------------------------------------------- -# Metadata -# --------------------------------------------------------------------------- - - -def collect_metadata(argv: list, git_sha: Optional[str]) -> dict: - """Collect metadata for the top of the results JSON. - - :param argv: The parent's ``sys.argv``. - :param git_sha: The git sha (or None). - :return: A metadata dict. - """ - try: - bocpy_version = _read_bocpy_version() - except Exception: - bocpy_version = None - - free_threaded = bool(getattr(sys, "_is_gil_enabled", - lambda: True)() is False) - return { - "hostname": socket.gethostname(), - "platform": sys.platform, - "cpu_count": os.cpu_count() or 0, - "python_version": sys.version.split()[0], - "python_implementation": sys.implementation.name, - "free_threaded": free_threaded, - "bocpy_version": bocpy_version, - "git_sha": git_sha, - "started_at": datetime.now().isoformat(timespec="seconds"), - "finished_at": None, - "argv": list(argv), - "interrupted": False, - } - - -def _read_bocpy_version() -> Optional[str]: - """Best-effort read of bocpy's version from importlib.metadata. - - :return: Version string or None on failure. - """ - try: - from importlib.metadata import version - return version("bocpy") - except Exception: - return None - - -def _git_sha() -> Optional[str]: - """Read git sha if available; cheap-and-fail-quietly. - - :return: A 12-char abbreviated sha, or None. - """ - cached = os.environ.get("BOCPY_BENCH_GIT_SHA") - if cached: - return cached - try: - out = subprocess.run( - ["git", "rev-parse", "--short=12", "HEAD"], - capture_output=True, text=True, timeout=5, check=False) - if out.returncode == 0: - return out.stdout.strip() or None - except (FileNotFoundError, subprocess.TimeoutExpired): - pass - return None - - -# --------------------------------------------------------------------------- -# ASCII table renderer -# --------------------------------------------------------------------------- - - -def render_table(document: dict) -> str: - """Render a compact ASCII summary table from a results document. - - :param document: A loaded results JSON. - :return: A multi-line string ready to print. 
- """ - axis = document["sweep"]["axis"] - points = document["points"] - interrupted = document.get("metadata", {}).get("interrupted", False) - - lines = [] - show_speedup = axis == "workers" - baseline = None - if show_speedup and points: - first = points[0] - if interrupted or first.get("error") is not None \ - or first.get("throughput_mean") is None: - show_speedup = False - lines.append("note: speedup/efficiency suppressed (baseline " - "missing, errored, or interrupted run)") - else: - baseline = first["throughput_mean"] - - headers = [axis, "throughput", "stdev"] - if show_speedup: - headers += ["speedup", "efficiency"] - rows = [] - for pt in points: - if pt.get("error") is not None: - row = [_axis_label(axis, pt), "ERROR", "-"] - if show_speedup: - row += ["-", "-"] - rows.append(row) - continue - mean = pt.get("throughput_mean") - stdev = pt.get("throughput_stdev") - row = [ - _axis_label(axis, pt), - f"{mean:.1f}" if mean is not None else "-", - f"{stdev:.1f}" if stdev is not None else "-", - ] - if show_speedup: - speedup = (mean / baseline) if mean and baseline else None - workers = pt["inputs"]["workers"] - efficiency = (speedup / workers) if speedup and workers else None - row += [ - f"{speedup:.2f}x" if speedup is not None else "-", - f"{efficiency:.0%}" if efficiency is not None else "-", - ] - rows.append(row) - - widths = [max(len(h), max((len(r[i]) for r in rows), default=0)) - for i, h in enumerate(headers)] - sep = "-+-".join("-" * w for w in widths) - lines.append(" | ".join(h.ljust(widths[i]) for i, h in enumerate(headers))) - lines.append(sep) - for r in rows: - lines.append(" | ".join(r[i].ljust(widths[i]) for i in range(len(r)))) - return "\n".join(lines) - - -def _axis_label(axis: str, pt: dict) -> str: - """Render the axis cell value for a point row. - - :param axis: Sweep axis name. - :param pt: A point dict. - :return: A string for the axis column. - """ - inputs = pt.get("inputs", {}) - if axis == "workers": - return str(inputs.get("workers")) - if axis == "iters": - return str(inputs.get("iters")) - if axis == "group-size": - return str(inputs.get("group_size")) - if axis == "payload": - return f"{inputs.get('payload_rows')}x{inputs.get('payload_cols')}" - return "-" - - -# --------------------------------------------------------------------------- -# CLI -# --------------------------------------------------------------------------- - - -def parse_payload_token(token: str) -> tuple: - """Parse a payload token of the form ``"x"``. - - :param token: The CLI token. - :return: A ``(rows, cols)`` tuple. - """ - if "x" not in token: - raise argparse.ArgumentTypeError( - f"payload value {token!r} must look like 'x'") - rs, cs = token.split("x", 1) - try: - rows, cols = int(rs), int(cs) - except ValueError: - raise argparse.ArgumentTypeError( - f"payload value {token!r}: rows and cols must be integers") - if rows < 1 or cols < 1: - raise argparse.ArgumentTypeError( - f"payload value {token!r}: rows and cols must be >= 1") - return (rows, cols) - - -def parse_sweep_values(axis: str, raw: Optional[str]) -> list: - """Parse ``--sweep-values`` per-axis at argparse time. - - :param axis: The sweep axis. - :param raw: The raw CSV string, or None. - :return: A list of values appropriate for the axis. 
- """ - if axis == "none": - if raw: - raise argparse.ArgumentTypeError( - "--sweep-values must be empty when --sweep-axis is 'none'") - return [None] - if raw is None: - return _default_sweep_values(axis) - tokens = [t.strip() for t in raw.split(",") if t.strip()] - if not tokens: - return _default_sweep_values(axis) - if axis in ("workers", "iters", "group-size"): - out = [] - for t in tokens: - try: - out.append(int(t)) - except ValueError: - raise argparse.ArgumentTypeError( - f"--sweep-values: token {t!r} is not an integer " - f"(axis={axis})") - return out - if axis == "payload": - return [parse_payload_token(t) for t in tokens] - raise argparse.ArgumentTypeError(f"unknown axis: {axis}") - - -def _default_sweep_values(axis: str) -> list: - """Return the documented default sweep values for an axis. - - :param axis: The sweep axis name. - :return: A list of default values. - """ - cpu = os.cpu_count() or 1 - if axis == "workers": - return sorted(set([1, 2, 4, 8, min(16, cpu)])) - if axis == "iters": - return [250, 500, 1000, 2000, 4000, 8000] - if axis == "group-size": - return [1, 2, 4, 8] - if axis == "payload": - return [(4, 4), (8, 8), (16, 16), (32, 32), (64, 64)] - return [] - - -def build_arg_parser() -> argparse.ArgumentParser: - """Build the CLI argument parser. - - :return: A configured ``argparse.ArgumentParser``. - """ - p = argparse.ArgumentParser( - prog="bocpy.examples.benchmark", - description="Microbenchmark for the BOC runtime.") - p.add_argument("--workers", type=int, default=None) - p.add_argument("--sweep-axis", - choices=("workers", "iters", "group-size", "payload", - "none"), - default="workers") - p.add_argument("--sweep-values", default=None) - p.add_argument("--duration", type=float, default=5.0) - p.add_argument("--warmup", type=float, default=None) - p.add_argument("--iters", type=int, default=2000) - p.add_argument("--group-size", type=int, default=2, dest="group_size") - p.add_argument("--stride", type=int, default=1) - p.add_argument("--rings", type=int, default=None) - p.add_argument("--chains-per-ring", type=int, default=None, - dest="chains_per_ring") - p.add_argument("--ring-size", type=int, default=128, dest="ring_size") - p.add_argument("--payload-rows", type=int, default=16, - dest="payload_rows") - p.add_argument("--payload-cols", type=int, default=16, - dest="payload_cols") - p.add_argument("--repeats", type=int, default=1) - p.add_argument("--null-payload", dest="null_payload", - action="store_true", default=False, - help="Skip the matmul inner loop in each behavior. " - "Throughput then reflects pure BOC runtime " - "overhead with the application work removed.") - p.add_argument("--output", default=None) - p.add_argument("--table", dest="table", action="store_true", default=None) - p.add_argument("--no-table", dest="table", action="store_false") - p.add_argument("--quiet", action="store_true") - p.add_argument("--json-stdout", action="store_true", - help="Run a single point and print sentinel-framed " - "JSON to stdout (subprocess internal).") - p.add_argument("--print-table", default=None, - help="Print a table from an existing JSON file and exit.") - return p - - -def args_to_base_cfg(args) -> BenchConfig: - """Build a base ``BenchConfig`` from parsed CLI args. - - :param args: The parsed argparse namespace. - :return: A ``BenchConfig`` (not yet derived). 
- """ - workers = args.workers if args.workers is not None else 1 - warmup = args.warmup - if warmup is None: - warmup = min(1.0, args.duration * 0.1) - return BenchConfig( - workers=workers, - duration=args.duration, - warmup=warmup, - iters=args.iters, - group_size=args.group_size, - stride=args.stride, - rings=args.rings, - chains_per_ring=args.chains_per_ring, - ring_size=args.ring_size, - payload_rows=args.payload_rows, - payload_cols=args.payload_cols, - repeats=args.repeats, - null_payload=args.null_payload, - ) - - -def child_main(args) -> int: - """Run a single point and emit a sentinel-framed JSON object. - - Used by ``run_in_subprocess``. The child does **not** run the - cross-worker validation gate — that runs once in the parent before - any sweep child is spawned. - - :param args: The parsed argparse namespace. - :return: Process exit code. - """ - cfg = derive_sizes(args_to_base_cfg(args)) - err = validate_config(cfg) - if err is not None: - print(f"benchmark: invalid config: {err}", file=sys.stderr) - return 2 - emit_soft_warnings(cfg, os.cpu_count() or 1) - rep = run_single_point_body(cfg, repeat_index=0) - payload = { - "inputs": asdict(cfg), - "completed_behaviors": rep.completed_behaviors, - "elapsed_s": rep.elapsed_s, - "throughput": rep.throughput, - "wall_clock_ns_start": rep.wall_clock_ns_start, - } - sys.stdout.write("\n" + SENTINEL_BEGIN + "\n") - sys.stdout.write(json.dumps(payload, default=_json_default)) - sys.stdout.write("\n" + SENTINEL_END + "\n") - sys.stdout.flush() - return 0 - - -def parent_main(args) -> int: - """Run a sweep across the requested axis. - - :param args: The parsed argparse namespace. - :return: Process exit code. - """ - base = args_to_base_cfg(args) - try: - sweep_values = parse_sweep_values(args.sweep_axis, args.sweep_values) - except argparse.ArgumentTypeError as ex: - print(f"benchmark: {ex}", file=sys.stderr) - return 2 - - # Pre-spawn validation across every sweep point. - cpu = os.cpu_count() or 1 - derived_points = [] - for value in sweep_values: - cfg = cfg_for_axis(base, args.sweep_axis, value) - err = validate_config(cfg) - if err is not None: - print(f"benchmark: sweep point {args.sweep_axis}={value} " - f"invalid: {err}", file=sys.stderr) - return 2 - emit_soft_warnings(cfg, cpu) - derived_points.append(cfg) - - git_sha = _git_sha() - - # Wall-clock estimate for sweep duration. - startup_slack = 5.0 - est = sum((cfg.duration + cfg.warmup + startup_slack) * base.repeats - for cfg in derived_points) - print(f"sweep estimate: {len(derived_points)} points " - f"x {base.repeats} repeats ~ {est:.0f}s wall clock", - file=sys.stderr) - - output_path = args.output or _default_output_path() - metadata = collect_metadata(sys.argv, git_sha) - document = run_sweep(args.sweep_axis, sweep_values, base, - git_sha, output_path, metadata) - - if args.table is None: - show_table = sys.stdout.isatty() - else: - show_table = args.table - if show_table and not args.quiet: - print(render_table(document)) - if not args.quiet: - print(f"results: {output_path}", file=sys.stderr) - return 0 - - -def _default_output_path() -> str: - """Compute the default output path under ``results/``. - - Uses ``%Y%m%dT%H%M%S`` rather than ``isoformat()`` so the filename - is valid on Windows (no colons). - - :return: A path string. - """ - ts = datetime.now().strftime("%Y%m%dT%H%M%S") - host = socket.gethostname().replace(os.sep, "_") - return os.path.join("results", f"benchmark-{host}-{ts}.json") - - -def main() -> int: - """CLI entry point. 
- - :return: Process exit code. - """ - if sys.version_info < (3, 12): - sys.exit("bocpy benchmarks require Python 3.12+ for " - "sub-interpreter parallelism") - - parser = build_arg_parser() - args = parser.parse_args() - - if args.print_table is not None: - with open(args.print_table, encoding="utf-8") as f: - document = json.load(f) - print(render_table(document)) - return 0 - - if args.json_stdout: - return child_main(args) - - return parent_main(args) - - -if __name__ == "__main__": - sys.exit(main()) +"""Chain-ring microbenchmark for the BOC runtime. + +This benchmark measures *BOC runtime scaling* (scheduler, 2PL, message +queue, sub-interpreter crossings, return-cown allocation) in isolation +from any application-specific serial work. It is **not** a measure of +how well your own application will scale: real applications carry +serial costs (data structure construction, scheduling logic, +result drainage) that this benchmark deliberately eliminates. + +A few load-bearing caveats baked into the design: + +* Each behavior allocates a fresh return ``Cown`` (the auto-generated + one returned by ``@when``). At thousands of behaviors per second + this is a real, version-dependent constant in every sample. +* ``ChainState`` crosses the interpreter boundary via XIData on every + reschedule; for tiny payloads, marshaling can rival the useful work. +* The ``group-size`` sweep varies acquired-set cardinality and CPU work + together (the inner loop multiplies every window slot into + ``window[0]``, ``iters * group_size`` matrix multiplies per + behavior). It is not an isolated 2PL-cost knob. +""" + +import argparse +import json +import os +import socket +import statistics +import subprocess +import sys +import time +from dataclasses import asdict, dataclass, field +from datetime import datetime +from typing import Optional + +from bocpy import (Cown, Matrix, noticeboard, notice_write, receive, send, + start, wait, when) + +# Sentinels for the parent/child JSON protocol. Uppercase so the +# transpiler keeps them as module-level constants in the worker export. +SENTINEL_BEGIN = "---BOCPY-BENCH-BEGIN---" +SENTINEL_END = "---BOCPY-BENCH-END---" +SCHEMA_VERSION = 1 + + +# --------------------------------------------------------------------------- +# Behavior code (chain workload) +# --------------------------------------------------------------------------- + + +class ChainState: + """Per-chain mutable state carried inside a ``Cown[ChainState]``. + + Holds ints only. The chain's ring of ``Cown[Matrix]`` lives in the + noticeboard under ``f"ring_{ring_id}"`` so it is materialized once + per worker (and cached for the lifetime of ``NB_VERSION``) instead + of being marshaled through XIData on every reschedule. + """ + + def __init__(self, chain_id: int, ring_id: int, head_idx: int, + iters: int, stride: int, ring_size: int): + """Initialize a chain state. + + :param chain_id: A unique id within the workload. + :param ring_id: Index of the ring this chain runs on. Must + correspond to a ``f"ring_{ring_id}"`` entry already + written to the noticeboard. + :param head_idx: Initial head position on the ring. + :param iters: Inner-loop matrix multiplications per window slot. + :param stride: Step between successive windows. + :param ring_size: Number of cowns on the ring. 
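+
+        Construction sketch (values illustrative; real instances are
+        built by ``build_workload``)::
+
+            cs = ChainState(chain_id=0, ring_id=0, head_idx=0,
+                            iters=2000, stride=1, ring_size=128)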
+ """ + self.chain_id = chain_id + self.ring_id = ring_id + self.head_idx = head_idx + self.count = 0 + self.iters = iters + self.stride = stride + self.ring_size = ring_size + + +def next_window(cs: "ChainState", group_size: int) -> list: + """Compute the next sliding window of cowns for a chain. + + Reads the chain's ring from the noticeboard. Must be called from + inside a behavior so that ``noticeboard()`` returns the cached + snapshot for the current ``NB_VERSION``. + + :param cs: The chain state. + :param group_size: Number of adjacent cowns in the window. + :return: ``list[Cown[Matrix]]`` for the next acquired set. + """ + ring = noticeboard()[f"ring_{cs.ring_id}"] + return [ring[(cs.head_idx + i * cs.stride) % cs.ring_size] + for i in range(group_size)] + + +def schedule_step(state_cown: Cown, window_list: list, group_size: int) -> None: + """Schedule one chain step with the given window. + + The static ``@when`` decorator inside this helper is rewritten by + the transpiler into a ``whencall`` invocation, so this function + works correctly when called from a worker sub-interpreter (where + the Python ``when`` decorator is not wired up). + + :param state_cown: The chain's state cown. + :param window_list: Adjacent cowns to acquire for this step. + :param group_size: Window size, captured into the behavior. + """ + @when(state_cown, window_list) + def _step(state, window): + cs = state.value + # When ``cr_null`` is set, skip the matmul loop entirely. The + # behavior still acquires its window of cowns, mutates + # ``ChainState``, and reschedules itself — so the measured + # throughput reflects pure BOC runtime overhead (2PL, queue + # ops, sub-interpreter crossings, return-cown allocation) + # with the application work removed. + if not noticeboard().get("cr_null", False): + # The inner loop's first slot multiplies window[0] by itself. + # Intentional — it keeps the per-behavior multiply count + # exactly `iters * group_size`. + for _ in range(cs.iters): + for c in window: + window[0].value = window[0].value @ c.value + + cs.count += 1 + cs.head_idx = (cs.head_idx + cs.stride) % cs.ring_size + if not noticeboard().get("cr_stop", False): + # Pass the already-acquired `state` cown wrapper directly + # rather than the closure-captured `state_cown` to keep the + # capture set minimal. + schedule_step(state, next_window(cs, group_size), group_size) + + +# --------------------------------------------------------------------------- +# Configuration and result types (plain data only; no Cowns) +# --------------------------------------------------------------------------- + + +@dataclass +class BenchConfig: + """Plain-data benchmark configuration. + + Holds only ints / floats / strings / lists of the same so that an + instance can stay live in ``main()``'s frame across ``wait()`` + without ``stop_workers`` finding any bare Cowns to acquire. 
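+
+    Typically produced by ``args_to_base_cfg`` and then passed through
+    ``derive_sizes``; a direct construction sketch (values
+    illustrative, remaining fields defaulted)::
+
+        cfg = derive_sizes(BenchConfig(workers=4, duration=5.0))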
+ """ + + workers: int = 1 + duration: float = 5.0 + warmup: float = 1.0 + iters: int = 2000 + group_size: int = 2 + stride: int = 1 + rings: Optional[int] = None + chains_per_ring: Optional[int] = None + ring_size: int = 128 + payload_rows: int = 16 + payload_cols: int = 16 + repeats: int = 1 + null_payload: bool = False + + +@dataclass +class RepeatResult: + """Plain-data result for a single repeat of one sweep point.""" + + repeat_index: int + completed_behaviors: int + elapsed_s: float + throughput: float + wall_clock_ns_start: int + scheduler_stats: Optional[list] = None + queue_stats: Optional[list] = None + # ``derived`` holds the post-processed metrics computed from the + # per-window scheduler-stats delta (see + # ``compute_derived_metrics``). + derived: Optional[dict] = None + + +@dataclass +class PointResult: + """Plain-data result for a single sweep point.""" + + inputs: dict + repeats: list = field(default_factory=list) + throughput_mean: Optional[float] = None + throughput_stdev: Optional[float] = None + throughput_min: Optional[float] = None + throughput_max: Optional[float] = None + error: Optional[dict] = None + + +# --------------------------------------------------------------------------- +# Sizing / validation helpers (parent-side, no BOC required) +# --------------------------------------------------------------------------- + + +def derive_sizes(cfg: BenchConfig) -> BenchConfig: + """Auto-size ``rings`` and ``chains_per_ring`` if not overridden. + + :param cfg: An input config (mutated and returned). + :return: The same config with ``rings`` / ``chains_per_ring`` set. + """ + if cfg.chains_per_ring is None: + cfg.chains_per_ring = max( + 1, cfg.ring_size // (cfg.group_size * cfg.stride * 2)) + if cfg.rings is None: + cfg.rings = max(cfg.workers * 4 // cfg.chains_per_ring, + cfg.workers * 2) + return cfg + + +def validate_config(cfg: BenchConfig) -> Optional[str]: + """Validate a fully-derived config; return an error string or None. + + Hard errors only. Soft warnings (``duration < 1.0``, oversubscribed + workers) are emitted by the caller rather than failing here. + + :param cfg: A config with ``rings`` and ``chains_per_ring`` set. + :return: An error message, or ``None`` if the config is valid. + """ + if cfg.group_size * cfg.stride * 2 > cfg.ring_size: + return (f"group_size*stride*2 ({cfg.group_size}*{cfg.stride}*2) " + f"> ring_size ({cfg.ring_size}); chains would collide") + if cfg.workers < 1: + return f"workers must be >= 1, got {cfg.workers}" + if cfg.iters < 1: + return f"iters must be >= 1, got {cfg.iters}" + if cfg.payload_rows < 1 or cfg.payload_cols < 1: + return "payload dimensions must be >= 1" + if cfg.duration <= 0 or cfg.warmup < 0: + return "duration must be > 0 and warmup must be >= 0" + return None + + +def emit_soft_warnings(cfg: BenchConfig, cpu_count: int) -> None: + """Print soft warnings for unusual configs to stderr. + + :param cfg: The fully-derived config. + :param cpu_count: Detected CPU count for oversubscription check. 
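+
+    For example, ``duration=0.5`` with ``workers=16`` on an 8-core
+    host emits both warnings.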
+ """ + if cfg.duration < 1.0: + print(f"warning: duration={cfg.duration}s is short; results will " + "be noisy", file=sys.stderr) + if cfg.workers > cpu_count: + print(f"warning: workers={cfg.workers} exceeds cpu_count=" + f"{cpu_count}; oversubscribed", file=sys.stderr) + + +# --------------------------------------------------------------------------- +# Workload construction +# --------------------------------------------------------------------------- + + +def build_workload(cfg: BenchConfig): + """Build per-ring cowns and per-chain state cowns. + + Each ring is published to the noticeboard under ``f"ring_{r}"``. + Workers read it back via ``noticeboard()`` inside ``_step``; the + noticeboard's per-worker version-cache means the ring is + materialized once per worker per ``NB_VERSION`` instead of being + marshaled through XIData on every reschedule. + + :param cfg: A fully-derived config. + :return: A ``(rings, state_cowns)`` tuple. ``rings`` is + ``list[list[Cown[Matrix]]]``; ``state_cowns`` is + ``list[Cown[ChainState]]``. Both containers are invisible to + ``stop_workers`` (it does not recurse into containers). + """ + rings = [] + state_cowns = [] + chain_id = 0 + for r in range(cfg.rings): + ring = [Cown(Matrix.uniform(0.0, 1.0, + (cfg.payload_rows, cfg.payload_cols))) + for _ in range(cfg.ring_size)] + rings.append(ring) + notice_write(f"ring_{r}", ring) + # Spread chains evenly across the ring so adjacent chains' + # initial windows don't overlap. + spacing = max(1, cfg.ring_size // cfg.chains_per_ring) + for k in range(cfg.chains_per_ring): + head = (k * spacing) % cfg.ring_size + cs = ChainState(chain_id=chain_id, ring_id=r, head_idx=head, + iters=cfg.iters, stride=cfg.stride, + ring_size=cfg.ring_size) + state_cowns.append(Cown(cs)) + chain_id += 1 + return rings, state_cowns + + +# --------------------------------------------------------------------------- +# Snapshot helpers (used by the measurement flow) +# --------------------------------------------------------------------------- + + +def schedule_snap(state_cowns: list) -> None: + """Schedule the final snapshot + publish behaviors. + + See the module docstring for the snap ordering invariant. This + helper is structured so that the bare ``snap`` and ``_publish`` + return-cown locals fall out of scope at its return boundary, + satisfying the no-bare-Cowns-in-main rule before ``wait()`` runs. + + :param state_cowns: Every chain's state cown. + """ + @when(state_cowns) + def snap(states): + return sum(s.value.count for s in states) + + notice_write("cr_stop", True) + + @when(snap) + def _publish(s): + send("snap", s.value) + + +def emit_chain_snapshot(state_cown: Cown, tag: str) -> None: + """Send a chain's ``(count, head_idx)`` over the queue under ``tag``. + + Used by tests that need to inspect chain progress directly. The + helper lives in this module so the ``@when`` decorator runs through + the transpiler that registered ``schedule_step``. + + :param state_cown: The chain's state cown. + :param tag: The tag to ``send`` the snapshot under. + """ + @when(state_cown) + def _emit(s): + send(tag, (s.value.count, s.value.head_idx)) + + +# --------------------------------------------------------------------------- +# Single-point measurement (in-process; one BOC start/wait cycle) +# --------------------------------------------------------------------------- + + +def run_single_point_body(cfg: BenchConfig, repeat_index: int) -> RepeatResult: + """Run one chain-ring measurement in a fresh BOC runtime. 
+ + Snapshots ``_core.scheduler_stats()`` after warmup, then captures + the post-session snapshot via ``wait(stats=True)``. The **delta** + of the two is stored in ``RepeatResult.scheduler_stats`` so warmup + pushes do not pollute the per-window counters consumed by + ``compute_derived_metrics``. + + :param cfg: The fully-derived config. + :param repeat_index: Index of this repeat for reporting. + :return: A ``RepeatResult`` with no Cown references. + """ + # Start the runtime first: ``build_workload`` writes rings to the + # noticeboard, and noticeboard writes require the runtime to be + # running. + start(worker_count=cfg.workers) + rings, state_cowns = build_workload(cfg) + # Publish the null-payload toggle so worker behaviors can read it + # from their per-behavior noticeboard snapshot. Written before the + # warmup sleep so the noticeboard thread has flushed it well + # before t_measure_start. + notice_write("cr_null", cfg.null_payload) + payload_bytes = cfg.payload_rows * cfg.payload_cols * 8 + total_bytes = cfg.rings * cfg.ring_size * payload_bytes + print(f"workload: chain rings={cfg.rings} ring_size={cfg.ring_size} " + f"chains={cfg.rings * cfg.chains_per_ring} " + f"payload={cfg.payload_rows}x{cfg.payload_cols} " + f"(~{total_bytes / 1024:.1f} KiB matrix data)", + file=sys.stderr) + + try: + # Kick off one chain per (ring, chain-slot) pair. Recompute the + # head positions exactly the way `build_workload` chose them: + # we cannot read `cs_cown.value` from the main thread because + # Cowns are released to the runtime on construction. + spacing = max(1, cfg.ring_size // cfg.chains_per_ring) + chain_idx = 0 + for r in range(cfg.rings): + for k in range(cfg.chains_per_ring): + cs_cown = state_cowns[chain_idx] + head = (k * spacing) % cfg.ring_size + window = [rings[r][(head + i * cfg.stride) % cfg.ring_size] + for i in range(cfg.group_size)] + schedule_step(cs_cown, window, cfg.group_size) + chain_idx += 1 + + time.sleep(cfg.warmup) + from bocpy import _core + sched_stats_warm = _core.scheduler_stats() + wall_clock_ns_start = time.time_ns() + t_measure_start = time.perf_counter() + time.sleep(cfg.duration) + + schedule_snap(state_cowns) + msg = receive(["snap"], 60.0 + cfg.duration) + t_snap_received = time.perf_counter() + if msg is None or msg[0] != "snap": + raise RuntimeError("snap behavior did not publish in time") + _, total = msg + elapsed_s = t_snap_received - t_measure_start + + # Snapshot tagged-queue counters BEFORE wait() tears the + # runtime down. Per-tag assignments are rebound on the next + # start(), so capture here while they still reflect this run. + queue_stats_snap = ( + _core.queue_stats() if hasattr(_core, "queue_stats") else None + ) + finally: + # Drop bare-Cown locals before wait(). + del rings + del state_cowns + # ``wait(stats=True)`` returns the per-worker scheduler_stats + # snapshot captured AFTER all behaviors completed but BEFORE + # the per-worker array is freed -- the only correct moment + # for a session-final snapshot. 
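+        # Shape sketch, assuming N workers: both snapshots are
+        # length-N lists of per-worker dicts, and the delta computed
+        # below subtracts each ``_COUNTER_FIELDS`` entry pairwise,
+        # e.g. delta[i]["popped_local"] ==
+        #     end[i]["popped_local"] - warm[i]["popped_local"].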
+        sched_stats_end = wait(stats=True)
+
+    sched_stats_delta = _delta_scheduler_stats(sched_stats_warm,
+                                               sched_stats_end)
+    throughput = total / elapsed_s if elapsed_s > 0 else 0.0
+    return RepeatResult(repeat_index=repeat_index,
+                        completed_behaviors=int(total),
+                        elapsed_s=elapsed_s,
+                        throughput=throughput,
+                        wall_clock_ns_start=wall_clock_ns_start,
+                        scheduler_stats=sched_stats_delta,
+                        queue_stats=queue_stats_snap,
+                        derived=compute_derived_metrics(sched_stats_delta,
+                                                        int(total)))
+
+
+# ---------------------------------------------------------------------------
+# Stats-delta + derived metrics
+# ---------------------------------------------------------------------------
+
+
+# Counter fields in ``_core.scheduler_stats()`` that are monotonically
+# increasing per-worker counters and therefore subtractable across two
+# snapshots. Non-counter fields (``last_steal_attempt_ns``,
+# ``parked``) are carried over from the end-of-window snapshot
+# unchanged because subtracting them is meaningless.
+_COUNTER_FIELDS = (
+    "pushed_local",
+    "dispatched_to_pending",
+    "pushed_remote",
+    "popped_local",
+    "popped_via_steal",
+    "enqueue_cas_retries",
+    "dequeue_cas_retries",
+    "batch_resets",
+    "steal_attempts",
+    "steal_failures",
+    "fairness_arm_fires",
+)
+
+
+def _delta_scheduler_stats(warm: Optional[list],
+                           end: Optional[list]) -> Optional[list]:
+    """Return per-worker ``end - warm`` for the monotonic counter fields.
+
+    Non-counter fields (``parked``, ``last_steal_attempt_ns``) are
+    copied from ``end`` unchanged. If either snapshot is missing or
+    the worker counts disagree (for example because the runtime tore
+    down between snapshots), returns the end snapshot unchanged.
+
+    :param warm: End-of-warmup snapshot (per-worker dicts).
+    :param end: End-of-measurement-window snapshot.
+    :return: Per-worker delta dicts.
+    """
+    if not end:
+        return end
+    if not warm or len(warm) != len(end):
+        return end
+    out = []
+    for w, e in zip(warm, end):
+        d = dict(e)
+        for k in _COUNTER_FIELDS:
+            if k in e and k in w:
+                d[k] = int(e[k]) - int(w[k])
+        out.append(d)
+    return out
+
+
+def compute_derived_metrics(stats: Optional[list],
+                            completed_behaviors: int) -> dict:
+    """Compute the dispatch-contention metrics from a stats delta.
+
+    :param stats: Per-worker delta stats from ``_delta_scheduler_stats``.
+    :param completed_behaviors: Total completed behaviors over the
+        measurement window (matches the throughput numerator).
+    :return: A dict with ``producer_worker_index``,
+        ``enq_retry_ratio``, ``steal_yield``, ``idle_ratio``, and
+        ``producer_pushed_local`` so callers can reconstruct the
+        ratio's numerator / denominator without re-walking ``stats``.
+    """
+    out = {
+        "producer_worker_index": None,
+        "producer_pushed_local": 0,
+        "producer_enqueue_cas_retries": 0,
+        "enq_retry_ratio": None,
+        "steal_yield": None,
+        "idle_ratio": None,
+    }
+    if not stats:
+        return out
+    # Producer worker = the worker with the most local pushes over
+    # the measurement window. For the chain workload, that is
+    # whichever worker's queue saw the most ``schedule_fifo``
+    # evictions of ``pending`` to ``q``.
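+    # Worked example (numbers illustrative): with a delta of
+    # pushed_local == [900, 40, 30, 30], worker 0 is the producer, so
+    # enq_retry_ratio = stats[0]["enqueue_cas_retries"] / 900;
+    # steal_yield = sum(popped_via_steal) / completed_behaviors;
+    # idle_ratio = sum(steal_failures) / sum(steal_attempts).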
+ pushed_local = [int(w.get("pushed_local", 0)) for w in stats] + if not pushed_local or max(pushed_local) == 0: + return out + p_idx = max(range(len(pushed_local)), key=lambda i: pushed_local[i]) + p_pushed = pushed_local[p_idx] + p_enq_r = int(stats[p_idx].get("enqueue_cas_retries", 0)) + out["producer_worker_index"] = p_idx + out["producer_pushed_local"] = p_pushed + out["producer_enqueue_cas_retries"] = p_enq_r + out["enq_retry_ratio"] = (p_enq_r / p_pushed) if p_pushed > 0 else None + + total_steal = sum(int(w.get("popped_via_steal", 0)) for w in stats) + if completed_behaviors > 0: + out["steal_yield"] = total_steal / completed_behaviors + + total_attempts = sum(int(w.get("steal_attempts", 0)) for w in stats) + total_failures = sum(int(w.get("steal_failures", 0)) for w in stats) + if total_attempts > 0: + out["idle_ratio"] = total_failures / total_attempts + return out + + +# --------------------------------------------------------------------------- +# Subprocess orchestration +# --------------------------------------------------------------------------- + + +def cfg_to_argv(cfg: BenchConfig) -> list: + """Render a ``BenchConfig`` as CLI args for a child invocation. + + :param cfg: The config to serialize. + :return: A list of CLI arguments suitable for child invocation. + """ + args = [ + "--workers", str(cfg.workers), + "--duration", str(cfg.duration), + "--warmup", str(cfg.warmup), + "--iters", str(cfg.iters), + "--group-size", str(cfg.group_size), + "--stride", str(cfg.stride), + "--ring-size", str(cfg.ring_size), + "--payload-rows", str(cfg.payload_rows), + "--payload-cols", str(cfg.payload_cols), + "--repeats", "1", + "--sweep-axis", "none", + ] + if cfg.rings is not None: + args += ["--rings", str(cfg.rings)] + if cfg.chains_per_ring is not None: + args += ["--chains-per-ring", str(cfg.chains_per_ring)] + if cfg.null_payload: + args += ["--null-payload"] + return args + + +# Sidechannel: the parent passes its --emit-scheduler-stats flag down +# to the child via an env var so cfg_to_argv stays a pure function of +# BenchConfig (the flag is a reporting concern, not a workload knob). +BOCPY_BENCH_EMIT_SCHED_STATS_ENV = "BOCPY_BENCH_EMIT_SCHED_STATS" + + +def run_in_subprocess(cfg: BenchConfig, repeat_index: int, + git_sha: Optional[str]) -> RepeatResult: + """Run one repeat in a fresh subprocess and return its result. + + On non-zero exit / timeout / missing sentinel, raises + ``RuntimeError`` with a stderr-tail diagnostic so the caller can + record an ``error`` entry on the point. + + :param cfg: A fully-derived config with ``repeats`` ignored. + :param repeat_index: Index into the parent's ``repeats[]`` list. + :param git_sha: Optional git sha to forward to the child. 
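+    :return: The child's ``RepeatResult`` for this repeat.
+
+    Call sketch (``cfg`` already derived and validated)::
+
+        rep = run_in_subprocess(cfg, repeat_index=0, git_sha=None)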
+ """ + env = dict(os.environ) + if git_sha is not None: + env["BOCPY_BENCH_GIT_SHA"] = git_sha + + extra = [] + if env.get(BOCPY_BENCH_EMIT_SCHED_STATS_ENV) == "1": + extra.append("--emit-scheduler-stats") + + cmd = [sys.executable, "-m", "bocpy.examples.benchmark", + "--json-stdout"] + cfg_to_argv(cfg) + extra + timeout = max(cfg.duration * 3 + 30, cfg.duration + cfg.warmup + 60) + try: + proc = subprocess.run(cmd, env=env, capture_output=True, + text=True, timeout=timeout, check=False) + except subprocess.TimeoutExpired as ex: + raise RuntimeError( + f"subprocess timed out after {timeout}s; " + f"stderr tail: {(ex.stderr or '')[-400:]!r}") + + if proc.returncode != 0: + raise RuntimeError( + f"subprocess exited {proc.returncode}; " + f"stderr tail: {proc.stderr[-400:]!r}") + + payload = _extract_sentinel_payload(proc.stdout) + if payload is None: + raise RuntimeError( + "child produced no sentinel-framed JSON; " + f"stderr tail: {proc.stderr[-400:]!r}") + + return RepeatResult( + repeat_index=repeat_index, + completed_behaviors=int(payload["completed_behaviors"]), + elapsed_s=float(payload["elapsed_s"]), + throughput=float(payload["throughput"]), + wall_clock_ns_start=int(payload["wall_clock_ns_start"]), + scheduler_stats=payload.get("scheduler_stats"), + queue_stats=payload.get("queue_stats"), + derived=payload.get("derived")) + + +def _extract_sentinel_payload(stdout: str) -> Optional[dict]: + """Find and parse exactly one sentinel-framed JSON object. + + :param stdout: The captured child stdout. + :return: The parsed payload, or ``None`` if no valid frame. + """ + begin = stdout.find(SENTINEL_BEGIN) + end = stdout.find(SENTINEL_END) + if begin < 0 or end < 0 or end < begin: + return None + inner = stdout[begin + len(SENTINEL_BEGIN):end].strip() + try: + return json.loads(inner) + except json.JSONDecodeError: + return None + + +# --------------------------------------------------------------------------- +# Sweep orchestration (parent side) +# --------------------------------------------------------------------------- + + +def cfg_for_axis(base: BenchConfig, axis: str, value) -> BenchConfig: + """Clone ``base`` with one axis varied to ``value``. + + :param base: The base config. + :param axis: One of ``workers``, ``iters``, ``group-size``, + ``payload``, ``none``. + :param value: The axis value (an ``int`` for most axes; a + ``(rows, cols)`` tuple for ``payload``). + :return: A fresh ``BenchConfig`` with that axis applied. + """ + cfg = BenchConfig(**asdict(base)) + # Reset auto-sized fields so each point recomputes. + cfg.rings = base.rings + cfg.chains_per_ring = base.chains_per_ring + if axis == "workers": + cfg.workers = int(value) + cfg.rings = None + cfg.chains_per_ring = None + elif axis == "iters": + cfg.iters = int(value) + elif axis == "group-size": + cfg.group_size = int(value) + cfg.chains_per_ring = None + cfg.rings = None + elif axis == "payload": + cfg.payload_rows, cfg.payload_cols = value + elif axis == "none": + pass + else: + raise ValueError(f"unknown axis: {axis}") + return derive_sizes(cfg) + + +def summarize_repeats(reps: list) -> dict: + """Compute mean/stdev/min/max across repeats with the null-stdev rule. + + With fewer than 2 repeats, ``stdev`` / ``min`` / ``max`` are + emitted as JSON null rather than zero, to avoid false zero-height + error bars in downstream plots. + + :param reps: A list of ``RepeatResult``. + :return: A dict with mean, stdev, min, max. 
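+
+    Example (single repeat, so the null-stdev rule applies)::
+
+        >>> rep = RepeatResult(repeat_index=0, completed_behaviors=500,
+        ...                    elapsed_s=5.0, throughput=100.0,
+        ...                    wall_clock_ns_start=0)
+        >>> summarize_repeats([rep])
+        {'mean': 100.0, 'stdev': None, 'min': None, 'max': None}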
+    """
+    if not reps:
+        return {"mean": None, "stdev": None, "min": None, "max": None}
+    throughputs = [r.throughput for r in reps]
+    if len(throughputs) < 2:
+        return {"mean": throughputs[0], "stdev": None,
+                "min": None, "max": None}
+    return {
+        "mean": statistics.fmean(throughputs),
+        "stdev": statistics.stdev(throughputs),
+        "min": min(throughputs),
+        "max": max(throughputs),
+    }
+
+
+def run_sweep(axis: str, values: list, base: BenchConfig,
+              git_sha: Optional[str], output_path: str,
+              metadata: dict) -> dict:
+    """Run a sweep, flushing JSON to disk after every point.
+
+    :param axis: Sweep axis name.
+    :param values: Per-axis values in order.
+    :param base: Base configuration.
+    :param git_sha: Optional git sha to forward to children.
+    :param output_path: Destination JSON file.
+    :param metadata: Initial metadata dict (will be updated with
+        ``finished_at`` at end).
+    :return: The final results dict (also written to disk).
+    """
+    points = []
+    fixed = asdict(base)
+    if axis == "workers":
+        fixed.pop("workers", None)
+    rendered_values = [list(v) if isinstance(v, tuple) else v for v in values]
+    sweep_meta = {"axis": axis, "values": rendered_values, "fixed": fixed}
+
+    interrupted = False
+    for value in values:
+        cfg = cfg_for_axis(base, axis, value)
+        err = validate_config(cfg)
+        inputs = asdict(cfg)
+        if err is not None:
+            point = PointResult(inputs=inputs,
+                                error={"message": err, "stderr_tail": ""})
+            points.append(asdict(point))
+            print(f"point {axis}={value}: validation error: {err}",
+                  file=sys.stderr)
+            _flush_results(output_path, metadata, sweep_meta, points)
+            continue
+
+        repeats: Optional[list] = []
+        try:
+            for r in range(base.repeats):
+                print(f"point {axis}={value} repeat {r + 1}/{base.repeats}: "
+                      "spawning child...", file=sys.stderr)
+                try:
+                    rep = run_in_subprocess(cfg, r, git_sha)
+                    repeats.append(rep)
+                    print(f"  -> {rep.throughput:.1f} behaviors/s "
+                          f"({rep.completed_behaviors} in "
+                          f"{rep.elapsed_s:.2f}s)", file=sys.stderr)
+                except RuntimeError as ex:
+                    point = PointResult(
+                        inputs=inputs,
+                        repeats=[asdict(r) for r in repeats],
+                        error={"message": str(ex), "stderr_tail": ""})
+                    points.append(asdict(point))
+                    _flush_results(output_path, metadata, sweep_meta, points)
+                    repeats = None  # marker: already appended
+                    break
+        except KeyboardInterrupt:
+            interrupted = True
+            metadata["interrupted"] = True
+            if repeats:
+                point = PointResult(
+                    inputs=inputs,
+                    repeats=[asdict(r) for r in repeats],
+                    error={"message": "interrupted", "stderr_tail": ""})
+                points.append(asdict(point))
+            _flush_results(output_path, metadata, sweep_meta, points)
+            break
+
+        if repeats is None:
+            continue
+
+        summary = summarize_repeats(repeats)
+        point = PointResult(
+            inputs=inputs,
+            repeats=[asdict(r) for r in repeats],
+            throughput_mean=summary["mean"],
+            throughput_stdev=summary["stdev"],
+            throughput_min=summary["min"],
+            throughput_max=summary["max"])
+        points.append(asdict(point))
+        _flush_results(output_path, metadata, sweep_meta, points)
+
+    metadata["finished_at"] = datetime.now().isoformat(timespec="seconds")
+    metadata["interrupted"] = interrupted or metadata.get("interrupted", False)
+    final = _flush_results(output_path, metadata, sweep_meta, points)
+    return final
+
+
+def _flush_results(path: str, metadata: dict, sweep_meta: dict,
+                   points: list) -> dict:
+    """Atomic write of the results JSON; falls back to in-place on Windows.
+
+    :param path: Destination file path.
+    :param metadata: Top-level metadata dict.
+    :param sweep_meta: Sweep description dict.
+ :param points: List of point dicts. + :return: The full results document that was written. + """ + document = { + "schema_version": SCHEMA_VERSION, + "metadata": metadata, + "sweep": sweep_meta, + "points": points, + } + serialized = json.dumps(document, indent=2, default=_json_default) + os.makedirs(os.path.dirname(os.path.abspath(path)) or ".", exist_ok=True) + tmp = path + ".tmp" + with open(tmp, "w", encoding="utf-8") as f: + f.write(serialized) + delays = (0.05, 0.1, 0.2) + for attempt, delay in enumerate(delays): + try: + os.replace(tmp, path) + return document + except PermissionError: + if attempt == len(delays) - 1: + print(f"warning: atomic rename failed after {len(delays)} " + "attempts; falling back to in-place overwrite", + file=sys.stderr) + with open(path, "w", encoding="utf-8") as f: + f.write(serialized) + try: + os.unlink(tmp) + except OSError: + pass + return document + time.sleep(delay) + return document + + +def _json_default(obj): + """Coerce non-JSON-native objects (e.g. tuples) for serialization. + + :param obj: An object json.dumps could not serialize natively. + :return: A JSON-serializable representation. + """ + if isinstance(obj, (set, frozenset)): + return list(obj) + raise TypeError(f"object of type {type(obj).__name__} is not " + "JSON-serializable") + + +# --------------------------------------------------------------------------- +# Metadata +# --------------------------------------------------------------------------- + + +def collect_metadata(argv: list, git_sha: Optional[str]) -> dict: + """Collect metadata for the top of the results JSON. + + :param argv: The parent's ``sys.argv``. + :param git_sha: The git sha (or None). + :return: A metadata dict. + """ + try: + bocpy_version = _read_bocpy_version() + except Exception: + bocpy_version = None + + free_threaded = bool(getattr(sys, "_is_gil_enabled", + lambda: True)() is False) + return { + "hostname": socket.gethostname(), + "platform": sys.platform, + "cpu_count": os.cpu_count() or 0, + "python_version": sys.version.split()[0], + "python_implementation": sys.implementation.name, + "free_threaded": free_threaded, + "bocpy_version": bocpy_version, + "git_sha": git_sha, + "started_at": datetime.now().isoformat(timespec="seconds"), + "finished_at": None, + "argv": list(argv), + "interrupted": False, + } + + +def _read_bocpy_version() -> Optional[str]: + """Best-effort read of bocpy's version from importlib.metadata. + + :return: Version string or None on failure. + """ + try: + from importlib.metadata import version + return version("bocpy") + except Exception: + return None + + +def _git_sha() -> Optional[str]: + """Read git sha if available; cheap-and-fail-quietly. + + :return: A 12-char abbreviated sha, or None. + """ + cached = os.environ.get("BOCPY_BENCH_GIT_SHA") + if cached: + return cached + try: + out = subprocess.run( + ["git", "rev-parse", "--short=12", "HEAD"], + capture_output=True, text=True, timeout=5, check=False) + if out.returncode == 0: + return out.stdout.strip() or None + except (FileNotFoundError, subprocess.TimeoutExpired): + pass + return None + + +# --------------------------------------------------------------------------- +# ASCII table renderer +# --------------------------------------------------------------------------- + + +def render_table(document: dict) -> str: + """Render a compact ASCII summary table from a results document. + + :param document: A loaded results JSON. + :return: A multi-line string ready to print. 
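+
+    Output shape for a ``workers`` sweep (numbers illustrative; widths
+    are computed per run)::
+
+        workers | throughput | stdev | speedup | efficiency
+        --------+------------+-------+---------+-----------
+        1       | 1210.4     | -     | 1.00x   | 100%
+        2       | 2301.9     | -     | 1.90x   | 95%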
+    """
+    axis = document["sweep"]["axis"]
+    points = document["points"]
+    interrupted = document.get("metadata", {}).get("interrupted", False)
+
+    lines = []
+    show_speedup = axis == "workers"
+    baseline = None
+    if show_speedup and points:
+        first = points[0]
+        if interrupted or first.get("error") is not None \
+                or first.get("throughput_mean") is None:
+            show_speedup = False
+            lines.append("note: speedup/efficiency suppressed (baseline "
+                         "missing, errored, or interrupted run)")
+        else:
+            baseline = first["throughput_mean"]
+
+    headers = [axis, "throughput", "stdev"]
+    if show_speedup:
+        headers += ["speedup", "efficiency"]
+    rows = []
+    for pt in points:
+        if pt.get("error") is not None:
+            row = [_axis_label(axis, pt), "ERROR", "-"]
+            if show_speedup:
+                row += ["-", "-"]
+            rows.append(row)
+            continue
+        mean = pt.get("throughput_mean")
+        stdev = pt.get("throughput_stdev")
+        row = [
+            _axis_label(axis, pt),
+            f"{mean:.1f}" if mean is not None else "-",
+            f"{stdev:.1f}" if stdev is not None else "-",
+        ]
+        if show_speedup:
+            speedup = (mean / baseline) if mean and baseline else None
+            workers = pt["inputs"]["workers"]
+            efficiency = (speedup / workers) if speedup and workers else None
+            row += [
+                f"{speedup:.2f}x" if speedup is not None else "-",
+                f"{efficiency:.0%}" if efficiency is not None else "-",
+            ]
+        rows.append(row)
+
+    widths = [max(len(h), max((len(r[i]) for r in rows), default=0))
+              for i, h in enumerate(headers)]
+    sep = "-+-".join("-" * w for w in widths)
+    lines.append(" | ".join(h.ljust(widths[i]) for i, h in enumerate(headers)))
+    lines.append(sep)
+    for r in rows:
+        lines.append(" | ".join(r[i].ljust(widths[i]) for i in range(len(r))))
+    return "\n".join(lines)
+
+
+def _axis_label(axis: str, pt: dict) -> str:
+    """Render the axis cell value for a point row.
+
+    :param axis: Sweep axis name.
+    :param pt: A point dict.
+    :return: A string for the axis column.
+    """
+    inputs = pt.get("inputs", {})
+    if axis == "workers":
+        return str(inputs.get("workers"))
+    if axis == "iters":
+        return str(inputs.get("iters"))
+    if axis == "group-size":
+        return str(inputs.get("group_size"))
+    if axis == "payload":
+        return f"{inputs.get('payload_rows')}x{inputs.get('payload_cols')}"
+    return "-"
+
+
+# ---------------------------------------------------------------------------
+# CLI
+# ---------------------------------------------------------------------------
+
+
+def parse_payload_token(token: str) -> tuple:
+    """Parse a payload token of the form ``"<rows>x<cols>"``.
+
+    :param token: The CLI token.
+    :return: A ``(rows, cols)`` tuple.
+    """
+    if "x" not in token:
+        raise argparse.ArgumentTypeError(
+            f"payload value {token!r} must look like '<rows>x<cols>'")
+    rs, cs = token.split("x", 1)
+    try:
+        rows, cols = int(rs), int(cs)
+    except ValueError:
+        raise argparse.ArgumentTypeError(
+            f"payload value {token!r}: rows and cols must be integers")
+    if rows < 1 or cols < 1:
+        raise argparse.ArgumentTypeError(
+            f"payload value {token!r}: rows and cols must be >= 1")
+    return (rows, cols)
+
+
+def parse_sweep_values(axis: str, raw: Optional[str]) -> list:
+    """Parse ``--sweep-values`` per-axis at argparse time.
+
+    :param axis: The sweep axis.
+    :param raw: The raw CSV string, or None.
+    :return: A list of values appropriate for the axis.
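+
+    Examples::
+
+        parse_sweep_values("workers", "1,2,4")      # [1, 2, 4]
+        parse_sweep_values("payload", "8x8,16x16")  # [(8, 8), (16, 16)]
+        parse_sweep_values("none", None)            # [None]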
+ """ + if axis == "none": + if raw: + raise argparse.ArgumentTypeError( + "--sweep-values must be empty when --sweep-axis is 'none'") + return [None] + if raw is None: + return _default_sweep_values(axis) + tokens = [t.strip() for t in raw.split(",") if t.strip()] + if not tokens: + return _default_sweep_values(axis) + if axis in ("workers", "iters", "group-size"): + out = [] + for t in tokens: + try: + out.append(int(t)) + except ValueError: + raise argparse.ArgumentTypeError( + f"--sweep-values: token {t!r} is not an integer " + f"(axis={axis})") + return out + if axis == "payload": + return [parse_payload_token(t) for t in tokens] + raise argparse.ArgumentTypeError(f"unknown axis: {axis}") + + +def _default_sweep_values(axis: str) -> list: + """Return the documented default sweep values for an axis. + + :param axis: The sweep axis name. + :return: A list of default values. + """ + cpu = os.cpu_count() or 1 + if axis == "workers": + return sorted(set([1, 2, 4, 8, min(16, cpu)])) + if axis == "iters": + return [250, 500, 1000, 2000, 4000, 8000] + if axis == "group-size": + return [1, 2, 4, 8] + if axis == "payload": + return [(4, 4), (8, 8), (16, 16), (32, 32), (64, 64)] + return [] + + +def build_arg_parser() -> argparse.ArgumentParser: + """Build the CLI argument parser. + + :return: A configured ``argparse.ArgumentParser``. + """ + p = argparse.ArgumentParser( + prog="bocpy.examples.benchmark", + description="Microbenchmark for the BOC runtime.") + p.add_argument("--workers", type=int, default=None) + p.add_argument("--sweep-axis", + choices=("workers", "iters", "group-size", "payload", + "none"), + default="workers") + p.add_argument("--sweep-values", default=None) + p.add_argument("--duration", type=float, default=5.0) + p.add_argument("--warmup", type=float, default=None) + p.add_argument("--iters", type=int, default=2000) + p.add_argument("--group-size", type=int, default=2, dest="group_size") + p.add_argument("--stride", type=int, default=1) + p.add_argument("--rings", type=int, default=None) + p.add_argument("--chains-per-ring", type=int, default=None, + dest="chains_per_ring") + p.add_argument("--ring-size", type=int, default=128, dest="ring_size") + p.add_argument("--payload-rows", type=int, default=16, + dest="payload_rows") + p.add_argument("--payload-cols", type=int, default=16, + dest="payload_cols") + p.add_argument("--repeats", type=int, default=1) + p.add_argument("--null-payload", dest="null_payload", + action="store_true", default=False, + help="Skip the matmul inner loop in each behavior. " + "Throughput then reflects pure BOC runtime " + "overhead with the application work removed.") + p.add_argument("--output", default=None) + p.add_argument("--table", dest="table", action="store_true", default=None) + p.add_argument("--no-table", dest="table", action="store_false") + p.add_argument("--quiet", action="store_true") + p.add_argument("--emit-scheduler-stats", dest="emit_scheduler_stats", + action="store_true", default=False, + help="Capture _core.scheduler_stats() and " + "_core.queue_stats() snapshots after each " + "repeat and embed them in the result JSON.") + p.add_argument("--json-stdout", action="store_true", + help="Run a single point and print sentinel-framed " + "JSON to stdout (subprocess internal).") + p.add_argument("--print-table", default=None, + help="Print a table from an existing JSON file and exit.") + return p + + +def args_to_base_cfg(args) -> BenchConfig: + """Build a base ``BenchConfig`` from parsed CLI args. 
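+
+    ``--warmup`` defaults to ``min(1.0, duration * 0.1)`` when not
+    given on the command line, and ``--workers`` defaults to 1.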
+ + :param args: The parsed argparse namespace. + :return: A ``BenchConfig`` (not yet derived). + """ + workers = args.workers if args.workers is not None else 1 + warmup = args.warmup + if warmup is None: + warmup = min(1.0, args.duration * 0.1) + return BenchConfig( + workers=workers, + duration=args.duration, + warmup=warmup, + iters=args.iters, + group_size=args.group_size, + stride=args.stride, + rings=args.rings, + chains_per_ring=args.chains_per_ring, + ring_size=args.ring_size, + payload_rows=args.payload_rows, + payload_cols=args.payload_cols, + repeats=args.repeats, + null_payload=args.null_payload, + ) + + +def child_main(args) -> int: + """Run a single point and emit a sentinel-framed JSON object. + + Used by ``run_in_subprocess``. The child does **not** run the + cross-worker validation gate — that runs once in the parent before + any sweep child is spawned. + + :param args: The parsed argparse namespace. + :return: Process exit code. + """ + cfg = derive_sizes(args_to_base_cfg(args)) + err = validate_config(cfg) + if err is not None: + print(f"benchmark: invalid config: {err}", file=sys.stderr) + return 2 + emit_soft_warnings(cfg, os.cpu_count() or 1) + rep = run_single_point_body(cfg, repeat_index=0) + payload = { + "inputs": asdict(cfg), + "completed_behaviors": rep.completed_behaviors, + "elapsed_s": rep.elapsed_s, + "throughput": rep.throughput, + "wall_clock_ns_start": rep.wall_clock_ns_start, + } + if args.emit_scheduler_stats: + # Read from the snapshot taken INSIDE run_single_point_body, + # before wait() freed the per-worker array. Querying _core + # here would return empty lists. + payload["scheduler_stats"] = rep.scheduler_stats or [] + payload["queue_stats"] = rep.queue_stats or [] + # Always forward derived metrics (small dict; harmless when None). + if rep.derived is not None: + payload["derived"] = rep.derived + sys.stdout.write("\n" + SENTINEL_BEGIN + "\n") + sys.stdout.write(json.dumps(payload, default=_json_default)) + sys.stdout.write("\n" + SENTINEL_END + "\n") + sys.stdout.flush() + return 0 + + +def parent_main(args) -> int: + """Run a sweep across the requested axis. + + :param args: The parsed argparse namespace. + :return: Process exit code. + """ + base = args_to_base_cfg(args) + try: + sweep_values = parse_sweep_values(args.sweep_axis, args.sweep_values) + except argparse.ArgumentTypeError as ex: + print(f"benchmark: {ex}", file=sys.stderr) + return 2 + + # Pre-spawn validation across every sweep point. + cpu = os.cpu_count() or 1 + derived_points = [] + for value in sweep_values: + cfg = cfg_for_axis(base, args.sweep_axis, value) + err = validate_config(cfg) + if err is not None: + print(f"benchmark: sweep point {args.sweep_axis}={value} " + f"invalid: {err}", file=sys.stderr) + return 2 + emit_soft_warnings(cfg, cpu) + derived_points.append(cfg) + + git_sha = _git_sha() + + # Sidechannel: forward the emit-scheduler-stats flag to children + # via an env var. cfg_to_argv stays a pure function of BenchConfig + # because the flag is a reporting concern, not a workload knob. + if args.emit_scheduler_stats: + os.environ[BOCPY_BENCH_EMIT_SCHED_STATS_ENV] = "1" + + # Wall-clock estimate for sweep duration. 
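+    # est = points x repeats x (duration + warmup + slack); the slack
+    # term is a rough per-child allowance for interpreter and runtime
+    # startup.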
+    startup_slack = 5.0
+    est = sum((cfg.duration + cfg.warmup + startup_slack) * base.repeats
+              for cfg in derived_points)
+    print(f"sweep estimate: {len(derived_points)} points "
+          f"x {base.repeats} repeats ~ {est:.0f}s wall clock",
+          file=sys.stderr)
+
+    output_path = args.output or _default_output_path()
+    metadata = collect_metadata(sys.argv, git_sha)
+    document = run_sweep(args.sweep_axis, sweep_values, base,
+                         git_sha, output_path, metadata)
+
+    if args.table is None:
+        show_table = sys.stdout.isatty()
+    else:
+        show_table = args.table
+    if show_table and not args.quiet:
+        print(render_table(document))
+    if not args.quiet:
+        print(f"results: {output_path}", file=sys.stderr)
+    return 0
+
+
+def _default_output_path() -> str:
+    """Compute the default output path under ``results/``.
+
+    Uses ``%Y%m%dT%H%M%S`` rather than ``isoformat()`` so the filename
+    is valid on Windows (no colons).
+
+    :return: A path string.
+    """
+    ts = datetime.now().strftime("%Y%m%dT%H%M%S")
+    host = socket.gethostname().replace(os.sep, "_")
+    return os.path.join("results", f"benchmark-{host}-{ts}.json")
+
+
+def main() -> int:
+    """CLI entry point.
+
+    :return: Process exit code.
+    """
+    if sys.version_info < (3, 12):
+        sys.exit("bocpy benchmarks require Python 3.12+ for "
+                 "sub-interpreter parallelism")
+
+    parser = build_arg_parser()
+    args = parser.parse_args()
+
+    if args.print_table is not None:
+        with open(args.print_table, encoding="utf-8") as f:
+            document = json.load(f)
+        print(render_table(document))
+        return 0
+
+    if args.json_stdout:
+        return child_main(args)
+
+    return parent_main(args)
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/examples/fanout_benchmark.py b/examples/fanout_benchmark.py
new file mode 100644
index 0000000..012ef8d
--- /dev/null
+++ b/examples/fanout_benchmark.py
@@ -0,0 +1,884 @@
+"""Fanout microbenchmark for the BOC runtime.
+
+Measures the dispatch-rate ceiling on a single producer worker for the
+fanout workload. Each producer behavior runs on a
+``Cown[ProducerState]`` and, on every step:
+
+1. Allocates ``fanout_width`` **fresh** ``Cown[Matrix]`` consumers
+   (the producer does not hold them).
+2. Dispatches ``@when(consumer_i)`` per consumer; each child mutates
+   its own cown and emits a ``"child"`` completion token.
+3. Reschedules itself on the producer cown until ``producer_steps``
+   steps have run.
+
+Because the producer never holds the consumer cowns, every child
+dispatch from the worker takes the producer-local arm of
+``boc_sched_dispatch`` (``dispatched_to_pending`` then ``pushed_local``
+once ``pending`` is occupied). Contention on the producer worker's
+per-worker queue back-pointer is the failure mode the per-worker
+``BOC_WSQ_N`` sub-queues address; this benchmark surfaces
+``enqueue_cas_retries`` on the producer worker as the gating signal.
+
+This file deliberately duplicates the harness scaffolding from
+``benchmark.py`` (rule-of-three: chain and fanout are the only two
+runtime microbenchmarks today; refactoring is premature).
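+
+Example invocation (illustrative; the flags are the ones defined in
+``build_arg_parser`` below, the values are arbitrary)::
+
+    python -m bocpy.examples.fanout_benchmark --workers 8 \
+        --sweep-axis fanout-width --repeats 3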
+""" + +import argparse +import json +import os +import socket +import statistics +import subprocess +import sys +import time +from dataclasses import asdict, dataclass, field +from datetime import datetime +from typing import Optional + +from bocpy import Cown, Matrix, receive, send, start, wait, when + +SENTINEL_BEGIN = "---BOCPY-FANOUT-BEGIN---" +SENTINEL_END = "---BOCPY-FANOUT-END---" +SCHEMA_VERSION = 1 + + +# --------------------------------------------------------------------------- +# Behavior code (fanout workload, fresh-cown shape) +# --------------------------------------------------------------------------- + + +class ProducerState: + """Per-producer state held inside a ``Cown[ProducerState]``. + + Holds plain ints only; the consumer cowns this producer dispatches + against are allocated fresh inside ``schedule_producer`` on every + step and never stored on the state. + """ + + def __init__(self, producer_id: int, fanout_width: int, + child_iters: int, target_steps: int, + payload_rows: int, payload_cols: int): + """Initialize a producer state. + + :param producer_id: Unique id within the workload. + :param fanout_width: Children dispatched per step (K). + :param child_iters: Inner-loop matmul iterations per child. + :param target_steps: Number of producer steps before this + producer stops self-rescheduling. + :param payload_rows: Rows of each fresh consumer matrix. + :param payload_cols: Cols of each fresh consumer matrix. + """ + self.producer_id = producer_id + self.fanout_width = fanout_width + self.child_iters = child_iters + self.target_steps = target_steps + self.payload_rows = payload_rows + self.payload_cols = payload_cols + self.dispatched = 0 + self.steps = 0 + + +def schedule_child(consumer_cown: Cown, child_iters: int) -> None: + """Schedule one child step on a fresh consumer cown. + + The child does ``child_iters`` in-place self-multiplications of + its matrix, then emits a ``("child", 1)`` token so the parent + can count completions. + + :param consumer_cown: The child's exclusively-acquired matrix cown. + :param child_iters: Inner-loop matmul iterations, captured. + """ + @when(consumer_cown) + def _child(c): + for _ in range(child_iters): + c.value = c.value @ c.value + send("child", 1) + + +def schedule_producer(p_cown: Cown) -> None: + """Schedule one producer step on ``p_cown``. + + Allocates ``fanout_width`` fresh ``Cown[Matrix]`` consumers, + dispatches one child per consumer, then either reschedules + itself or emits ``("producer_done", producer_id)`` when + ``target_steps`` is reached. + + The producer holds only ``p_cown``; the fresh consumer cowns are + not in its acquired set, so each child dispatch takes the + producer-local arm of ``boc_sched_dispatch`` and the producer + worker is never blocked by a child. + + :param p_cown: The producer's ``Cown[ProducerState]``. + """ + @when(p_cown) + def _step(producer): + ps = producer.value + rows, cols = ps.payload_rows, ps.payload_cols + k = ps.fanout_width + for _ in range(k): + consumer = Cown(Matrix.uniform(0.0, 1.0, (rows, cols))) + schedule_child(consumer, ps.child_iters) + ps.dispatched += k + ps.steps += 1 + if ps.steps >= ps.target_steps: + send("producer_done", (ps.producer_id, ps.dispatched)) + return + # Pass the already-acquired wrapper rather than the + # closure-captured ``p_cown`` to keep the capture set minimal. 
+ schedule_producer(producer) + + +# --------------------------------------------------------------------------- +# Configuration and result types +# --------------------------------------------------------------------------- + + +@dataclass +class FanoutConfig: + """Plain-data fanout configuration (no Cowns).""" + + workers: int = 4 + producers: Optional[int] = None + fanout_width: Optional[int] = None + child_iters: int = 1 + producer_steps: int = 1000 + payload_rows: int = 16 + payload_cols: int = 16 + repeats: int = 1 + + +@dataclass +class RepeatResult: + """Plain-data result for a single repeat of one sweep point.""" + + repeat_index: int + completed_children: int + elapsed_s: float + throughput: float + wall_clock_ns_start: int + scheduler_stats: Optional[list] = None + derived: Optional[dict] = None + + +@dataclass +class PointResult: + """Plain-data result for a single sweep point.""" + + inputs: dict + repeats: list = field(default_factory=list) + throughput_mean: Optional[float] = None + throughput_stdev: Optional[float] = None + throughput_min: Optional[float] = None + throughput_max: Optional[float] = None + error: Optional[dict] = None + + +# --------------------------------------------------------------------------- +# Sizing / validation +# --------------------------------------------------------------------------- + + +def derive_sizes(cfg: FanoutConfig) -> FanoutConfig: + """Auto-size ``producers`` and ``fanout_width`` if not overridden. + + Defaults: one producer per ~4 workers (minimum 1), and + ``K = 4 * workers`` children per producer step. These reproduce + a contention-heavy operating point on the fanout workload. + + :param cfg: An input config (mutated and returned). + :return: The same config. + """ + if cfg.producers is None: + cfg.producers = max(1, cfg.workers // 4) + if cfg.fanout_width is None: + cfg.fanout_width = max(1, 4 * cfg.workers) + return cfg + + +def validate_config(cfg: FanoutConfig) -> Optional[str]: + """Validate a fully-derived config. + + :param cfg: A config with ``producers`` and ``fanout_width`` set. + :return: An error message, or ``None`` if valid. + """ + if cfg.workers < 1: + return f"workers must be >= 1, got {cfg.workers}" + if cfg.producers is None or cfg.producers < 1: + return f"producers must be >= 1, got {cfg.producers}" + if cfg.fanout_width is None or cfg.fanout_width < 1: + return f"fanout_width must be >= 1, got {cfg.fanout_width}" + if cfg.child_iters < 1: + return f"child_iters must be >= 1, got {cfg.child_iters}" + if cfg.producer_steps < 1: + return f"producer_steps must be >= 1, got {cfg.producer_steps}" + if cfg.payload_rows < 1 or cfg.payload_cols < 1: + return "payload dimensions must be >= 1" + return None + + +# --------------------------------------------------------------------------- +# Single-point measurement +# --------------------------------------------------------------------------- + + +def run_single_point_body(cfg: FanoutConfig, repeat_index: int) -> RepeatResult: + """Run one fanout measurement in a fresh BOC runtime. + + Total expected completions = ``producers * fanout_width * + producer_steps``. The parent waits for that many ``child`` tokens + and ``producers`` ``producer_done`` tokens before tearing the + runtime down. ``wait(stats=True)`` returns the per-worker + counters captured at shutdown. + + :param cfg: The fully-derived config. + :param repeat_index: Repeat index for reporting. + :return: A ``RepeatResult`` with no Cown references. 
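+
+    For example (hypothetical sizes): ``producers=2``,
+    ``fanout_width=16``, ``producer_steps=1000`` waits for
+    ``2 * 16 * 1000 = 32000`` ``child`` tokens and two
+    ``producer_done`` tokens before teardown.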
+ """ + start(worker_count=cfg.workers) + total_expected = cfg.producers * cfg.fanout_width * cfg.producer_steps + payload_bytes = cfg.payload_rows * cfg.payload_cols * 8 + print(f"workload: fanout (fresh-cown) producers={cfg.producers} " + f"fanout_width={cfg.fanout_width} " + f"producer_steps={cfg.producer_steps} " + f"child_iters={cfg.child_iters} " + f"expected_children={total_expected} " + f"payload={cfg.payload_rows}x{cfg.payload_cols} " + f"(~{payload_bytes / 1024:.2f} KiB per consumer cown)", + file=sys.stderr) + + # Allocate producer state cowns. + producer_cowns = [ + Cown(ProducerState( + producer_id=pid, + fanout_width=cfg.fanout_width, + child_iters=cfg.child_iters, + target_steps=cfg.producer_steps, + payload_rows=cfg.payload_rows, + payload_cols=cfg.payload_cols)) + for pid in range(cfg.producers) + ] + + # Generous wall-clock ceiling. + timeout_s = max(60.0, total_expected * 0.001) + + try: + wall_clock_ns_start = time.time_ns() + t_measure_start = time.perf_counter() + + for p_cown in producer_cowns: + schedule_producer(p_cown) + + # Drain child completions. + completed = 0 + while completed < total_expected: + msg = receive(["child"], timeout_s) + if msg is None or msg[0] != "child": + raise RuntimeError( + f"only {completed}/{total_expected} child tokens " + f"received within {timeout_s:.0f}s") + completed += 1 + + # Drain producer-done acks. + producer_dispatched = 0 + for _ in range(cfg.producers): + msg = receive(["producer_done"], timeout_s) + if msg is None or msg[0] != "producer_done": + raise RuntimeError( + f"producer_done not received within {timeout_s:.0f}s") + _, (_pid, count) = msg + producer_dispatched += count + + t_end = time.perf_counter() + elapsed_s = t_end - t_measure_start + + if producer_dispatched != completed: + raise RuntimeError( + f"dispatched/completed mismatch: dispatched=" + f"{producer_dispatched} completed={completed}") + finally: + del producer_cowns + sched_stats_end = wait(stats=True) + + throughput = completed / elapsed_s if elapsed_s > 0 else 0.0 + return RepeatResult( + repeat_index=repeat_index, + completed_children=int(completed), + elapsed_s=elapsed_s, + throughput=throughput, + wall_clock_ns_start=wall_clock_ns_start, + scheduler_stats=sched_stats_end, + derived=compute_derived_metrics(sched_stats_end, int(completed))) + + +# --------------------------------------------------------------------------- +# Derived metrics (dispatch-contention signal) +# --------------------------------------------------------------------------- + + +def compute_derived_metrics(stats: Optional[list], + completed_children: int) -> dict: + """Compute the dispatch-contention signal from a per-worker stats snapshot. + + The producer worker is identified as the worker with the largest + ``pushed_local + dispatched_to_pending`` total over the session. + The gate ratio is ``enqueue_cas_retries / (pushed_local + + dispatched_to_pending)`` on that worker. + + Also computes a **fairness** signal: how evenly the work landed + across workers, measured as the coefficient of variation of + ``popped_local + popped_via_steal`` across all workers, plus the + Gini coefficient of the same vector. Lower is fairer; perfectly + balanced (every worker did the same number of behaviors) is + ``fairness_cv = 0`` and ``fairness_gini = 0``. + + :param stats: Per-worker snapshot from ``wait(stats=True)``. + :param completed_children: Total child completions over the run. + :return: A dict with the gate inputs and outputs. 
+ """ + out = { + "producer_worker_index": None, + "producer_pushed_local": 0, + "producer_dispatched_to_pending": 0, + "producer_enqueue_cas_retries": 0, + "enq_retry_ratio": None, + "steal_yield": None, + "idle_ratio": None, + "fairness_cv": None, + "fairness_gini": None, + "worker_pop_min": None, + "worker_pop_max": None, + "worker_pop_mean": None, + "worker_pop_counts": None, + } + if not stats: + return out + producer_pushes = [ + int(w.get("pushed_local", 0)) + + int(w.get("dispatched_to_pending", 0)) + for w in stats + ] + if max(producer_pushes) == 0: + return out + p_idx = max(range(len(producer_pushes)), key=lambda i: producer_pushes[i]) + p_local = int(stats[p_idx].get("pushed_local", 0)) + p_pending = int(stats[p_idx].get("dispatched_to_pending", 0)) + p_enq_r = int(stats[p_idx].get("enqueue_cas_retries", 0)) + p_total = p_local + p_pending + out["producer_worker_index"] = p_idx + out["producer_pushed_local"] = p_local + out["producer_dispatched_to_pending"] = p_pending + out["producer_enqueue_cas_retries"] = p_enq_r + out["enq_retry_ratio"] = (p_enq_r / p_total) if p_total > 0 else None + + total_steal = sum(int(w.get("popped_via_steal", 0)) for w in stats) + if completed_children > 0: + out["steal_yield"] = total_steal / completed_children + + total_attempts = sum(int(w.get("steal_attempts", 0)) for w in stats) + total_failures = sum(int(w.get("steal_failures", 0)) for w in stats) + if total_attempts > 0: + out["idle_ratio"] = total_failures / total_attempts + + # Fairness: distribution of work across workers. We count + # popped_local + popped_via_steal per worker — this is what each + # worker actually executed (regardless of who pushed it). For a + # single-producer fanout the producer worker pushes everything; + # fairness measures whether stealing redistributed evenly. + pops = [ + int(w.get("popped_local", 0)) + int(w.get("popped_via_steal", 0)) + for w in stats + ] + n = len(pops) + total = sum(pops) + if n > 0 and total > 0: + mean = total / n + if n > 1: + stdev = statistics.pstdev(pops) + out["fairness_cv"] = stdev / mean if mean > 0 else None + else: + out["fairness_cv"] = 0.0 + # Gini: 0 is perfectly equal, 1 is maximally unequal. + sorted_pops = sorted(pops) + cum = 0 + weighted = 0 + for i, v in enumerate(sorted_pops, start=1): + cum += v + weighted += i * v + if cum > 0: + out["fairness_gini"] = (2 * weighted) / (n * cum) - (n + 1) / n + out["worker_pop_min"] = min(pops) + out["worker_pop_max"] = max(pops) + out["worker_pop_mean"] = mean + out["worker_pop_counts"] = pops + return out + + +# --------------------------------------------------------------------------- +# Subprocess orchestration (one repeat per child, fresh runtime) +# --------------------------------------------------------------------------- + + +def cfg_to_argv(cfg: FanoutConfig) -> list: + """Render a ``FanoutConfig`` as CLI args for a child invocation. + + :param cfg: The config to serialize. + :return: A list of CLI arguments. 
+ """ + args = [ + "--workers", str(cfg.workers), + "--child-iters", str(cfg.child_iters), + "--producer-steps", str(cfg.producer_steps), + "--payload-rows", str(cfg.payload_rows), + "--payload-cols", str(cfg.payload_cols), + "--repeats", "1", + "--sweep-axis", "none", + ] + if cfg.producers is not None: + args += ["--producers", str(cfg.producers)] + if cfg.fanout_width is not None: + args += ["--fanout-width", str(cfg.fanout_width)] + return args + + +def run_in_subprocess(cfg: FanoutConfig, repeat_index: int, + git_sha: Optional[str]) -> RepeatResult: + """Run one repeat in a fresh subprocess and return its result. + + :param cfg: A fully-derived config. + :param repeat_index: Index into the parent's ``repeats[]`` list. + :param git_sha: Optional git sha forwarded to the child. + :return: A ``RepeatResult``. + """ + env = dict(os.environ) + if git_sha is not None: + env["BOCPY_BENCH_GIT_SHA"] = git_sha + cmd = [sys.executable, "-m", "bocpy.examples.fanout_benchmark", + "--json-stdout"] + cfg_to_argv(cfg) + total_expected = cfg.producers * cfg.fanout_width * cfg.producer_steps + timeout = max(120.0, total_expected * 0.002 + 30) + try: + proc = subprocess.run(cmd, env=env, capture_output=True, + text=True, timeout=timeout, check=False) + except subprocess.TimeoutExpired as ex: + raise RuntimeError( + f"subprocess timed out after {timeout}s; " + f"stderr tail: {(ex.stderr or '')[-400:]!r}") + if proc.returncode != 0: + raise RuntimeError( + f"subprocess exited {proc.returncode}; " + f"stderr tail: {proc.stderr[-400:]!r}") + payload = _extract_sentinel_payload(proc.stdout) + if payload is None: + raise RuntimeError( + "child produced no sentinel-framed JSON; " + f"stderr tail: {proc.stderr[-400:]!r}") + return RepeatResult( + repeat_index=repeat_index, + completed_children=int(payload["completed_children"]), + elapsed_s=float(payload["elapsed_s"]), + throughput=float(payload["throughput"]), + wall_clock_ns_start=int(payload["wall_clock_ns_start"]), + scheduler_stats=payload.get("scheduler_stats"), + derived=payload.get("derived")) + + +def _extract_sentinel_payload(stdout: str) -> Optional[dict]: + """Find and parse exactly one sentinel-framed JSON object.""" + begin = stdout.find(SENTINEL_BEGIN) + end = stdout.find(SENTINEL_END) + if begin < 0 or end < 0 or end < begin: + return None + inner = stdout[begin + len(SENTINEL_BEGIN):end].strip() + try: + return json.loads(inner) + except json.JSONDecodeError: + return None + + +# --------------------------------------------------------------------------- +# Sweep orchestration +# --------------------------------------------------------------------------- + + +def cfg_for_axis(base: FanoutConfig, axis: str, value) -> FanoutConfig: + """Clone ``base`` with one axis varied to ``value``. + + :param base: The base config. + :param axis: One of ``workers``, ``fanout-width``, ``producers``, + ``child-iters``, ``producer-steps``, ``none``. + :param value: The axis value. + :return: A fresh ``FanoutConfig``. + """ + cfg = FanoutConfig(**asdict(base)) + if axis == "workers": + cfg.workers = int(value) + # Re-derive producers/fanout-width when sweeping workers + # unless the user explicitly pinned them at the base. 
+ if base.producers is None: + cfg.producers = None + if base.fanout_width is None: + cfg.fanout_width = None + elif axis == "fanout-width": + cfg.fanout_width = int(value) + elif axis == "producers": + cfg.producers = int(value) + elif axis == "child-iters": + cfg.child_iters = int(value) + elif axis == "producer-steps": + cfg.producer_steps = int(value) + elif axis == "none": + pass + else: + raise ValueError(f"unknown axis: {axis}") + return derive_sizes(cfg) + + +def summarize_repeats(reps: list) -> dict: + """Compute mean/stdev/min/max across repeats. + + With <2 repeats, stdev/min/max are emitted as JSON null to avoid + false zero-height error bars in plots. + + :param reps: A list of ``RepeatResult``. + :return: A summary dict. + """ + if not reps: + return {"mean": None, "stdev": None, "min": None, "max": None} + throughputs = [r.throughput for r in reps] + if len(throughputs) < 2: + return {"mean": throughputs[0], "stdev": None, + "min": None, "max": None} + return { + "mean": statistics.fmean(throughputs), + "stdev": statistics.stdev(throughputs), + "min": min(throughputs), + "max": max(throughputs), + } + + +def run_sweep(axis: str, values: list, base: FanoutConfig, + git_sha: Optional[str], output_path: str, + metadata: dict) -> dict: + """Run a sweep, flushing JSON to disk after every point.""" + points = [] + fixed = asdict(base) + rendered_values = [list(v) if isinstance(v, tuple) else v for v in values] + sweep_meta = {"axis": axis, "values": rendered_values, "fixed": fixed} + + interrupted = False + for value in values: + cfg = cfg_for_axis(base, axis, value) + err = validate_config(cfg) + inputs = asdict(cfg) + if err is not None: + point = PointResult(inputs=inputs, + error={"message": err, "stderr_tail": ""}) + points.append(asdict(point)) + print(f"point {axis}={value}: validation error: {err}", + file=sys.stderr) + _flush_results(output_path, metadata, sweep_meta, points) + continue + + repeats: list = [] + try: + for r in range(base.repeats): + print(f"point {axis}={value} repeat {r + 1}/{base.repeats}: " + "spawning child...", file=sys.stderr) + try: + rep = run_in_subprocess(cfg, r, git_sha) + repeats.append(rep) + print(f" -> {rep.throughput:.1f} children/s " + f"({rep.completed_children} in " + f"{rep.elapsed_s:.2f}s)", file=sys.stderr) + except RuntimeError as ex: + point = PointResult( + inputs=inputs, + repeats=[asdict(r) for r in repeats], + error={"message": str(ex), "stderr_tail": ""}) + points.append(asdict(point)) + _flush_results(output_path, metadata, sweep_meta, points) + repeats = None + break + except KeyboardInterrupt: + interrupted = True + metadata["interrupted"] = True + if repeats: + point = PointResult( + inputs=inputs, + repeats=[asdict(r) for r in repeats], + error={"message": "interrupted", "stderr_tail": ""}) + points.append(asdict(point)) + _flush_results(output_path, metadata, sweep_meta, points) + break + + if repeats is None: + continue + + summary = summarize_repeats(repeats) + point = PointResult( + inputs=inputs, + repeats=[asdict(r) for r in repeats], + throughput_mean=summary["mean"], + throughput_stdev=summary["stdev"], + throughput_min=summary["min"], + throughput_max=summary["max"]) + points.append(asdict(point)) + _flush_results(output_path, metadata, sweep_meta, points) + + metadata["finished_at"] = datetime.now().isoformat(timespec="seconds") + metadata["interrupted"] = interrupted or metadata.get("interrupted", False) + final = _flush_results(output_path, metadata, sweep_meta, points) + return final + + +def 
_flush_results(path: str, metadata: dict, sweep_meta: dict, + points: list) -> dict: + """Atomic write of the results JSON; falls back to in-place on Windows.""" + document = { + "schema_version": SCHEMA_VERSION, + "metadata": metadata, + "sweep": sweep_meta, + "points": points, + } + serialized = json.dumps(document, indent=2, default=_json_default) + os.makedirs(os.path.dirname(os.path.abspath(path)) or ".", exist_ok=True) + tmp = path + ".tmp" + with open(tmp, "w", encoding="utf-8") as f: + f.write(serialized) + delays = (0.05, 0.1, 0.2) + for attempt, delay in enumerate(delays): + try: + os.replace(tmp, path) + return document + except PermissionError: + if attempt == len(delays) - 1: + with open(path, "w", encoding="utf-8") as f: + f.write(serialized) + try: + os.unlink(tmp) + except OSError: + pass + return document + time.sleep(delay) + return document + + +def _json_default(obj): + """Coerce non-JSON-native objects for serialization.""" + if isinstance(obj, (set, frozenset)): + return list(obj) + raise TypeError(f"object of type {type(obj).__name__} is not " + "JSON-serializable") + + +# --------------------------------------------------------------------------- +# Metadata +# --------------------------------------------------------------------------- + + +def collect_metadata(argv: list, git_sha: Optional[str]) -> dict: + """Collect metadata for the top of the results JSON.""" + try: + from importlib.metadata import version + bocpy_version = version("bocpy") + except Exception: + bocpy_version = None + free_threaded = bool(getattr(sys, "_is_gil_enabled", + lambda: True)() is False) + return { + "hostname": socket.gethostname(), + "platform": sys.platform, + "cpu_count": os.cpu_count() or 0, + "python_version": sys.version.split()[0], + "python_implementation": sys.implementation.name, + "free_threaded": free_threaded, + "bocpy_version": bocpy_version, + "git_sha": git_sha, + "started_at": datetime.now().isoformat(timespec="seconds"), + "finished_at": None, + "argv": list(argv), + "interrupted": False, + } + + +def _git_sha() -> Optional[str]: + """Read git sha if available; cheap-and-fail-quietly.""" + cached = os.environ.get("BOCPY_BENCH_GIT_SHA") + if cached: + return cached + try: + out = subprocess.run( + ["git", "rev-parse", "--short=12", "HEAD"], + capture_output=True, text=True, timeout=5, check=False) + if out.returncode == 0: + return out.stdout.strip() or None + except (FileNotFoundError, subprocess.TimeoutExpired): + pass + return None + + +# --------------------------------------------------------------------------- +# CLI +# --------------------------------------------------------------------------- + + +def parse_sweep_values(axis: str, raw: Optional[str]) -> list: + """Parse ``--sweep-values`` per-axis at argparse time. + + :param axis: The sweep axis. + :param raw: The raw CSV string, or None. + :return: A list of values appropriate for the axis. 
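+
+    For example, ``parse_sweep_values("producers", "1, 2,4")`` returns
+    ``[1, 2, 4]``, and ``parse_sweep_values("workers", None)`` returns
+    the defaults from ``_default_sweep_values``.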
+ """ + if axis == "none": + if raw: + raise argparse.ArgumentTypeError( + "--sweep-values must be empty when --sweep-axis is 'none'") + return [None] + if raw is None: + return _default_sweep_values(axis) + tokens = [t.strip() for t in raw.split(",") if t.strip()] + if not tokens: + return _default_sweep_values(axis) + out = [] + for t in tokens: + try: + out.append(int(t)) + except ValueError: + raise argparse.ArgumentTypeError( + f"--sweep-values: token {t!r} is not an integer " + f"(axis={axis})") + return out + + +def _default_sweep_values(axis: str) -> list: + """Return the documented default sweep values for an axis.""" + cpu = os.cpu_count() or 1 + if axis == "workers": + return sorted(set([1, 2, 4, 8, min(16, cpu)])) + if axis == "fanout-width": + return [1, 2, 4, 8, 16, 32] + if axis == "producers": + return [1, 2, 4, 8] + if axis == "child-iters": + return [1, 2, 4, 8] + if axis == "producer-steps": + return [100, 500, 1000, 5000] + return [] + + +def build_arg_parser() -> argparse.ArgumentParser: + """Build the CLI argument parser.""" + p = argparse.ArgumentParser( + prog="bocpy.examples.fanout_benchmark", + description="Fanout microbenchmark for the BOC runtime.") + p.add_argument("--workers", type=int, default=4) + p.add_argument("--sweep-axis", + choices=("workers", "fanout-width", "producers", + "child-iters", "producer-steps", "none"), + default="workers") + p.add_argument("--sweep-values", default=None) + p.add_argument("--producers", type=int, default=None) + p.add_argument("--fanout-width", type=int, default=None, + dest="fanout_width") + p.add_argument("--child-iters", type=int, default=1, dest="child_iters") + p.add_argument("--producer-steps", type=int, default=1000, + dest="producer_steps") + p.add_argument("--payload-rows", type=int, default=16, + dest="payload_rows") + p.add_argument("--payload-cols", type=int, default=16, + dest="payload_cols") + p.add_argument("--repeats", type=int, default=1) + p.add_argument("--output", default=None) + p.add_argument("--quiet", action="store_true") + p.add_argument("--json-stdout", action="store_true", + help="Run a single point and print sentinel-framed " + "JSON to stdout (subprocess internal).") + return p + + +def args_to_base_cfg(args) -> FanoutConfig: + """Build a base ``FanoutConfig`` from parsed CLI args.""" + return FanoutConfig( + workers=args.workers, + producers=args.producers, + fanout_width=args.fanout_width, + child_iters=args.child_iters, + producer_steps=args.producer_steps, + payload_rows=args.payload_rows, + payload_cols=args.payload_cols, + repeats=args.repeats, + ) + + +def child_main(args) -> int: + """Run a single point and emit a sentinel-framed JSON object.""" + cfg = derive_sizes(args_to_base_cfg(args)) + err = validate_config(cfg) + if err is not None: + print(f"fanout_benchmark: invalid config: {err}", file=sys.stderr) + return 2 + rep = run_single_point_body(cfg, repeat_index=0) + payload = { + "inputs": asdict(cfg), + "completed_children": rep.completed_children, + "elapsed_s": rep.elapsed_s, + "throughput": rep.throughput, + "wall_clock_ns_start": rep.wall_clock_ns_start, + "scheduler_stats": rep.scheduler_stats or [], + } + if rep.derived is not None: + payload["derived"] = rep.derived + sys.stdout.write("\n" + SENTINEL_BEGIN + "\n") + sys.stdout.write(json.dumps(payload, default=_json_default)) + sys.stdout.write("\n" + SENTINEL_END + "\n") + sys.stdout.flush() + return 0 + + +def parent_main(args) -> int: + """Run a sweep across the requested axis.""" + base = args_to_base_cfg(args) + try: + 
sweep_values = parse_sweep_values(args.sweep_axis, args.sweep_values) + except argparse.ArgumentTypeError as ex: + print(f"fanout_benchmark: {ex}", file=sys.stderr) + return 2 + for value in sweep_values: + cfg = cfg_for_axis(base, args.sweep_axis, value) + err = validate_config(cfg) + if err is not None: + print(f"fanout_benchmark: sweep point {args.sweep_axis}={value} " + f"invalid: {err}", file=sys.stderr) + return 2 + git_sha = _git_sha() + output_path = args.output or _default_output_path() + metadata = collect_metadata(sys.argv, git_sha) + run_sweep(args.sweep_axis, sweep_values, base, git_sha, + output_path, metadata) + if not args.quiet: + print(f"results: {output_path}", file=sys.stderr) + return 0 + + +def _default_output_path() -> str: + """Compute the default output path under ``results/``.""" + ts = datetime.now().strftime("%Y%m%dT%H%M%S") + host = socket.gethostname().replace(os.sep, "_") + return os.path.join("results", f"fanout-{host}-{ts}.json") + + +def main() -> int: + """CLI entry point.""" + if sys.version_info < (3, 12): + sys.exit("bocpy benchmarks require Python 3.12+ for " + "sub-interpreter parallelism") + parser = build_arg_parser() + args = parser.parse_args() + if args.json_stdout: + return child_main(args) + return parent_main(args) + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/pyproject.toml b/pyproject.toml index 300fb3b..ae9673c 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta" [project] name = "bocpy" -version = "0.4.0" +version = "0.5.0" authors = [ {name = "bocpy Team", email="bocpy@microsoft.com"} ] diff --git a/setup.py b/setup.py index 223bd72..a7721f8 100644 --- a/setup.py +++ b/setup.py @@ -1,4 +1,6 @@ +import os import re +import sys from pathlib import Path from setuptools import Extension, setup @@ -14,18 +16,55 @@ flags=re.DOTALL, ) +# The `_internal_test` extension exposes private C primitives (atomics, +# work-stealing queue cursors, MPMC behaviour queue) used only by the +# pytest suite. It must NOT ship in distributed wheels. It is only built +# when BOCPY_BUILD_INTERNAL_TESTS is set to a truthy value (e.g. "1"), +# which the developer-facing test workflow / CI test job sets explicitly. +# +# As a hard backstop we also refuse to build it when setuptools is being +# invoked to produce a wheel or sdist (e.g. by `pypa/cibuildwheel` in +# `.github/workflows/build_wheels.yml`), regardless of the env var. This +# guarantees the extension cannot leak into a release artifact even if a +# future workflow accidentally inherits BOCPY_BUILD_INTERNAL_TESTS=1. 
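+#
+# Illustrative developer workflow for an in-place test build:
+#
+#   BOCPY_BUILD_INTERNAL_TESTS=1 pip install -e .
+#   python -m pytest test/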
+_building_distribution = any( + cmd in sys.argv for cmd in ("bdist_wheel", "bdist_egg", "sdist") +) +_build_internal_tests = ( + os.environ.get("BOCPY_BUILD_INTERNAL_TESTS", "").lower() + in ("1", "true", "yes", "on") + and not _building_distribution +) + +_ext_modules = [ + Extension( + name="bocpy._core", + sources=["src/bocpy/_core.c", "src/bocpy/compat.c", "src/bocpy/noticeboard.c", + "src/bocpy/sched.c", "src/bocpy/tags.c", "src/bocpy/terminator.c"], + ), + Extension( + name="bocpy._math", + sources=["src/bocpy/_math.c", "src/bocpy/compat.c"], + ), +] + +if _build_internal_tests: + _ext_modules.append( + Extension( + name="bocpy._internal_test", + sources=[ + "src/bocpy/_internal_test.c", + "src/bocpy/_internal_test_atomics.c", + "src/bocpy/_internal_test_bq.c", + "src/bocpy/_internal_test_wsq.c", + "src/bocpy/compat.c", + "src/bocpy/sched.c", + ], + ) + ) + setup( long_description=_readme, long_description_content_type="text/markdown", - ext_modules=[ - Extension( - name="bocpy._core", - sources=["src/bocpy/_core.c"], - ), - Extension( - name="bocpy._math", - sources=["src/bocpy/_math.c"], - ), - - ] + ext_modules=_ext_modules, ) diff --git a/sphinx/source/conf.py b/sphinx/source/conf.py index e7fcb65..3526c8e 100644 --- a/sphinx/source/conf.py +++ b/sphinx/source/conf.py @@ -14,7 +14,7 @@ project = 'bocpy' copyright = '2026, Microsoft' author = 'Microsoft' -release = '0.4.0' +release = '0.5.0' # -- General configuration --------------------------------------------------- # https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration diff --git a/src/bocpy/__init__.pyi b/src/bocpy/__init__.pyi index fc21f25..0f319b8 100644 --- a/src/bocpy/__init__.pyi +++ b/src/bocpy/__init__.pyi @@ -565,22 +565,65 @@ def notice_sync(timeout: Optional[float] = 30.0) -> int: """ -def wait(timeout: Optional[float] = None): +def wait(timeout: Optional[float] = None, *, stats: bool = False): """Block until all behaviors complete, with optional timeout. On a successful return the runtime is **stopped**: workers are - joined, the noticeboard thread exits, the export tempdir is removed, - and the terminator is closed. The next ``@when`` call (or explicit - :func:`start`) will spin up a fresh runtime. + joined, the noticeboard thread exits, the C-level noticeboard + slot is released, and the terminator is closed. The next + ``@when`` call (or explicit :func:`start`) will spin up a fresh + runtime. Note that holding on to references to Cown objects such that they are deallocated after wait() is called results in undefined behavior. :param timeout: Maximum number of seconds to wait, or ``None`` to wait indefinitely. The timeout bounds only the quiescence and - noticeboard-drain phases; worker shutdown and tempdir cleanup - run to completion regardless. + noticeboard-drain phases; worker shutdown runs to completion + regardless. Values above ``1e9`` seconds (~31.7 years) are + clamped to wait-forever to avoid platform ``time_t`` / + ``DWORD`` overflow inside the underlying condition-variable + wait. :type timeout: Optional[float] + :param stats: If ``True``, return the per-worker + :func:`_core.scheduler_stats` snapshot captured at shutdown + (after every behavior has run, before the per-worker array + is freed). This is the only reliable way to read the + scheduler counters for the session that just ended -- + calling :func:`_core.scheduler_stats` after :func:`wait` + returns ``[]`` because the per-worker array has already been + reclaimed. 
Returns ``[]`` if the runtime was never started + or the snapshot could not be captured. Each dict has the + keys documented on :func:`_core.scheduler_stats` + (``worker_index``, ``pushed_local``, + ``dispatched_to_pending``, ``pushed_remote``, + ``popped_local``, ``popped_via_steal``, + ``enqueue_cas_retries``, ``dequeue_cas_retries``, + ``batch_resets``, ``steal_attempts``, ``steal_failures``, + ``parked``, ``last_steal_attempt_ns``, + ``fairness_arm_fires``, plus the per-sub-queue + ``boc_bq_t`` counters). + :type stats: bool + :return: ``None`` when ``stats=False``; otherwise the per-worker + stats list (same shape as :func:`_core.scheduler_stats`). + :rtype: Optional[list[dict]] + :raises RuntimeError: If the noticeboard thread does not exit + before the timeout (or, on a retry call, is still alive). + The first failure carries the message prefix + ``"noticeboard thread did not shut down within timeout=..."``; + subsequent retry failures carry + ``"noticeboard thread still pinned on retry ..."``. Workers + and the orphan-behavior drain have already completed by the + time either is raised, so the runtime is intentionally left + re-drivable: callers may retry ``wait()`` / ``stop()`` once + the in-flight noticeboard mutation finishes. **Note:** when + ``stats=True`` and ``stop()`` raises *after* runtime + teardown has already completed (i.e. workers joined and the + noticeboard closed), the exception is suppressed and the + captured snapshot is returned instead — callers who require + the exception to propagate should call :func:`wait` (without + ``stats``) and read :func:`_core.scheduler_stats` from a + prior in-session call. """ @@ -617,10 +660,6 @@ def start(**kwargs): :param worker_count: The number of worker interpreters to start. If ``None``, defaults to the number of available cores minus one. :type worker_count: Optional[int] - :param export_dir: The directory to which the target module will be - exported for worker import. If ``None``, a temporary directory - will be created and removed on shutdown. - :type export_dir: Optional[str] :param module: A tuple of the target module name and file path to export for worker import. If ``None``, the caller's module will be used. 
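A minimal usage sketch of the `wait(stats=True)` contract documented above (illustrative only; the behavior body mirrors the fanout child in `examples/fanout_benchmark.py`, and the printed keys come from the documented `scheduler_stats` field list):

```python
from bocpy import Cown, Matrix, start, wait, when

start(worker_count=2)
m = Cown(Matrix.uniform(0.0, 1.0, (4, 4)))

@when(m)
def _square(c):
    c.value = c.value @ c.value  # same shape as the fanout child body

del m  # drop Cown references before wait(); see the note above
stats = wait(stats=True)  # per-worker snapshot captured at shutdown
for w in stats:
    print(w["worker_index"], w["popped_local"], w["popped_via_steal"])
```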
diff --git a/src/bocpy/_core.c b/src/bocpy/_core.c index 2b2224b..089cc6c 100644 --- a/src/bocpy/_core.c +++ b/src/bocpy/_core.c @@ -1,238 +1,12 @@ #define PY_SSIZE_T_CLEAN -#include -#include -#include -#include - -#ifdef _WIN32 -#define WIN32_LEAN_AND_MEAN -#include -#include -typedef volatile int_least64_t atomic_int_least64_t; -typedef volatile intptr_t atomic_intptr_t; - -int_least64_t atomic_fetch_add(atomic_int_least64_t *ptr, int_least64_t value) { - return InterlockedExchangeAdd64(ptr, value); -} - -int_least64_t atomic_fetch_sub(atomic_int_least64_t *ptr, int_least64_t value) { - return InterlockedExchangeAdd64(ptr, -value); -} - -bool atomic_compare_exchange_strong(atomic_int_least64_t *ptr, - atomic_int_least64_t *expected, - int_least64_t desired) { - int_least64_t prev; - prev = InterlockedCompareExchange64(ptr, desired, *expected); - if (prev == *expected) { - return true; - } - - *expected = prev; - return false; -} - -int_least64_t atomic_load(atomic_int_least64_t *ptr) { return *ptr; } - -int_least64_t atomic_exchange(atomic_int_least64_t *ptr, int_least64_t value) { - return InterlockedExchange64(ptr, value); -} - -void atomic_store(atomic_int_least64_t *ptr, int_least64_t value) { - *ptr = value; -} - -// ----- atomic_intptr_t siblings --------------------------------------------- -// The MSVC polyfill defines `atomic_intptr_t` and `atomic_int_least64_t` as -// distinct typedefs; the plain `atomic_load` / `atomic_store` / etc. above -// only accept `atomic_int_least64_t *`. Without these siblings, code that -// touches an `atomic_intptr_t` field (e.g. BOCRequest::next, BOCCown::last, -// BOCRecycleQueue::head, BOCQueue::tag, NB_NOTICEBOARD_TID) would silently -// pass a mistyped pointer to the int64 polyfill on Windows. On POSIX C11 the -// same names are aliased to the generic atomic_* macros (which already -// dispatch on type via _Generic), so user code below is platform-uniform. -// -// All Interlocked*Pointer intrinsics on x86/x64 are full barriers; the -// pointer-width matches `intptr_t` on both Win32 and Win64 (CPython itself -// requires a sane intptr_t == void* relationship). -static inline intptr_t atomic_load_intptr(atomic_intptr_t *ptr) { return *ptr; } - -static inline void atomic_store_intptr(atomic_intptr_t *ptr, intptr_t value) { - *ptr = value; -} - -static inline intptr_t atomic_exchange_intptr(atomic_intptr_t *ptr, - intptr_t value) { - return (intptr_t)InterlockedExchangePointer((PVOID volatile *)ptr, - (PVOID)value); -} - -static inline bool atomic_compare_exchange_strong_intptr(atomic_intptr_t *ptr, - intptr_t *expected, - intptr_t desired) { - intptr_t prev = (intptr_t)InterlockedCompareExchangePointer( - (PVOID volatile *)ptr, (PVOID)desired, (PVOID)*expected); - if (prev == *expected) { - return true; - } - *expected = prev; - return false; -} - -// All Interlocked* intrinsics on x86/x64 are full barriers, so the -// memory_order argument is accepted but ignored. -// Note: atomic_load_explicit is a plain volatile read. On x86/x64 this -// provides acquire semantics due to TSO. Correctness of the parking -// protocol relies on the mutex-protected re-check, not on seq_cst ordering. 
-#define atomic_load_explicit(ptr, order) (*(ptr)) -#define atomic_fetch_add_explicit(ptr, val, order) \ - InterlockedExchangeAdd64((ptr), (val)) -#define atomic_fetch_sub_explicit(ptr, val, order) \ - InterlockedExchangeAdd64((ptr), -(val)) -#define memory_order_seq_cst 0 - -#define thread_local __declspec(thread) - -typedef SRWLOCK BOCMutex; -typedef CONDITION_VARIABLE BOCCond; - -static inline void boc_mtx_init(BOCMutex *m) { InitializeSRWLock(m); } - -static inline void mtx_destroy(BOCMutex *m) { (void)m; } - -static inline void mtx_lock(BOCMutex *m) { AcquireSRWLockExclusive(m); } - -static inline void mtx_unlock(BOCMutex *m) { ReleaseSRWLockExclusive(m); } - -static inline void cnd_init(BOCCond *c) { InitializeConditionVariable(c); } - -static inline void cnd_destroy(BOCCond *c) { (void)c; } - -static inline void cnd_signal(BOCCond *c) { WakeConditionVariable(c); } - -static inline void cnd_broadcast(BOCCond *c) { WakeAllConditionVariable(c); } - -static inline void cnd_wait(BOCCond *c, BOCMutex *m) { - SleepConditionVariableSRW(c, m, INFINITE, 0); -} - -/// @brief Wait on a condition variable for at most @p seconds. -/// @param c The condition variable -/// @param m The mutex (must be held by caller) -/// @return true if signalled (or spurious wake), false if the timeout expired -static inline bool cnd_timedwait_s(BOCCond *c, BOCMutex *m, double seconds) { - if (seconds < 0) - seconds = 0; - DWORD ms = (DWORD)(seconds * 1000.0); - BOOL ok = SleepConditionVariableSRW(c, m, ms, 0); - if (!ok && GetLastError() == ERROR_TIMEOUT) { - return false; - } - return true; -} - -void thrd_sleep(const struct timespec *duration, struct timespec *remaining) { - const DWORD MS_PER_NS = 1000000; - DWORD ms = (DWORD)duration->tv_sec * 1000; - ms += (DWORD)duration->tv_nsec / MS_PER_NS; - Sleep(ms); -} - -#elif defined __APPLE__ -#include -#include -#include -#define thrd_sleep nanosleep -#define thread_local _Thread_local - -typedef pthread_mutex_t BOCMutex; -typedef pthread_cond_t BOCCond; - -static inline void boc_mtx_init(BOCMutex *m) { pthread_mutex_init(m, NULL); } - -static inline void mtx_destroy(BOCMutex *m) { pthread_mutex_destroy(m); } - -static inline void mtx_lock(BOCMutex *m) { pthread_mutex_lock(m); } - -static inline void mtx_unlock(BOCMutex *m) { pthread_mutex_unlock(m); } - -static inline void cnd_init(BOCCond *c) { pthread_cond_init(c, NULL); } - -static inline void cnd_destroy(BOCCond *c) { pthread_cond_destroy(c); } - -static inline void cnd_signal(BOCCond *c) { pthread_cond_signal(c); } - -static inline void cnd_broadcast(BOCCond *c) { pthread_cond_broadcast(c); } - -static inline void cnd_wait(BOCCond *c, BOCMutex *m) { - pthread_cond_wait(c, m); -} - -/// @brief Wait on a condition variable for at most @p seconds. 
-/// @param c The condition variable -/// @param m The mutex (must be held by caller) -/// @return true if signalled (or spurious wake), false if the timeout expired -static inline bool cnd_timedwait_s(BOCCond *c, BOCMutex *m, double seconds) { - if (seconds < 0) - seconds = 0; - struct timespec ts; - clock_gettime(CLOCK_REALTIME, &ts); - double total = (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9 + seconds; - ts.tv_sec = (time_t)total; - ts.tv_nsec = (long)((total - (double)ts.tv_sec) * 1e9); - if (ts.tv_nsec >= 1000000000L) { - ts.tv_sec += 1; - ts.tv_nsec -= 1000000000L; - } - int rc = pthread_cond_timedwait(c, m, &ts); - return rc != ETIMEDOUT; -} - -#else // Linux -#include -#include -#include - -typedef mtx_t BOCMutex; -typedef cnd_t BOCCond; - -static inline void boc_mtx_init(BOCMutex *m) { mtx_init(m, mtx_plain); } - -/// @brief Wait on a condition variable for at most @p seconds. -/// @param c The condition variable -/// @param m The mutex (must be held by caller) -/// @return true if signalled (or spurious wake), false if the timeout expired -static inline bool cnd_timedwait_s(BOCCond *c, BOCMutex *m, double seconds) { - if (seconds < 0) - seconds = 0; - struct timespec ts; - clock_gettime(CLOCK_REALTIME, &ts); - double total = (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9 + seconds; - ts.tv_sec = (time_t)total; - ts.tv_nsec = (long)((total - (double)ts.tv_sec) * 1e9); - if (ts.tv_nsec >= 1000000000L) { - ts.tv_sec += 1; - ts.tv_nsec -= 1000000000L; - } - int rc = cnd_timedwait(c, m, &ts); - return rc != thrd_timedout; -} - -#endif - -#ifndef _WIN32 -// On POSIX the C11 atomic_* macros dispatch on type via _Generic, so the -// `atomic_load(&intptr_var)` form Just Works. The `_intptr` siblings are -// aliased to the generic forms purely so the source reads the same on -// every platform; on Windows they expand to dedicated InterlockedXxxPointer -// shims (see polyfill block above). -#define atomic_load_intptr(ptr) atomic_load(ptr) -#define atomic_store_intptr(ptr, val) atomic_store((ptr), (val)) -#define atomic_exchange_intptr(ptr, val) atomic_exchange((ptr), (val)) -#define atomic_compare_exchange_strong_intptr(ptr, expected, desired) \ - atomic_compare_exchange_strong((ptr), (expected), (desired)) -#endif +#include "compat.h" +#include "cown.h" +#include "noticeboard.h" +#include "sched.h" +#include "tags.h" +#include "terminator.h" +#include "xidata.h" // Forward declaration — BOCQueue is defined below. typedef struct boc_queue BOCQueue; @@ -269,15 +43,6 @@ static inline void boc_park_broadcast(BOCQueue *q); /// @param q The queue to park on static inline void boc_park_wait(BOCQueue *q); -/// @brief Returns the current time as double-precision seconds. -/// @return the current time -static double boc_now_s(void); - -#if PY_VERSION_HEX >= 0x030D0000 -#define Py_BUILD_CORE -#include -#endif - const struct timespec SLEEP_TS = {0, 1000}; const char *BOC_TIMEOUT = "__timeout__"; const int BOC_CAPACITY = 1024 * 16; @@ -288,189 +53,9 @@ atomic_int_least64_t BOC_COWN_COUNT = 0; #define BOC_SPIN_COUNT 64 #define BOC_BACKOFF_CAP_NS 1000000 // 1 ms -// Portable yield: relinquish current CPU timeslice. 
-#ifdef _WIN32 -#define boc_yield() SwitchToThread() -#else -#include -#include -#define boc_yield() sched_yield() -#endif - // #define BOC_REF_TRACKING // #define BOC_TRACE -#if PY_VERSION_HEX >= 0x030E0000 // 3.14 - -#define XIDATA_FREE _PyXIData_Free -#define XIDATA_SET_FREE _PyXIData_SET_FREE -#define XIDATA_NEW() _PyXIData_New() -#define XIDATA_NEWOBJECT _PyXIData_NewObject -#define XIDATA_GETXIDATA(value, xidata) \ - _PyObject_GetXIDataNoFallback(PyThreadState_GET(), (value), (xidata)) -#define XIDATA_INIT _PyXIData_Init -#define XIDATA_REGISTERCLASS(type, cb) \ - _PyXIData_RegisterClass(PyThreadState_GET(), (type), \ - (_PyXIData_getdata_t){.basic = (cb)}) -#define XIDATA_T _PyXIData_t - -static bool xidata_supported(PyObject *op) { - _PyXIData_getdata_t getdata = _PyXIData_Lookup(PyThreadState_GET(), op); - return getdata.basic != NULL || getdata.fallback != NULL; -} - -#elif PY_VERSION_HEX >= 0x030D0000 // 3.13 - -#define XIDATA_FREE _PyCrossInterpreterData_Free -#define XIDATA_NEW() _PyCrossInterpreterData_New() -#define XIDATA_NEWOBJECT _PyCrossInterpreterData_NewObject -#define XIDATA_GETXIDATA(value, xidata) \ - _PyObject_GetCrossInterpreterData((value), (xidata)) -#define XIDATA_INIT _PyCrossInterpreterData_Init -#define XIDATA_REGISTERCLASS(type, cb) \ - _PyCrossInterpreterData_RegisterClass((type), (crossinterpdatafunc)(cb)) -#define XIDATA_T _PyCrossInterpreterData - -static void xidata_set_free(XIDATA_T *xidata, void (*freefunc)(void *)) { - xidata->free = freefunc; -} - -static bool xidata_supported(PyObject *op) { - crossinterpdatafunc getdata = _PyCrossInterpreterData_Lookup(op); - return getdata != NULL; -} - -#define XIDATA_SET_FREE xidata_set_free - -#elif PY_VERSION_HEX >= 0x030C0000 // 3.12 - -#define XIDATA_NEWOBJECT _PyCrossInterpreterData_NewObject -#define XIDATA_INIT _PyCrossInterpreterData_Init -#define XIDATA_GETXIDATA(value, xidata) \ - _PyObject_GetCrossInterpreterData((value), (xidata)) -#define XIDATA_REGISTERCLASS(type, cb) \ - _PyCrossInterpreterData_RegisterClass((type), (crossinterpdatafunc)(cb)) -#define XIDATA_T _PyCrossInterpreterData - -static XIDATA_T *xidata_new() { - XIDATA_T *xidata = (XIDATA_T *)PyMem_RawMalloc(sizeof(XIDATA_T)); - xidata->data = NULL; - xidata->free = NULL; - xidata->interp = -1; - xidata->new_object = NULL; - xidata->obj = NULL; - return xidata; -} - -static void xidata_set_free(XIDATA_T *xidata, void (*freefunc)(void *)) { - xidata->free = freefunc; -} - -static bool xidata_supported(PyObject *op) { - crossinterpdatafunc getdata = _PyCrossInterpreterData_Lookup(op); - return getdata != NULL; -} - -static void xidata_free(void *arg) { - XIDATA_T *xidata = (XIDATA_T *)arg; - if (xidata->data != NULL) { - if (xidata->free != NULL) { - xidata->free(xidata->data); - } - xidata->data = NULL; - } - Py_CLEAR(xidata->obj); - PyMem_RawFree(arg); -} - -#define XIDATA_SET_FREE xidata_set_free -#define XIDATA_NEW xidata_new -#define XIDATA_FREE xidata_free - -#else - -#define BOC_NO_MULTIGIL - -#define XIDATA_NEWOBJECT _PyCrossInterpreterData_NewObject -#define XIDATA_GETXIDATA(value, xidata) \ - _PyObject_GetCrossInterpreterData((value), (xidata)) -#define XIDATA_REGISTERCLASS(type, cb) \ - _PyCrossInterpreterData_RegisterClass((type), (crossinterpdatafunc)(cb)) -#define XIDATA_T _PyCrossInterpreterData - -static void xidata_set_free(XIDATA_T *xidata, void (*freefunc)(void *)) { - xidata->free = freefunc; -} - -static void xidata_free(void *arg) { - XIDATA_T *xidata = (XIDATA_T *)arg; - if (xidata->data != NULL) { - if (xidata->free 
!= NULL) { - xidata->free(xidata->data); - } - xidata->data = NULL; - } - Py_CLEAR(xidata->obj); - PyMem_RawFree(arg); -} - -static XIDATA_T *xidata_new() { - XIDATA_T *xidata = (XIDATA_T *)PyMem_RawMalloc(sizeof(XIDATA_T)); - xidata->data = NULL; - xidata->free = NULL; - xidata->interp = -1; - xidata->new_object = NULL; - xidata->obj = NULL; - return xidata; -} - -static void xidata_init(XIDATA_T *data, PyInterpreterState *interp, - void *shared, PyObject *obj, - PyObject *(*new_object)(_PyCrossInterpreterData *)) { - assert(data->data == NULL); - assert(data->obj == NULL); - *data = (_PyCrossInterpreterData){0}; - data->interp = -1; - - assert(data != NULL); - assert(new_object != NULL); - data->data = shared; - if (obj != NULL) { - assert(interp != NULL); - data->obj = Py_NewRef(obj); - } - data->interp = (interp != NULL) ? PyInterpreterState_GetID(interp) : -1; - data->new_object = new_object; -} - -#define XIDATA_SET_FREE xidata_set_free -#define XIDATA_NEW xidata_new -#define XIDATA_INIT xidata_init -#define XIDATA_FREE xidata_free - -static bool xidata_supported(PyObject *op) { - crossinterpdatafunc getdata = _PyCrossInterpreterData_Lookup(op); - return getdata != NULL; -} - -PyObject *PyErr_GetRaisedException(void) { - PyObject *et = NULL; - PyObject *ev = NULL; - PyObject *tb = NULL; - PyErr_Fetch(&et, &ev, &tb); - assert(et); - PyErr_NormalizeException(&et, &ev, &tb); - if (tb != NULL) { - PyException_SetTraceback(ev, tb); - Py_DECREF(tb); - } - Py_XDECREF(et); - - return ev; -} - -#endif - /// @brief Note in a RecycleQueue. typedef struct boc_recycle_node { /// @brief XIData to free on the source interpreter @@ -535,19 +120,23 @@ typedef struct boc_queue { BOCMutex park_mutex; /// @brief Condition variable for parking receivers BOCCond park_cond; -} BOCQueue; -/// @brief A tag for a BOC message. -typedef struct boc_tag { - /// @brief The UTF-8 string value of the tag - char *str; - /// @brief The number of bytes in str (not including the NULL) - Py_ssize_t size; - /// @brief A pointer to the queue that this tag is associated with - BOCQueue *queue; - atomic_int_least64_t rc; - atomic_int_least64_t disabled; -} BOCTag; + // Contention counters. Bumped with BOC_MO_RELAXED inside + // boc_enqueue / boc_dequeue. Read by `_core.queue_stats()`. Grouped + // and padded so they sit on their own cacheline and do not + // false-share with the hot head/tail/state above. Typed via + // `compat.h` so the build works on MSVC (which has no `_Atomic`). + /// @brief CAS retries observed by enqueuers contending on @c tail. + boc_atomic_u64_t enqueue_cas_retries; + /// @brief CAS retries observed by dequeuers contending on @c head. + boc_atomic_u64_t dequeue_cas_retries; + /// @brief Successful enqueues (post-CAS). + boc_atomic_u64_t pushed_total; + /// @brief Successful dequeues (post-CAS). + boc_atomic_u64_t popped_total; + /// @brief Padding so the next BOCQueue starts on a fresh cacheline. 
+ char _pad_counters[64 - (4 * sizeof(uint64_t)) % 64]; +} BOCQueue; #define BOC_QUEUE_COUNT 16 const int_least64_t BOC_QUEUE_UNASSIGNED = 0; @@ -557,189 +146,6 @@ static BOCQueue BOC_QUEUES[BOC_QUEUE_COUNT]; static BOCRecycleQueue *BOC_RECYCLE_QUEUE_TAIL = NULL; static atomic_intptr_t BOC_RECYCLE_QUEUE_HEAD = 0; -// --------------------------------------------------------------------------- -// Noticeboard -// --------------------------------------------------------------------------- - -#define NB_MAX_ENTRIES 64 -#define NB_KEY_SIZE 64 - -// Forward declarations needed by NoticeboardEntry and the noticeboard -// helpers below. The full definitions of BOCCown and its refcount helpers -// appear further down the file (the noticeboard predates the cown -// machinery in source order, but the new pin-tracking support added for -// the snapshot cache needs the cown refcount macros). -typedef struct boc_cown BOCCown; -static int_least64_t cown_incref(BOCCown *cown); -static int_least64_t cown_decref(BOCCown *cown); -#define COWN_INCREF(c) cown_incref((c)) -#define COWN_DECREF(c) cown_decref(c) - -// CownCapsule forward declaration so the noticeboard pin helper can fish -// the underlying BOCCown out of a Python CownCapsule. The struct body is -// defined alongside the type's PyTypeObject further down. -typedef struct cown_capsule_object { - PyObject_HEAD BOCCown *cown; -} CownCapsuleObject; - -/// @brief A single noticeboard entry -typedef struct nb_entry { - /// @brief The key for this entry (null-terminated UTF-8) - char key[NB_KEY_SIZE]; - /// @brief The serialized cross-interpreter data - XIDATA_T *value; - /// @brief Whether the value was pickled during serialization - bool pickled; - /// @brief BOCCowns referenced by @ref value, pinned by this entry - /// @details Allocated with @c PyMem_RawMalloc; each pointer holds one - /// strong reference (COWN_INCREF). When the entry is overwritten, - /// deleted, or cleared, every pointer is COWN_DECREFed and the array - /// is freed. This is the noticeboard's mechanism for keeping the - /// underlying BOCCowns alive across the 1-pickle / N-unpickle cycle: - /// pickling no longer adds a pin (see @ref CownCapsule_reduce). - BOCCown **pinned_cowns; - /// @brief Number of entries in @ref pinned_cowns - int pinned_count; -} NoticeboardEntry; - -/// @brief Global noticeboard for cross-behavior key-value storage -typedef struct noticeboard { - /// @brief The stored entries - NoticeboardEntry entries[NB_MAX_ENTRIES]; - /// @brief The number of entries currently stored - int count; - /// @brief Mutex protecting the noticeboard - BOCMutex mutex; -} Noticeboard; - -static Noticeboard NB; - -/// @brief Monotonic version counter for the noticeboard -/// @details Incremented under @ref Noticeboard::mutex on every successful -/// write, delete, or clear. Threads use this to lazily invalidate their -/// thread-local snapshot cache without taking the noticeboard mutex on -/// every read. Exposed to Python via @ref _core_noticeboard_version for -/// users who want to detect noticeboard changes without taking a full -/// snapshot. -static atomic_int_least64_t NB_VERSION = 0; - -/// @brief Thread-local snapshot cache for the current behavior -static thread_local PyObject *NB_SNAPSHOT_CACHE = NULL; - -/// @brief Version of the noticeboard at the time the cached snapshot was built -/// @details Captured under @ref Noticeboard::mutex during the rebuild. A -/// reader that finds @ref NB_VERSION equal to this value can reuse the -/// cached dict without rebuilding. 
-static thread_local int_least64_t NB_SNAPSHOT_VERSION = -1; - -/// @brief Whether the cached snapshot has been version-checked this behavior -/// @details Cleared by @ref _core_noticeboard_cache_clear at every behavior -/// boundary (see @c worker.py). Set to @c true on the first snapshot call -/// of a behavior. Subsequent calls within the same behavior return the -/// cached dict without consulting @ref NB_VERSION at all, preserving the -/// no-polling invariant: the noticeboard cannot be used as a synchronous -/// communication channel between behaviors. -static thread_local bool NB_VERSION_CHECKED = false; - -/// @brief Read-only proxy wrapping the cached snapshot dict -/// @details A @c types.MappingProxyType created over @ref NB_SNAPSHOT_CACHE -/// once per rebuild and returned to callers in place of the dict. Prevents -/// user code from mutating the cached snapshot, which would otherwise -/// corrupt every subsequent reader on the same thread until the next -/// @ref NB_VERSION bump. -static thread_local PyObject *NB_SNAPSHOT_PROXY = NULL; - -/// @brief Thread identity of the noticeboard mutator thread, or 0 if unset -/// @details Set by @ref _core_set_noticeboard_thread at runtime startup -/// and checked by @ref _core_noticeboard_write_direct and -/// @ref _core_noticeboard_delete to enforce the invariant that only the -/// noticeboard thread mutates the noticeboard. This eliminates the TOCTOU -/// window in the Python-level read-modify-write performed by -/// @c noticeboard_update. -static atomic_intptr_t NB_NOTICEBOARD_TID = 0; - -// --------------------------------------------------------------------------- -// notice_sync() — opt-in barrier for the noticeboard thread. -// -// The noticeboard thread runs independently of the behavior dispatch path, so -// notice_write/_update/_delete are fire-and-forget. Callers that need -// read-your-writes ordering use notice_sync(): -// 1. notice_sync_request() atomically allocates a monotonic sequence -// number and returns it. -// 2. The caller posts ("sync", N) on the boc_noticeboard tag. -// 3. The noticeboard-thread arm calls notice_sync_complete(N), which -// stores N into NB_SYNC_PROCESSED (monotonic, max-of) and broadcasts -// NB_SYNC_COND. -// 4. The caller blocks in notice_sync_wait(my_seq, timeout) on -// NB_SYNC_COND until NB_SYNC_PROCESSED >= my_seq, or returns false -// on timeout. -// -// All synchronization lives in C primitives so the barrier works across -// sub-interpreters (Python locks do not span interpreters). -// --------------------------------------------------------------------------- - -/// @brief Monotonic counter incremented by every notice_sync caller. -/// @details Sized for ~292 years of continuous 1 GHz fetch_add traffic -/// before wrap; treated as effectively non-wrapping. If the wrap -/// precondition ever becomes plausible (e.g. a much faster mutator), -/// switch to @c atomic_uint_least64_t and update the wrap arithmetic -/// in @ref _core_notice_sync_wait. -static atomic_int_least64_t NB_SYNC_REQUESTED = 0; - -/// @brief Highest sequence number processed by the noticeboard thread. -static atomic_int_least64_t NB_SYNC_PROCESSED = 0; - -/// @brief Mutex protecting NB_SYNC_COND. -static BOCMutex NB_SYNC_MUTEX; - -/// @brief Condition variable signalled when NB_SYNC_PROCESSED advances. -static BOCCond NB_SYNC_COND; - -// --------------------------------------------------------------------------- -// Terminator — C-level run-down counter. 
-// -// Process-global rundown counter that gates @c terminator_wait. Used by the -// Python @c wait()/@c stop() lifecycle to block until every in-flight -// behavior has retired. The counter is incremented from caller threads in -// @c whencall (before the schedule call) and decremented from worker -// threads after @c behavior_release_all completes. A one-shot "Pyrona -// seed" of 1 keeps the count positive between the runtime starting and -// @c stop() taking it down via @c terminator_seed_dec. -// -// Lifecycle: -// - @c terminator_reset arms a fresh runtime: count = 1 (the seed), -// seeded = 1, closed = 0. Returns the prior (count, seeded) so -// @c Behaviors.start can detect drift carried over from a previous -// run that died without reconciliation. -// - @c terminator_inc returns -1 once @c terminator_close has been -// called, so the @c whencall fast path can refuse new work without -// racing teardown. -// - @c terminator_seed_dec is the idempotent one-shot that drops the -// seed; subsequent calls are no-ops. -// - @c terminator_wait blocks on the condvar until count reaches 0. -// - @c terminator_close raises the closed bit so any straggler -// @c terminator_inc returns -1. -// -// State is process-global (file-scope statics, NOT per-interpreter) so -// every sub-interpreter sees the same counter, mutex, and condvar. -// --------------------------------------------------------------------------- - -/// @brief Active behavior count + the Pyrona seed. -static atomic_int_least64_t TERMINATOR_COUNT = 0; - -/// @brief Set to 1 by terminator_close() to refuse further increments. -static atomic_int_least64_t TERMINATOR_CLOSED = 0; - -/// @brief One-shot guard for the Pyrona seed: 1 = seed still present. -static atomic_int_least64_t TERMINATOR_SEEDED = 0; - -/// @brief Mutex protecting TERMINATOR_COND. -static BOCMutex TERMINATOR_MUTEX; - -/// @brief Condition variable signalled when TERMINATOR_COUNT reaches 0. -static BOCCond TERMINATOR_COND; - -// --------------------------------------------------------------------------- // Platform condvar implementation // --------------------------------------------------------------------------- @@ -769,135 +175,6 @@ static inline void boc_park_wait(BOCQueue *q) { // Noticeboard function implementations are below object_to_xidata -/// @brief Creates a new BOCTag object from a Python Unicode string. -/// @details The result object will not be dependent on the argument in any way -/// (i.e., it can be safely deallocated). -/// @param unicode A PyUnicode object -/// @param queue The queue to associate with this tag -/// @return a new BOCTag object -BOCTag *tag_from_PyUnicode(PyObject *unicode, BOCQueue *queue) { - if (!PyUnicode_CheckExact(unicode)) { - PyErr_SetString(PyExc_TypeError, "Must be a str"); - return NULL; - } - - BOCTag *tag = (BOCTag *)PyMem_RawMalloc(sizeof(BOCTag)); - if (tag == NULL) { - PyErr_NoMemory(); - return NULL; - } - - const char *str = PyUnicode_AsUTF8AndSize(unicode, &tag->size); - if (str == NULL) { - return NULL; - } - - tag->str = (char *)PyMem_RawMalloc(tag->size + 1); - - if (tag->str == NULL) { - PyErr_NoMemory(); - return NULL; - } - - memcpy(tag->str, str, tag->size + 1); - tag->queue = queue; - atomic_store(&tag->rc, 0); - atomic_store(&tag->disabled, 0); - - return tag; -} - -/// @brief Converts a BOCTag to a PyUnicode object. -/// @note This method uses PyUnicode_FromStringAndSize() internally. -/// @param tag The tag to convert -/// @return A new reference to a PyUnicode object. 
-PyObject *tag_to_PyUnicode(BOCTag *tag) { - return PyUnicode_FromStringAndSize(tag->str, tag->size); -} - -/// @brief Frees a BOCTag object and any associated memory. -/// @param tag The tag to free -void BOCTag_free(BOCTag *tag) { - PyMem_RawFree(tag->str); - PyMem_RawFree(tag); -} - -static int_least64_t tag_decref(BOCTag *tag) { - int_least64_t rc = atomic_fetch_add(&tag->rc, -1) - 1; - if (rc == 0) { - BOCTag_free(tag); - } - - return rc; -} - -#define TAG_DECREF(t) tag_decref(t) - -static int_least64_t tag_incref(BOCTag *tag) { - return atomic_fetch_add(&tag->rc, 1) + 1; -} - -#define TAG_INCREF(t) tag_incref(t) - -bool tag_is_disabled(BOCTag *tag) { return atomic_load(&tag->disabled); } - -void tag_disable(BOCTag *tag) { atomic_store(&tag->disabled, 1); } - -/// @brief Compares a BOCTag with a UTF8 string. -/// @details -1 if the tag should be placed before, 1 if after, 0 if equivalent -/// @param lhs The BOCtag to compare -/// @param rhs_str The string to compare with -/// @param rhs_size The length of the comparison string -/// @return -1 if before, 1 if after, 0 if equivalent -int tag_compare_with_utf8(BOCTag *lhs, const char *rhs_str, - Py_ssize_t rhs_size) { - Py_ssize_t size = lhs->size < rhs_size ? lhs->size : rhs_size; - char *lhs_ptr = lhs->str; - const char *rhs_ptr = rhs_str; - for (Py_ssize_t i = 0; i < size; ++i, ++lhs_ptr, ++rhs_ptr) { - int8_t a = (int8_t)(*lhs_ptr); - int8_t b = (int8_t)(*rhs_ptr); - - if (a < b) { - return -1; - } - if (a > b) { - return 1; - } - } - - if (lhs->size < rhs_size) { - return -1; - } - - if (lhs->size > rhs_size) { - return 1; - } - - return 0; -} - -/// @brief Compares a BOCTag with a PyUnicode object. -/// @details -1 if the tag should be placed before, 1 if after, 0 if equivalent -/// @param lhs The BOCtag to compare -/// @param rhs_str The string to compare with -/// @param rhs_size The length of the comparison string -/// @return -1 if before, 1 if after, 0 if equivalent -int tag_compare_with_PyUnicode(BOCTag *lhs, PyObject *rhs_op) { - if (!PyUnicode_CheckExact(rhs_op)) { - PyErr_SetString(PyExc_TypeError, "Must be a str"); - return -2; - } - - Py_ssize_t rhs_size = -1; - const char *rhs_str = PyUnicode_AsUTF8AndSize(rhs_op, &rhs_size); - if (rhs_str == NULL) { - return -2; - } - - return tag_compare_with_utf8(lhs, rhs_str, rhs_size); -} - /// @brief State for the module. typedef struct boc_state { /// @brief The index (monotonically increasing) for this module. @@ -1154,137 +431,6 @@ static PyObject *object_to_xidata(PyObject *value, XIDATA_T **xidata_ptr) { // Noticeboard C functions // --------------------------------------------------------------------------- -/// @brief Reject a noticeboard mutation called from outside the noticeboard -/// thread. -/// @details Sets a Python @c RuntimeError if a noticeboard thread has been -/// registered (via @ref _core_set_noticeboard_thread) and the calling thread -/// is not it. Prior to runtime startup the check is permissive so that -/// @c Behaviors.stop and unit tests can drive the noticeboard from the -/// main thread before the noticeboard thread is up. The single-writer -/// invariant is what makes the Python-level read-modify-write in -/// @c noticeboard_update TOCTOU-free. 
-/// @param op_name Name of the operation, used in the error message -/// @return 0 on success, -1 on error (with exception set) -static int nb_check_noticeboard_thread(const char *op_name) { - uintptr_t owner = (uintptr_t)atomic_load_intptr(&NB_NOTICEBOARD_TID); - if (owner == 0) { - return 0; - } - uintptr_t current = (uintptr_t)PyThread_get_thread_ident(); - if (current != owner) { - PyErr_Format(PyExc_RuntimeError, - "%s must be called from the noticeboard thread", op_name); - return -1; - } - return 0; -} - -/// @brief Take strong references to every CownCapsule in @p cowns -/// @details Allocates a fresh @c BOCCown** array (or returns NULL if -/// @p cowns is empty), iterates the sequence calling @c COWN_INCREF on -/// each entry's underlying BOCCown, and writes the resulting array and -/// count to @p out_array / @p out_count. On error, no INCREFs leak: any -/// already-taken pins are dropped before return. -/// @param cowns A Python sequence of CownCapsule objects (may be NULL or -/// None for "no pins") -/// @param out_array Out param for the allocated array -/// @param out_count Out param for the number of entries -/// @return 0 on success, -1 on error (with exception set) -/// -/// @details The caller is expected to pass a sequence of integer pointers -/// to BOCCown structs that have already been COWN_INCREFed by the writer -/// thread (typically via @ref _core_cown_pin_pointers). This function -/// **transfers** those refs into the noticeboard entry: it does not take -/// any additional ref. On error every transferred ref is released so the -/// caller can treat -1 as "ownership not taken, original refs already -/// released". -static int nb_pin_cowns(PyObject *cowns, BOCCown ***out_array, int *out_count) { - *out_array = NULL; - *out_count = 0; - - if (cowns == NULL || cowns == Py_None) { - return 0; - } - - PyObject *seq = - PySequence_Fast(cowns, "noticeboard pin list must be a sequence"); - if (seq == NULL) { - return -1; - } - - Py_ssize_t n = PySequence_Fast_GET_SIZE(seq); - if (n == 0) { - Py_DECREF(seq); - return 0; - } - - BOCCown **pins = (BOCCown **)PyMem_RawMalloc(sizeof(BOCCown *) * n); - if (pins == NULL) { - Py_DECREF(seq); - PyErr_NoMemory(); - return -1; - } - - int taken = 0; - for (Py_ssize_t i = 0; i < n; i++) { - PyObject *item = PySequence_Fast_GET_ITEM(seq, i); - BOCCown *cown = (BOCCown *)PyLong_AsVoidPtr(item); - if (cown == NULL) { - // PyLong_AsVoidPtr returns NULL both on error and for integer 0. - // Reject both paths explicitly: a NULL pin would be dereferenced - // downstream (COWN_DECREF on NULL is UB), and an integer 0 is - // indistinguishable from a crafted attacker pin pointing at the - // zero page. - if (!PyErr_Occurred()) { - PyErr_SetString(PyExc_ValueError, - "noticeboard pin list must not contain NULL / " - "integer 0 entries"); - } else { - PyErr_SetString(PyExc_TypeError, - "noticeboard pin list must contain only integer " - "BOCCown pointers (use _core.cown_pin_pointers())"); - } - goto fail; - } - pins[taken++] = cown; - } - - Py_DECREF(seq); - *out_array = pins; - *out_count = taken; - return 0; - -fail: - // Release every transferred ref the writer pre-INCREFed for us. The - // ones we already stashed into `pins` plus the rest of the sequence - // we never reached. 
- for (int i = 0; i < taken; i++) { - COWN_DECREF(pins[i]); - } - for (Py_ssize_t i = (Py_ssize_t)taken + 1; i < n; i++) { - PyObject *item = PySequence_Fast_GET_ITEM(seq, i); - BOCCown *c = (BOCCown *)PyLong_AsVoidPtr(item); - if (c != NULL) { - COWN_DECREF(c); - } else { - PyErr_Clear(); - } - } - PyMem_RawFree(pins); - Py_DECREF(seq); - return -1; -} - -/// @brief Drop the calling thread's snapshot cache and proxy -/// @details Both objects are decref-cleared and the per-behavior version -/// state is reset. Safe to call when nothing is cached. -static void nb_drop_local_cache(void) { - Py_CLEAR(NB_SNAPSHOT_PROXY); - Py_CLEAR(NB_SNAPSHOT_CACHE); - NB_SNAPSHOT_VERSION = -1; - NB_VERSION_CHECKED = false; -} - /// @brief Write a key-value pair into the noticeboard under mutex /// @details The value is serialized to XIData here (in the main interpreter), /// so XIDATA_FREE is always safe to call from the same interpreter. The @@ -1307,7 +453,7 @@ static PyObject *_core_noticeboard_write_direct(PyObject *self, return NULL; } - if (nb_check_noticeboard_thread("noticeboard_write_direct") < 0) { + if (noticeboard_check_thread("noticeboard_write_direct") < 0) { return NULL; } @@ -1320,18 +466,6 @@ static PyObject *_core_noticeboard_write_direct(PyObject *self, return NULL; } - if (key_len >= NB_KEY_SIZE) { - PyErr_SetString(PyExc_ValueError, - "noticeboard key too long (max 63 UTF-8 bytes)"); - return NULL; - } - - if (memchr(key, '\0', key_len) != NULL) { - PyErr_SetString(PyExc_ValueError, - "noticeboard key must not contain NUL characters"); - return NULL; - } - // Pin the cowns BEFORE serializing so an error here does not leave us // with a stored entry whose cowns can be freed under us. BOCCown **new_pins = NULL; @@ -1340,7 +474,7 @@ static PyObject *_core_noticeboard_write_direct(PyObject *self, return NULL; } - // Serialize the value to XIData in the main interpreter + // Serialize the value to XIData in the main interpreter. XIDATA_T *xidata = NULL; PyObject *pickled = object_to_xidata(value, &xidata); if (pickled == NULL) { @@ -1358,69 +492,17 @@ static PyObject *_core_noticeboard_write_direct(PyObject *self, bool is_pickled = (pickled == Py_True); Py_DECREF(pickled); - mtx_lock(&NB.mutex); - - // find existing entry or allocate new one - NoticeboardEntry *target = NULL; - for (int i = 0; i < NB.count; i++) { - if (strncmp(NB.entries[i].key, key, NB_KEY_SIZE) == 0) { - target = &NB.entries[i]; - break; - } - } - - if (target == NULL) { - if (NB.count >= NB_MAX_ENTRIES) { - mtx_unlock(&NB.mutex); - XIDATA_FREE(xidata); - for (int i = 0; i < new_pin_count; i++) { - COWN_DECREF(new_pins[i]); - } - PyMem_RawFree(new_pins); - PyErr_SetString(PyExc_RuntimeError, "Noticeboard is full (max 64)"); - return NULL; - } - target = &NB.entries[NB.count++]; - strncpy(target->key, key, NB_KEY_SIZE - 1); - target->key[NB_KEY_SIZE - 1] = '\0'; - target->value = NULL; - target->pinned_cowns = NULL; - target->pinned_count = 0; - } - - // Stash old value and old pins to free after releasing the mutex — - // XIDATA_FREE / COWN_DECREF may invoke Python __del__ which could - // re-enter the noticeboard. 
- XIDATA_T *old_value = target->value; - BOCCown **old_pins = target->pinned_cowns; - int old_pin_count = target->pinned_count; - - target->value = xidata; - target->pickled = is_pickled; - target->pinned_cowns = new_pins; - target->pinned_count = new_pin_count; - - // Bump the version under mutex so readers' acquire loads can lazily - // invalidate their thread-local snapshot caches without us touching - // their cache directly. - atomic_fetch_add(&NB_VERSION, 1); - - mtx_unlock(&NB.mutex); - - if (old_value != NULL) { - XIDATA_FREE(old_value); - } - if (old_pins != NULL) { - for (int i = 0; i < old_pin_count; i++) { - COWN_DECREF(old_pins[i]); - } - PyMem_RawFree(old_pins); + // noticeboard_write takes ownership of xidata + pins on success and + // frees them on failure. + if (noticeboard_write(key, key_len, xidata, is_pickled, new_pins, + new_pin_count) < 0) { + return NULL; } - // Note: this thread's NB_SNAPSHOT_CACHE is intentionally NOT cleared. - // Within a behavior, a writer must not observe its own write — that is - // the no-polling invariant. The cache will be lazily revalidated at - // the next behavior boundary (see _core_noticeboard_cache_clear). + // Note: this thread's snapshot cache is intentionally NOT cleared. + // Within a behavior, a writer must not observe its own write — that + // is the no-polling invariant. The cache will be lazily revalidated + // at the next behavior boundary (see _core_noticeboard_cache_clear). Py_RETURN_NONE; } @@ -1447,209 +529,7 @@ static PyObject *_core_noticeboard_write_direct(PyObject *self, static PyObject *_core_noticeboard_snapshot(PyObject *self, PyObject *Py_UNUSED(dummy)) { BOC_STATE_SET(self); - - if (NB_SNAPSHOT_PROXY != NULL) { - if (NB_VERSION_CHECKED) { - // Within-behavior repeat call: same proxy, no atomic load. - Py_INCREF(NB_SNAPSHOT_PROXY); - return NB_SNAPSHOT_PROXY; - } - // First snapshot call this behavior: do exactly one version check. - int_least64_t current = atomic_load(&NB_VERSION); - if (current == NB_SNAPSHOT_VERSION) { - NB_VERSION_CHECKED = true; - Py_INCREF(NB_SNAPSHOT_PROXY); - return NB_SNAPSHOT_PROXY; - } - nb_drop_local_cache(); - } - - PyObject *dict = PyDict_New(); - if (dict == NULL) { - return NULL; - } - - // Deferred entries: pickled values whose bytes were extracted under mutex - // but need unpickling outside the lock. - PyObject *deferred_keys[NB_MAX_ENTRIES]; - PyObject *deferred_bytes[NB_MAX_ENTRIES]; - int deferred_count = 0; - - // Keepalive pins: while we hold the mutex we take an extra COWN_INCREF - // on every pin reachable from a deferred (pickled) entry. The bytes we - // are about to unpickle outside the mutex contain raw BOCCown pointers - // whose validity depends on the entry's pin list. Without this extra - // ref, a concurrent writer could overwrite the entry the instant we - // drop the mutex, release the old pins, and free the BOCCowns before - // we touch them — UAF in _cown_capsule_from_pointer. Released after - // the deferred unpickling completes. Each deferred entry contributes - // a heap-allocated pin pointer array sized to its pin count. - BOCCown **keepalive_pins[NB_MAX_ENTRIES]; - int keepalive_counts[NB_MAX_ENTRIES]; - for (int i = 0; i < NB_MAX_ENTRIES; i++) { - keepalive_pins[i] = NULL; - keepalive_counts[i] = 0; - } - - mtx_lock(&NB.mutex); - - // Capture the noticeboard version while still holding the mutex so - // that no concurrent writer can bump it between snapshot completion - // and version capture. 
- int_least64_t built_version = atomic_load(&NB_VERSION); - - for (int i = 0; i < NB.count; i++) { - NoticeboardEntry *entry = &NB.entries[i]; - if (entry->value == NULL) { - continue; - } - - // XIDATA_NEWOBJECT is lightweight (no Python code execution) - PyObject *raw = XIDATA_NEWOBJECT(entry->value); - if (raw == NULL) { - mtx_unlock(&NB.mutex); - goto fail_deferred; - } - - PyObject *key = PyUnicode_FromString(entry->key); - if (key == NULL) { - Py_DECREF(raw); - mtx_unlock(&NB.mutex); - goto fail_deferred; - } - - if (!entry->pickled) { - // Non-pickled: add directly to dict - if (PyDict_SetItem(dict, key, raw) < 0) { - Py_DECREF(key); - Py_DECREF(raw); - mtx_unlock(&NB.mutex); - goto fail_deferred; - } - Py_DECREF(key); - Py_DECREF(raw); - } else { - // Pickled: defer unpickling to outside the mutex. Take a fresh - // COWN_INCREF on every pin so the BOCCowns referenced by the bytes - // survive past mtx_unlock — see keepalive_pins comment above. - if (entry->pinned_count > 0) { - BOCCown **pins = (BOCCown **)PyMem_RawMalloc(sizeof(BOCCown *) * - entry->pinned_count); - if (pins == NULL) { - Py_DECREF(key); - Py_DECREF(raw); - mtx_unlock(&NB.mutex); - PyErr_NoMemory(); - goto fail_deferred; - } - for (int j = 0; j < entry->pinned_count; j++) { - pins[j] = entry->pinned_cowns[j]; - COWN_INCREF(pins[j]); - } - keepalive_pins[deferred_count] = pins; - keepalive_counts[deferred_count] = entry->pinned_count; - } - deferred_keys[deferred_count] = key; - deferred_bytes[deferred_count] = raw; - deferred_count++; - } - } - - mtx_unlock(&NB.mutex); - - // Unpickle deferred entries outside the mutex - for (int i = 0; i < deferred_count; i++) { - PyObject *value = _PyPickle_Loads(deferred_bytes[i]); - Py_DECREF(deferred_bytes[i]); - deferred_bytes[i] = NULL; - - if (value == NULL) { - Py_DECREF(deferred_keys[i]); - deferred_keys[i] = NULL; - // Clean up remaining deferred entries - for (int j = i + 1; j < deferred_count; j++) { - Py_DECREF(deferred_keys[j]); - Py_DECREF(deferred_bytes[j]); - } - // Release every keepalive pin (including the one for this entry). - for (int j = 0; j < deferred_count; j++) { - if (keepalive_pins[j] != NULL) { - for (int k = 0; k < keepalive_counts[j]; k++) { - COWN_DECREF(keepalive_pins[j][k]); - } - PyMem_RawFree(keepalive_pins[j]); - keepalive_pins[j] = NULL; - } - } - Py_DECREF(dict); - return NULL; - } - - if (PyDict_SetItem(dict, deferred_keys[i], value) < 0) { - Py_DECREF(deferred_keys[i]); - Py_DECREF(value); - for (int j = i + 1; j < deferred_count; j++) { - Py_DECREF(deferred_keys[j]); - Py_DECREF(deferred_bytes[j]); - } - for (int j = 0; j < deferred_count; j++) { - if (keepalive_pins[j] != NULL) { - for (int k = 0; k < keepalive_counts[j]; k++) { - COWN_DECREF(keepalive_pins[j][k]); - } - PyMem_RawFree(keepalive_pins[j]); - keepalive_pins[j] = NULL; - } - } - Py_DECREF(dict); - return NULL; - } - - Py_DECREF(deferred_keys[i]); - Py_DECREF(value); - - // Successful unpickle: the snapshot dict (and its CownCapsules) - // now hold their own refs on every BOCCown referenced by the bytes. - // Drop our keepalive pin for this entry. 
- if (keepalive_pins[i] != NULL) { - for (int k = 0; k < keepalive_counts[i]; k++) { - COWN_DECREF(keepalive_pins[i][k]); - } - PyMem_RawFree(keepalive_pins[i]); - keepalive_pins[i] = NULL; - } - } - - PyObject *proxy = PyDictProxy_New(dict); - if (proxy == NULL) { - Py_DECREF(dict); - return NULL; - } - - // The proxy holds a strong reference to dict; we keep our own as well so - // that the dict is reachable for direct mutation in the rebuild path - // and the proxy survives at least as long as the dict. - NB_SNAPSHOT_CACHE = dict; - NB_SNAPSHOT_PROXY = proxy; - NB_SNAPSHOT_VERSION = built_version; - NB_VERSION_CHECKED = true; - Py_INCREF(proxy); - return proxy; - -fail_deferred: - for (int i = 0; i < deferred_count; i++) { - Py_DECREF(deferred_keys[i]); - Py_DECREF(deferred_bytes[i]); - if (keepalive_pins[i] != NULL) { - for (int k = 0; k < keepalive_counts[i]; k++) { - COWN_DECREF(keepalive_pins[i][k]); - } - PyMem_RawFree(keepalive_pins[i]); - keepalive_pins[i] = NULL; - } - } - Py_DECREF(dict); - return NULL; + return noticeboard_snapshot(BOC_STATE->loads); } /// @brief Clear all noticeboard entries and free their XIData and pins @@ -1675,53 +555,7 @@ static PyObject *_core_noticeboard_clear(PyObject *self, return NULL; } - // Collect entries to free after releasing the mutex — XIDATA_FREE and - // COWN_DECREF may invoke Python __del__ which could re-enter the - // noticeboard. - XIDATA_T *to_free[NB_MAX_ENTRIES]; - BOCCown **to_unpin[NB_MAX_ENTRIES]; - int to_unpin_count[NB_MAX_ENTRIES]; - int to_free_count = 0; - int to_unpin_entries = 0; - - mtx_lock(&NB.mutex); - - for (int i = 0; i < NB.count; i++) { - if (NB.entries[i].value != NULL) { - to_free[to_free_count++] = NB.entries[i].value; - NB.entries[i].value = NULL; - } - if (NB.entries[i].pinned_cowns != NULL) { - to_unpin[to_unpin_entries] = NB.entries[i].pinned_cowns; - to_unpin_count[to_unpin_entries] = NB.entries[i].pinned_count; - to_unpin_entries++; - NB.entries[i].pinned_cowns = NULL; - NB.entries[i].pinned_count = 0; - } - } - NB.count = 0; - memset(NB.entries, 0, sizeof(NB.entries)); - - // Bump the version under mutex; see noticeboard_write_direct for - // rationale. - atomic_fetch_add(&NB_VERSION, 1); - - mtx_unlock(&NB.mutex); - - for (int i = 0; i < to_free_count; i++) { - XIDATA_FREE(to_free[i]); - } - for (int i = 0; i < to_unpin_entries; i++) { - for (int j = 0; j < to_unpin_count[i]; j++) { - COWN_DECREF(to_unpin[i][j]); - } - PyMem_RawFree(to_unpin[i]); - } - - // Drop this thread's cache so a subsequent runtime cycle does not - // reuse a stale proxy. Other threads will revalidate via NB_VERSION. 
- nb_drop_local_cache(); - + noticeboard_clear(); Py_RETURN_NONE; } @@ -1742,7 +576,7 @@ static PyObject *_core_noticeboard_delete(PyObject *self, PyObject *args) { return NULL; } - if (nb_check_noticeboard_thread("noticeboard_delete") < 0) { + if (noticeboard_check_thread("noticeboard_delete") < 0) { return NULL; } @@ -1753,64 +587,11 @@ static PyObject *_core_noticeboard_delete(PyObject *self, PyObject *args) { return NULL; } - if (key_len >= NB_KEY_SIZE) { - PyErr_SetString(PyExc_ValueError, - "noticeboard key too long (max 63 UTF-8 bytes)"); + if (noticeboard_delete(key, key_len) < 0) { return NULL; } - if (memchr(key, '\0', key_len) != NULL) { - PyErr_SetString(PyExc_ValueError, - "noticeboard key must not contain NUL characters"); - return NULL; - } - - mtx_lock(&NB.mutex); - - int found = -1; - for (int i = 0; i < NB.count; i++) { - if (strncmp(NB.entries[i].key, key, NB_KEY_SIZE) == 0) { - found = i; - break; - } - } - - // Stash the entry's XIData and pins to free after releasing the mutex. - XIDATA_T *deleted_value = NULL; - BOCCown **deleted_pins = NULL; - int deleted_pin_count = 0; - - if (found >= 0) { - deleted_value = NB.entries[found].value; - deleted_pins = NB.entries[found].pinned_cowns; - deleted_pin_count = NB.entries[found].pinned_count; - - // shift remaining entries down - for (int i = found; i < NB.count - 1; i++) { - NB.entries[i] = NB.entries[i + 1]; - } - - // clear the last slot and decrement - memset(&NB.entries[NB.count - 1], 0, sizeof(NoticeboardEntry)); - NB.count--; - - // Bump the version under mutex; see noticeboard_write_direct. - atomic_fetch_add(&NB_VERSION, 1); - } - - mtx_unlock(&NB.mutex); - - if (deleted_value != NULL) { - XIDATA_FREE(deleted_value); - } - if (deleted_pins != NULL) { - for (int i = 0; i < deleted_pin_count; i++) { - COWN_DECREF(deleted_pins[i]); - } - PyMem_RawFree(deleted_pins); - } - - // Note: this thread's NB_SNAPSHOT_CACHE is intentionally NOT cleared; + // Note: this thread's snapshot cache is intentionally NOT cleared; // the no-polling invariant applies equally to deletes. Py_RETURN_NONE; @@ -1829,9 +610,7 @@ static PyObject *_core_noticeboard_delete(PyObject *self, PyObject *args) { static PyObject *_core_noticeboard_cache_clear(PyObject *self, PyObject *Py_UNUSED(args)) { BOC_STATE_SET(self); - - NB_VERSION_CHECKED = false; - + noticeboard_cache_clear_for_behavior(); Py_RETURN_NONE; } @@ -1849,7 +628,7 @@ static PyObject *_core_noticeboard_cache_clear(PyObject *self, static PyObject *_core_noticeboard_version(PyObject *self, PyObject *Py_UNUSED(args)) { BOC_STATE_SET(self); - return PyLong_FromLongLong((long long)atomic_load(&NB_VERSION)); + return PyLong_FromLongLong((long long)noticeboard_version()); } /// @brief Register the calling thread as the noticeboard mutator thread @@ -1871,17 +650,7 @@ static PyObject *_core_set_noticeboard_thread(PyObject *self, "interpreter"); return NULL; } - uintptr_t tid = (uintptr_t)PyThread_get_thread_ident(); - // One-shot per runtime: refuse if the slot is already owned. - // clear_noticeboard_thread() resets NB_NOTICEBOARD_TID to 0 at stop(), - // so a fresh start() cycle is fine. This closes the hijack-the- - // mutator-slot hole identified by the security lens. 
- intptr_t expected = 0; - if (!atomic_compare_exchange_strong_intptr(&NB_NOTICEBOARD_TID, &expected, - (intptr_t)tid)) { - PyErr_SetString(PyExc_RuntimeError, - "set_noticeboard_thread: noticeboard mutator thread " - "is already registered"); + if (noticeboard_set_thread() < 0) { return NULL; } Py_RETURN_NONE; @@ -1904,7 +673,7 @@ static PyObject *_core_clear_noticeboard_thread(PyObject *self, "primary interpreter"); return NULL; } - (void)atomic_exchange_intptr(&NB_NOTICEBOARD_TID, (intptr_t)0); + noticeboard_clear_thread(); Py_RETURN_NONE; } @@ -1919,8 +688,7 @@ static PyObject *_core_clear_noticeboard_thread(PyObject *self, static PyObject *_core_notice_sync_request(PyObject *self, PyObject *Py_UNUSED(args)) { BOC_STATE_SET(self); - int_least64_t seq = atomic_fetch_add(&NB_SYNC_REQUESTED, 1) + 1; - return PyLong_FromLongLong((long long)seq); + return PyLong_FromLongLong((long long)notice_sync_request()); } /// @brief Mark a notice_sync sequence as processed and wake waiters. @@ -1944,20 +712,7 @@ static PyObject *_core_notice_sync_complete(PyObject *self, PyObject *args) { return NULL; } - Py_BEGIN_ALLOW_THREADS mtx_lock(&NB_SYNC_MUTEX); - // Defense in depth: with a single noticeboard thread draining the - // FIFO boc_noticeboard tag, `seq` arrives strictly monotonically and - // a plain `atomic_store(seq)` would be correct. We keep the max-of - // pattern so that if a future change introduces a second mutator - // thread or any out-of-order delivery, NB_SYNC_PROCESSED can never - // regress and unblock waiters early. Both load and store happen under - // NB_SYNC_MUTEX (the only writer is here), so this is not a TOCTOU. - int_least64_t cur = atomic_load(&NB_SYNC_PROCESSED); - if ((int_least64_t)seq > cur) { - atomic_store(&NB_SYNC_PROCESSED, (int_least64_t)seq); - } - cnd_broadcast(&NB_SYNC_COND); - mtx_unlock(&NB_SYNC_MUTEX); + Py_BEGIN_ALLOW_THREADS notice_sync_complete((int_least64_t)seq); Py_END_ALLOW_THREADS Py_RETURN_NONE; @@ -1979,34 +734,26 @@ static PyObject *_core_notice_sync_wait(PyObject *self, PyObject *args) { return NULL; } - bool do_timeout = false; - double end_time = 0.0; - if (timeout_obj != Py_None) { - double timeout = PyFloat_AsDouble(timeout_obj); + bool wait_forever = false; + double timeout = 0.0; + if (timeout_obj == Py_None) { + wait_forever = true; + } else { + timeout = PyFloat_AsDouble(timeout_obj); if (timeout == -1.0 && PyErr_Occurred()) { return NULL; } - if (timeout >= 0.0) { - do_timeout = true; - end_time = boc_now_s() + timeout; + // Boundary validation: rejects NaN as ValueError, maps +Inf to + // wait_forever, clamps negatives to 0. Centralised so future + // wait entry points can reuse it. 
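+      //
+      // A minimal sketch of the helper's assumed shape (presumably it sits
+      // alongside the other portability helpers in compat.{h,c}; names and
+      // messages here are illustrative only):
+      //
+      //   int boc_validate_finite_timeout(double in, double *out,
+      //                                   bool *forever) {
+      //     if (isnan(in)) {
+      //       PyErr_SetString(PyExc_ValueError, "timeout must not be NaN");
+      //       return -1;
+      //     }
+      //     if (isinf(in) && in > 0.0) { *forever = true; return 0; }
+      //     *out = (in < 0.0) ? 0.0 : in;
+      //     return 0;
+      //   }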
+ if (boc_validate_finite_timeout(timeout, &timeout, &wait_forever) < 0) { + return NULL; } } - bool ok = true; - Py_BEGIN_ALLOW_THREADS mtx_lock(&NB_SYNC_MUTEX); - while (atomic_load(&NB_SYNC_PROCESSED) < (int_least64_t)my_seq) { - if (do_timeout) { - double now = boc_now_s(); - if (now >= end_time) { - ok = false; - break; - } - cnd_timedwait_s(&NB_SYNC_COND, &NB_SYNC_MUTEX, end_time - now); - } else { - cnd_wait(&NB_SYNC_COND, &NB_SYNC_MUTEX); - } - } - mtx_unlock(&NB_SYNC_MUTEX); + bool ok; + Py_BEGIN_ALLOW_THREADS ok = + notice_sync_wait((int_least64_t)my_seq, timeout, wait_forever); Py_END_ALLOW_THREADS if (ok) { @@ -2035,20 +782,7 @@ static PyObject *_core_notice_sync_wait(PyObject *self, PyObject *args) { static PyObject *_core_terminator_inc(PyObject *self, PyObject *Py_UNUSED(args)) { BOC_STATE_SET(self); - if (atomic_load(&TERMINATOR_CLOSED)) { - return PyLong_FromLongLong(-1); - } - int_least64_t newval = atomic_fetch_add(&TERMINATOR_COUNT, 1) + 1; - if (atomic_load(&TERMINATOR_CLOSED)) { - int_least64_t after = atomic_fetch_add(&TERMINATOR_COUNT, -1) - 1; - if (after == 0) { - mtx_lock(&TERMINATOR_MUTEX); - cnd_broadcast(&TERMINATOR_COND); - mtx_unlock(&TERMINATOR_MUTEX); - } - return PyLong_FromLongLong(-1); - } - return PyLong_FromLongLong((long long)newval); + return PyLong_FromLongLong((long long)terminator_inc()); } /// @brief Decrement the terminator. Wakes terminator_wait on 0-transition. @@ -2058,13 +792,7 @@ static PyObject *_core_terminator_inc(PyObject *self, static PyObject *_core_terminator_dec(PyObject *self, PyObject *Py_UNUSED(args)) { BOC_STATE_SET(self); - int_least64_t newval = atomic_fetch_add(&TERMINATOR_COUNT, -1) - 1; - if (newval == 0) { - mtx_lock(&TERMINATOR_MUTEX); - cnd_broadcast(&TERMINATOR_COND); - mtx_unlock(&TERMINATOR_MUTEX); - } - return PyLong_FromLongLong((long long)newval); + return PyLong_FromLongLong((long long)terminator_dec()); } /// @brief Set the closed bit. Future terminator_inc() calls return -1. @@ -2080,7 +808,7 @@ static PyObject *_core_terminator_close(PyObject *self, "interpreter"); return NULL; } - atomic_store(&TERMINATOR_CLOSED, 1); + terminator_close(); Py_RETURN_NONE; } @@ -2097,34 +825,25 @@ static PyObject *_core_terminator_wait(PyObject *self, PyObject *args) { return NULL; } - bool do_timeout = false; - double end_time = 0.0; - if (timeout_obj != Py_None) { - double timeout = PyFloat_AsDouble(timeout_obj); + bool wait_forever = false; + double timeout = 0.0; + if (timeout_obj == Py_None) { + wait_forever = true; + } else { + timeout = PyFloat_AsDouble(timeout_obj); if (timeout == -1.0 && PyErr_Occurred()) { return NULL; } - if (timeout >= 0.0) { - do_timeout = true; - end_time = boc_now_s() + timeout; + // Boundary validation: rejects NaN as ValueError, maps +Inf to + // wait_forever, clamps negatives to 0. Centralised so future + // wait entry points can reuse it. 
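+      //
+      // Behavioural note vs the removed code below: a negative timeout
+      // previously fell through to an untimed cnd_wait (wait-forever);
+      // boc_validate_finite_timeout instead clamps it to 0, i.e. a single
+      // non-blocking check of the predicate.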
+ if (boc_validate_finite_timeout(timeout, &timeout, &wait_forever) < 0) { + return NULL; } } - bool ok = true; - Py_BEGIN_ALLOW_THREADS mtx_lock(&TERMINATOR_MUTEX); - while (atomic_load(&TERMINATOR_COUNT) != 0) { - if (do_timeout) { - double now = boc_now_s(); - if (now >= end_time) { - ok = false; - break; - } - cnd_timedwait_s(&TERMINATOR_COND, &TERMINATOR_MUTEX, end_time - now); - } else { - cnd_wait(&TERMINATOR_COND, &TERMINATOR_MUTEX); - } - } - mtx_unlock(&TERMINATOR_MUTEX); + bool ok; + Py_BEGIN_ALLOW_THREADS ok = terminator_wait(timeout, wait_forever); Py_END_ALLOW_THREADS if (ok) { @@ -2150,14 +869,7 @@ static PyObject *_core_terminator_seed_dec(PyObject *self, "interpreter"); return NULL; } - int_least64_t prev = atomic_exchange(&TERMINATOR_SEEDED, 0); - if (prev == 1) { - int_least64_t newval = atomic_fetch_add(&TERMINATOR_COUNT, -1) - 1; - if (newval == 0) { - mtx_lock(&TERMINATOR_MUTEX); - cnd_broadcast(&TERMINATOR_COND); - mtx_unlock(&TERMINATOR_MUTEX); - } + if (terminator_seed_dec()) { Py_RETURN_TRUE; } Py_RETURN_FALSE; @@ -2182,21 +894,9 @@ static PyObject *_core_terminator_reset(PyObject *self, "interpreter"); return NULL; } - // Fence: raise the closed bit before we touch anything else so any - // stray thread still holding a reference to the previous runtime - // (e.g. a late whencall call) is refused by terminator_inc rather - // than slipping a new behavior past the reset boundary. We clear - // the bit again at the end, once the new COUNT/SEEDED values have - // been published, so a fresh start() sees closed=0. - atomic_store(&TERMINATOR_CLOSED, 1); - mtx_lock(&TERMINATOR_MUTEX); - int_least64_t prior_count = atomic_load(&TERMINATOR_COUNT); - int_least64_t prior_seeded = atomic_load(&TERMINATOR_SEEDED); - atomic_store(&TERMINATOR_COUNT, 1); - atomic_store(&TERMINATOR_SEEDED, 1); - atomic_store(&TERMINATOR_CLOSED, 0); - cnd_broadcast(&TERMINATOR_COND); - mtx_unlock(&TERMINATOR_MUTEX); + int_least64_t prior_count = 0; + int_least64_t prior_seeded = 0; + terminator_reset(&prior_count, &prior_seeded); return Py_BuildValue("(LL)", (long long)prior_count, (long long)prior_seeded); } @@ -2207,7 +907,7 @@ static PyObject *_core_terminator_reset(PyObject *self, static PyObject *_core_terminator_seeded(PyObject *self, PyObject *Py_UNUSED(args)) { BOC_STATE_SET(self); - return PyLong_FromLongLong((long long)atomic_load(&TERMINATOR_SEEDED)); + return PyLong_FromLongLong((long long)terminator_seeded()); } /// @brief Read the current terminator count (for reconciliation tests). @@ -2217,7 +917,7 @@ static PyObject *_core_terminator_seeded(PyObject *self, static PyObject *_core_terminator_count(PyObject *self, PyObject *Py_UNUSED(args)) { BOC_STATE_SET(self); - return PyLong_FromLongLong((long long)atomic_load(&TERMINATOR_COUNT)); + return PyLong_FromLongLong((long long)terminator_count()); } /// @details This can be safely referenced and used from multiple processes. @@ -2236,8 +936,13 @@ typedef struct boc_cown { BOCRecycleQueue *recycle_queue; /// @brief The ID of the interpreter that currently has acquired this cown. atomic_int_least64_t owner; - /// @brief The last behavior which needs to acquire this cown - atomic_intptr_t last; // (BOCBehavior *) + /// @brief The last request enqueued on this cown's MCS chain. + /// @details Stores @c (BOCRequest *) (matching Verona's + /// @c Slot* in @c boc/cown.h). Updated by + /// @c request_start_enqueue_inner via @c atomic_exchange on the + /// 2PL link path; read by successors to discover their + /// predecessor. 
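+  ///
+  /// A minimal sketch of the assumed link step (the @c next field name on
+  /// BOCRequest is illustrative; the real code is
+  /// @c request_start_enqueue_inner):
+  /// @code
+  ///   BOCRequest *prev =
+  ///       (BOCRequest *)atomic_exchange(&cown->last, (intptr_t)req);
+  ///   if (prev == NULL) {
+  ///     behavior_resolve_one(req->behavior);  // chain head: cown was free
+  ///   } else {
+  ///     prev->next = req;  // park behind the predecessor until it releases
+  ///   }
+  /// @endcode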
+  atomic_intptr_t last; // (BOCRequest *)
   /// @brief Atomic reference count for the cown
   atomic_int_least64_t rc;
   /// @brief Atomic weak reference count for the cown
@@ -2291,7 +996,18 @@ static void BOCRecycleQueue_enqueue(BOCRecycleQueue *queue, XIDATA_T *xidata);
 /// @brief Atomic decref for the cown
 /// @param cown the cown to decref
 /// @return the new reference count
-static int_least64_t cown_decref(BOCCown *cown) {
+// Within this TU we want every COWN_INCREF / COWN_DECREF callsite below
+// to inline directly into its caller — losing that on the schedule /
+// release hot path costs measurable throughput. Mirror CPython's
+// Py_INCREF (inline header macro) vs _Py_IncRef (out-of-line ABI export)
+// pattern: keep `static inline` bodies as the in-TU implementation,
+// expose extern wrappers under the names declared in `cown.h` for
+// noticeboard.c, and override the macros from cown.h to bind locally to
+// the inline versions. The one earlier callsite (the write_direct error
+// rollback above this point) is on an error path and stays bound to the
+// extern wrapper from cown.h — not hot.
+
+static inline int_least64_t cown_decref_inline(BOCCown *cown) {
   int_least64_t rc = atomic_fetch_add(&cown->rc, -1) - 1;
   PRINTDBG("cown_decref(%p, cid=%" PRIdLEAST64 ") = %" PRIdLEAST64 "\n", cown,
            cown->id, rc);
@@ -2320,18 +1036,32 @@
   return 0;
 }
 
+/// @brief Out-of-line export consumed by other TUs (see @ref cown.h).
+int_least64_t cown_decref(BOCCown *cown) { return cown_decref_inline(cown); }
+
 #define COWN_WEAK_DECREF(c) cown_weak_decref(c)
 
 /// @brief Atomic incref for the cown
 /// @param cown the cown to incref
 /// @return the new reference count
-static int_least64_t cown_incref(BOCCown *cown) {
+static inline int_least64_t cown_incref_inline(BOCCown *cown) {
   int_least64_t rc = atomic_fetch_add(&cown->rc, 1) + 1;
   PRINTDBG("cown_incref(%p, cid=%" PRIdLEAST64 ") = %" PRIdLEAST64 "\n", cown,
            cown->id, rc);
   return rc;
 }
 
+/// @brief Out-of-line export consumed by other TUs (see @ref cown.h).
+int_least64_t cown_incref(BOCCown *cown) { return cown_incref_inline(cown); }
+
+// Rebind COWN_INCREF / COWN_DECREF to the inline forms so every
+// remaining callsite below (acquire/release/dispatch hot paths) does
+// not pay an indirect call.
+#undef COWN_INCREF
+#undef COWN_DECREF
+#define COWN_INCREF(c) cown_incref_inline((c))
+#define COWN_DECREF(c) cown_decref_inline((c))
+
 static inline int_least64_t cown_weak_incref(BOCCown *cown) {
   int_least64_t rc = atomic_fetch_add(&cown->weak_rc, 1) + 1;
   PRINTDBG("cown_weak_incref(%p, cid=%" PRIdLEAST64 ") = %" PRIdLEAST64 "\n",
@@ -2743,6 +1473,13 @@ static PyObject *CownCapsule_acquired(PyObject *op,
 }
 
 /// @brief Attempts to acquire the cown
+/// @note A -1 return never leaves the owner field half-updated: when
+/// deserialisation fails after the CAS succeeded, the owner is rolled back
+/// to NO_OWNER; when the CAS itself fails, the owner was never modified and
+/// still names the actual owning interpreter. Callers can therefore rely on
+/// the invariant that a failed acquire never leaves the cown in a
+/// half-acquired (owner=me, value=NULL, xidata non-NULL) state. This is
+/// required by the worker-side recovery arm in `worker.run_behavior`, which
+/// calls `behavior.release()` after an acquire failure.
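+///
+/// A sketch of the recovery sequence this invariant enables (worker-side
+/// shape only; the call spelling below is illustrative, the real arm lives
+/// in worker.run_behavior):
+/// @code
+///   if (cown_acquire(cown) < 0) {
+///     // owner is NO_OWNER again, or still the real owning interpreter:
+///     // releasing everything the behavior did acquire is now safe.
+///     behavior_release_all(behavior);
+///   }
+/// @endcode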
/// @param cown The cown to acquire
-
 /// @return -1 if failure, 0 if success
 static int cown_acquire(BOCCown *cown) {
@@ -2766,6 +1503,13 @@ static int cown_acquire(BOCCown *cown) {
 
   cown->value = xidata_to_object(cown->xidata, cown->pickled);
   if (cown->value == NULL) {
+    // Deserialisation failed. We CAS'd owner from NO_OWNER to desired above,
+    // so we must roll it back; otherwise the cown is permanently stuck in a
+    // (owner=me, value=NULL, xidata non-NULL) half-acquired state and any
+    // future acquire from any interpreter (including the worker-side
+    // recovery arm) sees "already acquired by N" instead of being able to
+    // retry. xidata stays in place for a future retry.
+    atomic_store(&cown->owner, (int_least64_t)NO_OWNER);
     return -1;
   }
 
@@ -3219,8 +1963,17 @@ static int _cown_shared(
 }
 
 /// @brief Frees a message
+/// @details Releases @c message->tag (an owning reference taken by
+/// @c boc_message_new) and any pending xidata, then frees the message
+/// struct itself. Safe to call on a partially-initialized message:
+/// @c boc_message_new zero-fills the allocation, so any field that
+/// has not yet been assigned reads back as NULL and the corresponding
+/// TAG_DECREF / xidata recycle arms are skipped.
 /// @param message The message to free
 static void boc_message_free(BOCMessage *message) {
+  if (message->tag != NULL) {
+    TAG_DECREF(message->tag);
+  }
   if (message->xidata != NULL) {
     BOCRecycleQueue_enqueue(message->recycle_queue, message->xidata);
   }
@@ -3426,21 +2179,58 @@ static BOCQueue *get_queue_for_tag(PyObject *tag) {
     // check to see if another interpreter has used this queue
     int_least64_t expected = BOC_QUEUE_UNASSIGNED;
     int_least64_t desired = BOC_QUEUE_ASSIGNED;
-    if (atomic_compare_exchange_strong(&qptr->state, &expected, desired)) {
-      // we're the first, this is the new dedicated queue for this tag
-      PRINTDBG("Assigning ");
-      PRINTOBJDBG(tag);
-      PRINTFDBG(" to queue %zu\n", i);
-      BOCTag *qtag = tag_from_PyUnicode(tag, qptr);
-      if (qtag == NULL) {
+    // Pre-check the slot state with a non-allocating load before
+    // committing to a `tag_from_PyUnicode` allocation. Iterating
+    // across many already-ASSIGNED slots while looking for the
+    // dedicated queue of a new tag must NOT allocate per iteration:
+    // the CAS would fail on every ASSIGNED slot and the speculative
+    // tag would immediately be `TAG_DECREF`'d, turning a cold-start
+    // queue scan into O(BOC_QUEUE_COUNT) malloc/free pairs.
+    //
+    // Only attempt the publish-before-CAS allocation when the slot
+    // is actually UNASSIGNED. The CAS that follows is still needed
+    // to win the slot against a racing peer; on CAS loss we
+    // TAG_DECREF the speculative tag and fall through to the
+    // discovery branch below exactly as the prior code did.
+    int_least64_t observed = atomic_load_intptr(&qptr->state);
+    if (observed == BOC_QUEUE_UNASSIGNED) {
+      // Allocate the tag *before* the CAS so that an allocation failure
+      // (UTF-8 error / OOM in tag_from_PyUnicode) leaves the slot in
+      // BOC_QUEUE_UNASSIGNED — peer interpreters can re-attempt and we
+      // never publish ASSIGNED-with-NULL-tag (which would wedge readers
+      // in the busy-wait below). The new tag arrives with rc=1; on CAS
+      // loss we TAG_DECREF it (the slot is owned by some other peer
+      // who is responsible for publishing their own tag). 
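+      //
+      // Refcount ledger for a published tag, summarising the ownership
+      // comments in this function (counts shown are illustrative):
+      //   rc=1  the queue's owning reference, taken by tag_from_PyUnicode
+      //   rc=2  after this interpreter caches it in BOC_STATE->queue_tags[i]
+      //   +1    for every in-flight message carrying the tag
+      //         (boc_message_new / boc_message_free)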
+ BOCTag *new_tag = tag_from_PyUnicode(tag, qptr); + if (new_tag == NULL) { return NULL; } + if (atomic_compare_exchange_strong(&qptr->state, &expected, desired)) { + // we're the first, this is the new dedicated queue for this tag + PRINTDBG("Assigning "); + PRINTOBJDBG(tag); + PRINTFDBG(" to queue %zu\n", i); + // Publish the tag pointer with release semantics so the busy-wait + // below sees the non-NULL tag after observing ASSIGNED. The tag + // already has rc=1 (queue's owning reference). We then add the + // per-interpreter cache reference (rc=2). This replaces the prior + // rc=0-then-double-INCREF idiom whose incref window allowed a + // racing TAG_DECREF to free a freshly published tag. + atomic_store_intptr(&qptr->tag, (intptr_t)new_tag); + BOC_STATE->queue_tags[i] = new_tag; + TAG_INCREF(new_tag); + return qptr; + } - atomic_store_intptr(&qptr->tag, (intptr_t)qtag); - TAG_INCREF(qtag); - BOC_STATE->queue_tags[i] = qtag; - TAG_INCREF(qtag); - return qptr; + // CAS lost — another interpreter assigned this slot first. Release + // our speculative allocation; we'll fall through to the post-CAS + // discovery branch below to pick up the winner's tag. + TAG_DECREF(new_tag); + } else { + // Slot was already ASSIGNED (or DISABLED) when we looked. Mirror + // the post-CAS-failure exit values so the discovery branch below + // sees the same `expected` it would have gotten from a failed CAS. + expected = observed; } // this queue has already been assigned @@ -3455,6 +2245,8 @@ static BOCQueue *get_queue_for_tag(PyObject *tag) { qtag = (BOCTag *)atomic_load_intptr(&qptr->tag); } + // Discovery path: the qptr->tag pointer is owned by the publisher's + // queue reference. Add a per-interpreter cache reference. BOC_STATE->queue_tags[i] = qtag; TAG_INCREF(qtag); @@ -3496,7 +2288,14 @@ static BOCQueue *get_queue_for_tag(PyObject *tag) { /// @param contents The contents of the message. /// @return A message object static BOCMessage *boc_message_new(PyObject *tag, PyObject *contents) { - BOCMessage *message = (BOCMessage *)PyMem_RawMalloc(sizeof(BOCMessage)); + // Zero-init so any later boc_message_free on a partially-built + // message sees NULL for `tag`, `xidata`, and `recycle_queue` and + // safely no-ops the TAG_DECREF / BOCRecycleQueue_enqueue arms. + // Without this, callers must remember to PyMem_RawFree (rather + // than boc_message_free) on every early-error path that occurs + // before the explicit field assignments below — an invariant + // that is easy to break when adding new failure points. + BOCMessage *message = (BOCMessage *)PyMem_RawCalloc(1, sizeof(BOCMessage)); if (message == NULL) { PyErr_NoMemory(); return NULL; @@ -3505,17 +2304,34 @@ static BOCMessage *boc_message_new(PyObject *tag, PyObject *contents) { BOCQueue *qptr = get_queue_for_tag(tag); if (qptr == NULL) { PyMem_RawFree(message); - PyErr_Format(PyExc_KeyError, - "No queue available for tag %R: tag capacity exceeded", tag); + // Only set the capacity-exhaustion KeyError if get_queue_for_tag + // did not already raise (e.g. UnicodeEncodeError on surrogates, + // PyMem_RawMalloc OOM in tag_from_PyUnicode). Overwriting a + // pending exception masks the true failure cause. + if (!PyErr_Occurred()) { + PyErr_Format(PyExc_KeyError, + "No queue available for tag %R: tag capacity exceeded", tag); + } return NULL; } BOCTag *qtag = (BOCTag *)atomic_load_intptr(&qptr->tag); if (qtag == NULL) { - // non-assigned tag + // non-assigned tag — allocate one for this message. 
The new tag + // arrives with rc=1; ownership transfers to message->tag and is + // released by boc_message_free. message->tag = tag_from_PyUnicode(tag, qptr); + if (message->tag == NULL) { + PyMem_RawFree(message); + return NULL; + } } else { + // qtag is owned by qptr->tag (publisher's queue reference). Take + // a separate owning reference for message->tag so a concurrent + // set_tags that swaps qptr->tag and tag_disables the old one + // does not free it out from under us. message->tag = qtag; + TAG_INCREF(message->tag); } message->recycle_queue = BOC_STATE->recycle_queue; @@ -3540,13 +2356,13 @@ static BOCMessage *boc_message_new(PyObject *tag, PyObject *contents) { } /// @brief Enqueues a message. -/// @details The @c boc_worker queue is a fixed-capacity ring +/// @details Each tag's message queue is a fixed-capacity ring /// (@c BOC_CAPACITY = 16384 slots). Reaching that bound requires more -/// than 16k behaviors to be simultaneously runnable but not yet picked -/// up by any worker -- in practice, only a producer scheduling against -/// many disjoint cowns far faster than every worker can drain. MCS -/// chaining keeps behaviors that share a cown out of the queue until -/// their predecessor releases, so chains do not exhaust capacity. +/// than 16k messages on a single tag to be queued without any +/// consumer draining -- in practice this only happens for a tag +/// where producers vastly outpace consumers. Behaviour dispatch +/// does not go through a tag at all (it routes through per-worker +/// queues in @c sched.c). /// /// On overflow this returns -1 without setting a Python exception; the /// caller (typically @c behavior_resolve_one) reports the error. Once @@ -3584,6 +2400,8 @@ static int boc_enqueue(BOCMessage *message) { assert(qptr->messages[tail % BOC_CAPACITY] == NULL); qptr->messages[tail % BOC_CAPACITY] = message; + boc_atomic_fetch_add_u64_explicit(&qptr->pushed_total, 1, BOC_MO_RELAXED); + // If any receiver is parked on this queue's condvar, wake it. // The seq_cst load synchronizes with the consumer's seq_cst increment // of waiters, ensuring that either we see the waiter and signal, or the @@ -3598,6 +2416,8 @@ static int boc_enqueue(BOCMessage *message) { } // someone else got there first, try again + boc_atomic_fetch_add_u64_explicit(&qptr->enqueue_cas_retries, 1, + BOC_MO_RELAXED); } return -1; @@ -3641,6 +2461,8 @@ static int_least64_t boc_dequeue(PyObject *tag, BOCMessage **message) { PRINTDBG("Unable to dequeue at head=%" PRIdLEAST64 "\n", head); // someone else already consumed this, try again + boc_atomic_fetch_add_u64_explicit(&qptr->dequeue_cas_retries, 1, + BOC_MO_RELAXED); tail = atomic_load(&qptr->tail); continue; } @@ -3654,6 +2476,7 @@ static int_least64_t boc_dequeue(PyObject *tag, BOCMessage **message) { *message = qptr->messages[index]; qptr->messages[index] = NULL; + boc_atomic_fetch_add_u64_explicit(&qptr->popped_total, 1, BOC_MO_RELAXED); PRINTFDBG("Dequeued %s from q%" PRIdLEAST64 "[%" PRIdLEAST64 "] (%" PRIdLEAST64 " - %" PRIdLEAST64 " = %" PRIdLEAST64 ")\n", (*message)->tag->str, qptr->index, head, tail, head + 1, @@ -3664,26 +2487,6 @@ static int_least64_t boc_dequeue(PyObject *tag, BOCMessage **message) { return -1; } -/// @brief Returns the current time as double-precision seconds. 
-/// @return the current time -static double boc_now_s() { - const double S_PER_NS = 1.0e-9; - struct timespec ts; - // Prefer clock_gettime on POSIX: timespec_get requires macOS 10.15+ while - // Python's default macOS deployment target is older, producing an - // -Wunguarded-availability-new warning. clock_gettime has been available on - // macOS since 10.12. Windows UCRT provides timespec_get but not - // clock_gettime, so fall back there. -#ifdef _WIN32 - timespec_get(&ts, TIME_UTC); -#else - clock_gettime(CLOCK_REALTIME, &ts); -#endif - double time = (double)ts.tv_sec; - time += ts.tv_nsec * S_PER_NS; - return time; -} - /// @brief Sends a message /// @param module The _core module /// @param args The message to send @@ -4147,15 +2950,48 @@ typedef struct behavior_s { struct boc_request **requests; /// @brief Number of entries in @c requests (post-dedup, ≤ args_size + 1). Py_ssize_t requests_size; - /// @brief Pre-built dispatch message for the BehaviorCapsule. - /// @details Allocated by behavior_prepare_start before the 2PL link loop, - /// claimed by the unique caller that observes @c count → 0 inside - /// behavior_resolve_one. Targets @c boc_worker directly with the bare - /// BehaviorCapsule as the payload. Visibility is carried by the acq-rel - /// fetch_sub on @c count — no separate atomic on this field is required. - /// Freed defensively by behavior_free if a behavior is destroyed without - /// dispatching. - struct boc_message *start_message; + /// @brief Intrusive link node for the Verona-style behaviour MPMC + /// queue (`boc_bq_*` API in `sched.{h,c}`). + /// @details Ports `verona-rt/src/rt/sched/work.h::Work::next_in_queue`. + /// Initialised to NULL in @c behavior_new under the GIL, before the + /// behaviour can be reached from any other thread (preserves the + /// link-loop infallibility invariant). Hooked into the `boc_bq_*` + /// enqueue/dequeue path by `behavior_resolve_one` and + /// `request_release_inner`. Placement at struct end is + /// `pahole`-driven to keep the hot fields on their existing cache + /// lines. + boc_bq_node_t bq_node; + /// @brief Fairness-token discriminator. + /// @details 0 for ordinary behaviours; 1 for the per-worker + /// @c token_work sentinel allocated by + /// @ref _core_scheduler_runtime_start. The worker-pop site checks + /// this field on every successful pop; if set, the dispatch path + /// flips @c should_steal_for_fairness on the popping worker and + /// re-enqueues the token instead of calling @c run_behavior. + /// Verona equivalent: @c Core::token_work + @c is_token discriminator + /// (`verona-rt/src/rt/sched/core.h:22-37`). Trailing position keeps + /// the hot fields (count, rc, thunk) on their existing cache lines; + /// the byte costs an 8-byte tail pad on x86_64. + uint8_t is_token; + /// @brief Index of the worker that owns this fairness token (or + /// @c -1 for ordinary behaviours). + /// @details The fairness arm in @ref boc_sched_worker_pop_slow + /// re-enqueues a worker's token from its own @c token_work slot, + /// so the heartbeat needs to land back on the owning worker even + /// when the token was consumed by a thief. The dispatch loop in + /// @ref _core_scheduler_worker_pop reads this field and calls + /// @ref boc_sched_set_steal_flag on the owner — never on the + /// consumer — so the owner's next @c pop_fast routes through + /// @c pop_slow and re-enqueues its own token. 
Verona's + /// equivalent is the captured @c this in @c Closure::make + /// (`core.h:24-32`): the closure body sets the OWNING core's + /// flag, not the running thread's. + /// + /// Width: @c int16_t. Sized to comfortably exceed any plausible + /// worker count (≤32767) while preserving the existing 8-byte + /// trailing pad with @c is_token; struct size is unchanged from + /// the original @c int8_t encoding (verified by pahole). + int16_t owner_worker_index; } BOCBehavior; /// @brief Capsule for holding a pointer to a behavior @@ -4166,6 +3002,18 @@ typedef struct behavior_capsule_object { #define BehaviorCapsule_CheckExact(op) \ Py_IS_TYPE((op), BOC_STATE->behavior_capsule_type) +/// @brief Recover the enclosing @c BOCBehavior from its embedded +/// @c bq_node. +/// @details The dispatch path moves @c BOCBehavior * pointers +/// through the scheduler queue indirectly: the producer hands +/// @c &behavior->bq_node to @ref boc_sched_dispatch, the consumer +/// pops a @c boc_bq_node_t * back, and this macro reverses the +/// embedding offset to recover the owning @c BOCBehavior. Equivalent +/// to the kernel's @c container_of pattern; @c offsetof is the +/// portable C11 idiom. +#define BEHAVIOR_FROM_BQ_NODE(node_ptr) \ + ((BOCBehavior *)((char *)(node_ptr) - offsetof(BOCBehavior, bq_node))) + // Forward declaration: defined alongside the request helpers further down. // behavior_free uses it to clean up any unreleased request array if a // behavior is destroyed without going through behavior_release_all. @@ -4190,7 +3038,16 @@ BOCBehavior *behavior_new() { behavior->captures = NULL; behavior->requests = NULL; behavior->requests_size = 0; - behavior->start_message = NULL; + // Init the boc_bq link before the behaviour becomes reachable from + // any other thread (we are still under the GIL here). The boc_bq_* + // enqueue path requires this field to start NULL. + boc_atomic_store_ptr_explicit(&behavior->bq_node.next_in_queue, NULL, + BOC_MO_RELAXED); + // Ordinary behaviours are not fairness tokens. Token allocation + // is performed directly in `_core_scheduler_runtime_start` and + // bypasses `behavior_new`. + behavior->is_token = 0; + behavior->owner_worker_index = -1; BOC_REF_TRACKING_ADD_BEHAVIOR(); return behavior; @@ -4245,15 +3102,6 @@ void behavior_free(BOCBehavior *behavior) { PyMem_RawFree(behavior->requests); } - if (behavior->start_message != NULL) { - // Defensive cleanup: prepare_start succeeded but the message was - // never claimed (e.g. resolve_one was never called because - // schedule() failed mid-link). Free the unclaimed message — it - // never made it onto the queue, so this is just our private - // allocation. - boc_message_free(behavior->start_message); - } - if (behavior->thunk != NULL) { BOCTag_free(behavior->thunk); } @@ -4426,7 +3274,15 @@ static int BehaviorCapsule_init(PyObject *op, PyObject *args, return -1; } + // PyMem_RawCalloc with nelem == 0 is implementation-defined (may return + // NULL legally), so only treat NULL as failure when args_size > 0. behavior->group_ids = PyMem_RawCalloc((size_t)args_size, sizeof(int)); + if (args_size > 0 && behavior->group_ids == NULL) { + Py_DECREF(cowns); + Py_DECREF(cowns_list_fast); + PyErr_NoMemory(); + return -1; + } for (Py_ssize_t i = 0; i < args_size; ++i) { PyObject *item = PySequence_Fast_GET_ITEM(cowns_list_fast, i); int group_id; @@ -4467,98 +3323,62 @@ static int BehaviorCapsule_init(PyObject *op, PyObject *args, } /// @brief Resolves a single outstanding request for this behavior. 
-/// @details Called when a request is at the head of the queue for a particular -/// cown. If this is the last request, then the thunk is scheduled. The unique -/// caller that observes count -> 0 claims the pre-built start message stashed -/// by behavior_prepare_start and enqueues it. -/// Visibility of the start_message pointer is carried by the acq-rel -/// fetch_sub on count -- the only writer (prepare_start) ran before the link -/// loop began, and only one decrementer can transition to 0. This path -/// performs no allocation and therefore cannot fail past prepare. +/// @details Called when a request is at the head of the queue for a +/// particular cown. If this is the last request (count -> 0) the thunk +/// is dispatched: the unique caller that observes the transition takes +/// a queue-owned reference via @c BEHAVIOR_INCREF and hands +/// @c &behavior->bq_node to @ref boc_sched_dispatch. The matching +/// @c BEHAVIOR_DECREF runs when the consumer's freshly allocated +/// @c BehaviorCapsule (built by @c _core.scheduler_worker_pop) is +/// deallocated on the worker side. /// -/// Returns @c int rather than @c PyObject* so the count > 0 path is -/// pure-atomic and can be invoked from inside a @c Py_BEGIN_ALLOW_THREADS -/// span (no @c Py_RETURN_NONE = no Py_None refcount touch). The only -/// Python-state operation remaining is @c PyErr_SetString on the -/// @c boc_enqueue-full error path; that path requires @c count == 0 which -/// is unreachable mid link-loop because @c BehaviorCapsule_init sizes -/// @c count to @c args_size + 2. Callers that hit the error path must hold -/// the GIL. +/// Visibility of the dispatch is carried by the acq-rel fetch_sub on +/// @c count -- only one decrementer can transition to 0, and the +/// behavior payload (cowns / captures / thunk) was published by +/// @c whencall before the 2PL link loop began. /// -/// If @c boc_enqueue overflows the @c boc_worker ring, this raises -/// @c RuntimeError("Message queue is full"); see @c boc_enqueue for the -/// queue-full failure mode and recovery analysis. +/// **Failure surface.** @ref boc_sched_dispatch can fail when called +/// from the off-worker arm if the runtime has been torn down. On +/// failure the queue-owned BEHAVIOR_INCREF taken just before dispatch +/// is rolled back here, the Python exception set by +/// @c boc_sched_dispatch is propagated, and the caller is expected +/// to roll back its terminator hold (the reference path is +/// @c whencall in @c behaviors.py). +/// +/// **Cown-side residue on dispatch failure.** When the count==0 +/// transition fires here AND @c boc_sched_dispatch returns -1 +/// (runtime-down sentinel; see @c boc_sched_dispatch in @c sched.c), +/// the behavior's BOCRequest array has already been linked onto every +/// target cown's MCS chain by the link/finish 2PL phases. The +/// rollback below DECREFs only the queue-owned BEHAVIOR_INCREF; it +/// does NOT walk and unlink the cown chains. Each request still +/// holds its BEHAVIOR_INCREF, so the BOCBehavior cannot be freed, +/// and no worker will ever call @c release_all on it. Any cown that +/// happens to be linked into this stranded chain remains pinned +/// awaiting a behavior that cannot run, until the next @c bocpy.start +/// cycle (which frees the BOCCown via the GC of its owning Python +/// @c Cown). 
This residue is intentional and only fires on the +/// dying-runtime path; the upstream-detection alternative (an +/// explicit @c scheduler_running check inside @c whencall before the +/// chain link) introduces a TOCTOU window. The dedicated regression +/// is @c test_schedule_after_runtime_stop_raises in +/// @c test_scheduling_stress.py, which exercises this path and +/// itself contributes one stranded chain per test process. /// @param behavior the behavior whose count to decrement -/// @return 0 on success, -1 on error with a Python exception set (caller -/// must hold the GIL on the error path) +/// @return 0 on success, -1 if dispatch failed (Python exception set) static int behavior_resolve_one(BOCBehavior *behavior) { int_least64_t count = atomic_fetch_add(&behavior->count, -1) - 1; if (count == 0) { - BOCMessage *message = behavior->start_message; - behavior->start_message = NULL; - if (message == NULL) { - // Defensive: prepare_start was never called. This should not happen - // on the production path; raise so the failure is loud. - PyErr_SetString(PyExc_RuntimeError, - "behavior_resolve_one: start message not prepared"); + BEHAVIOR_INCREF(behavior); + if (boc_sched_dispatch(&behavior->bq_node) < 0) { + // Roll back the queue-owned reference we just took. The + // dispatch failure means no consumer will ever see this + // behavior, so no DECREF will fire from the worker side. + BEHAVIOR_DECREF(behavior); return -1; } - - if (boc_enqueue(message) < 0) { - boc_message_free(message); - PyErr_SetString(PyExc_RuntimeError, "Message queue is full"); - return -1; - } - } - - return 0; -} - -/// @brief Pre-allocate the dispatch message for the BehaviorCapsule. -/// @details Performs every fallible operation up front so the subsequent 2PL -/// link loop is infallible. On success, the -/// message is stashed on behavior->start_message and consumed by the unique -/// caller that drives behavior->count to 0 in behavior_resolve_one. On -/// failure, no state is published -- the caller (whencall) rolls back the -/// terminator. Dispatch goes directly to @c boc_worker carrying the -/// bare BehaviorCapsule (no @c ("start", ...) tuple, no central scheduler hop). -/// @param behavior The behavior to prepare -/// @return 0 on success, -1 on failure with a Python exception set -static int behavior_prepare_start(BOCBehavior *behavior) { - if (behavior->start_message != NULL) { - PyErr_SetString(PyExc_RuntimeError, "behavior_prepare_start called twice"); - return -1; - } - - // Wrap the BOCBehavior in a fresh BehaviorCapsule. The queue's XIData - // layer will keep this object alive until the message is consumed. - PyTypeObject *type = BOC_STATE->behavior_capsule_type; - BehaviorCapsuleObject *capsule = - (BehaviorCapsuleObject *)type->tp_alloc(type, 0); - if (capsule == NULL) { - return -1; - } - capsule->behavior = behavior; - BEHAVIOR_INCREF(behavior); - - // Dispatch the BehaviorCapsule directly to a worker. Workers match - // ["boc_worker", behavior] and run it. The capsule is the message - // payload; the queue's XIData layer keeps it alive in flight. 
-  PyObject *contents = (PyObject *)capsule;  // borrow the new reference
-  PyObject *tag = PyUnicode_FromString("boc_worker");
-  if (tag == NULL) {
-    Py_DECREF(capsule);
-    return -1;
-  }
-
-  BOCMessage *message = boc_message_new(tag, contents);
-  Py_DECREF(capsule);
-  Py_DECREF(tag);
-  if (message == NULL) {
-    return -1;
   }
 
-  behavior->start_message = message;
   return 0;
 }
 
@@ -4730,13 +3550,16 @@ static PyObject *BehaviorCapsule_release_all(PyObject *op,
   Py_RETURN_NONE;
 }
 
-/// @brief Schedule a behavior: prepare-then-link, infallible past prepare.
-/// @details Two-phase locking entry point that consolidates create_requests,
-/// prepare_start, and the link/finish loops into one C call.
-/// All allocations happen before the first
-/// MCS link op, so failures cannot leave the cown queues in a partial
-/// state. The Python @c Behavior.schedule() collapses to a single call to
-/// this function.
+/// @brief Schedule a behavior: build requests then run the 2PL link loop.
+/// @details Two-phase locking entry point that consolidates
+/// @c create_requests and the link/finish loops into one C call.
+/// All allocations happen before the first MCS link op, so failures
+/// cannot leave the cown queues in a partial state. The Python
+/// @c Behavior.schedule() collapses to a single call to this function.
+/// Dispatch itself (the count → 0 transition in @ref behavior_resolve_one)
+/// is allocation-free: @ref boc_sched_dispatch enqueues @c &behavior->bq_node
+/// directly onto a worker's per-task queue, so there is nothing to pre-build;
+/// its only failure mode is the runtime-down sentinel on @ref behavior_resolve_one.
 /// @param op The BehaviorCapsule to schedule
 /// @return Py_None on success, NULL on error
 static PyObject *BehaviorCapsule_schedule(PyObject *op,
@@ -4755,12 +3578,6 @@ static PyObject *BehaviorCapsule_schedule(PyObject *op,
     Py_DECREF(list);
   }
 
-  // Pre-allocate the start message. From this point onwards the link loop
-  // is infallible: no Python allocation, no callbacks.
-  if (behavior_prepare_start(behavior) < 0) {
-    return NULL;
-  }
-
   BOCRequest **requests = behavior->requests;
   Py_ssize_t n = behavior->requests_size;
 
@@ -4809,6 +3626,11 @@ static PyObject *BehaviorCapsule_schedule(PyObject *op,
   // dispatch waits for the 2PL to complete (see BehaviorCapsule_init).
   // Runs UNDER the GIL: it is the legitimate dispatcher of the start
   // message and may set a Python exception on a queue-full failure.
+  //
+  // If the resolve_one below hits the runtime-down sentinel inside
+  // @ref boc_sched_dispatch, the BOCRequest chains linked above are
+  // intentionally not unwound; see @ref behavior_resolve_one for
+  // the full rationale.
   if (behavior_resolve_one(behavior) < 0) {
     return NULL;
   }
@@ -4836,6 +3658,41 @@ static PyObject *BehaviorCapsule_set_exception(PyObject *op, PyObject *args) {
   Py_RETURN_NONE;
 }
 
+/// @brief Mark a never-executed behavior's result Cown with a drop exception.
+/// @details For behaviors drained during stop() that never had a chance to
+/// run. The result Cown is in the published-and-released state
+/// (owner=NO_OWNER, xidata=set, value=NULL) that ``Cown(None)``'s
+/// constructor leaves it in. Mirrors the worker exception path
+/// (``worker.py``: acquire → set_exception → release) but condensed into
+/// one C call: cown_acquire takes ownership on the main thread, the
+/// exception is stored, then cown_release publishes the pickled value and
+/// hands ownership back to NO_OWNER so a caller awaiting ``cown.value`` /
+/// ``cown.exception`` after stop() sees a clear diagnostic, not a permanent ``None``.
+/// @param op The BehaviorCapsule object +/// @param args The exception value +/// @return Py_None on success, NULL on error +static PyObject *BehaviorCapsule_set_drop_exception(PyObject *op, + PyObject *args) { + PyObject *value = NULL; + + if (!PyArg_ParseTuple(args, "O", &value)) { + return NULL; + } + + BehaviorCapsuleObject *self = (BehaviorCapsuleObject *)op; + BOCBehavior *behavior = self->behavior; + + if (cown_acquire(behavior->result) < 0) { + return NULL; + } + cown_set_value(behavior->result, value); + behavior->result->exception = true; + if (cown_release(behavior->result) < 0) { + return NULL; + } + Py_RETURN_NONE; +} + static int acquire_vars(BOCCown **vars, Py_ssize_t size) { BOCCown **ptr = vars; for (Py_ssize_t i = 0; i < size; ++i, ++ptr) { @@ -5044,6 +3901,8 @@ static PyObject *BehaviorCapsule_execute(PyObject *op, PyObject *args) { static PyMethodDef BehaviorCapsule_methods[] = { {"set_exception", BehaviorCapsule_set_exception, METH_VARARGS, NULL}, + {"set_drop_exception", BehaviorCapsule_set_drop_exception, METH_VARARGS, + NULL}, {"acquire", BehaviorCapsule_acquire, METH_NOARGS, NULL}, {"release", BehaviorCapsule_release, METH_NOARGS, NULL}, {"release_all", BehaviorCapsule_release_all, METH_NOARGS, NULL}, @@ -5208,11 +4067,13 @@ static int request_release_inner(BOCRequest *request) { /// @c request_release_inner helper above is what walks the MCS queue. /// @brief Enqueue body called by @c behavior_schedule. -/// @details Pure C, no Python allocation, no exception. The only failure -/// surface is propagated by behavior_resolve_one (which can fail if the -/// queue is full); we return its NULL/non-NULL via int. Callers that have -/// already pre-allocated the start message via behavior_prepare_start can -/// treat this as infallible from the link-loop perspective. +/// @details Pure C, no Python allocation. The only failure surface +/// is propagated by @ref behavior_resolve_one, which forwards a +/// dispatch failure from @ref boc_sched_dispatch (e.g. the runtime +/// was torn down between the caller's @c terminator_inc and our +/// dispatch). On failure a Python exception is set and the link +/// loop's caller is expected to roll back its terminator hold; +/// see @c whencall in @c behaviors.py. /// @param request The request to enqueue /// @param behavior The behavior owning the request /// @return 0 on success, -1 on error with a Python exception set @@ -5248,7 +4109,17 @@ static int request_start_enqueue_inner(BOCRequest *request, atomic_store_intptr(&prev->next, behavior_ptr); PRINTDBG("request->next = bid=%" PRIdLEAST64 "\n", behavior->id); BEHAVIOR_INCREF(behavior); - // wait for the previous request to be scheduled + // Order note: bocpy stores prev->next BEFORE spinning on + // prev->scheduled, the opposite of Verona's Slot::set_next which + // observes the predecessor's scheduled flag first. The inversion + // is safe because (a) the prev->rc++ above keeps prev alive across + // the window where prev's owning behavior may run release_all + // concurrently once prev->next is published, preventing the UAF + // such ordering would otherwise admit (see the rc-comment block + // above); and (b) the behavior dispatch invariant ensures no + // successor can run user code until ALL its requests have + // completed phase 2 (request_finish_enqueue_inner), so the + // predecessor cannot retire the chain prematurely while we spin. 
while (true) { if (atomic_load(&prev->scheduled)) { break; @@ -5380,10 +4251,13 @@ static PyObject *_core_set_tags(PyObject *module, PyObject *args) { return NULL; } - // assign a new tag + // assign a new tag. tag_from_PyUnicode returned with rc=1, which + // is exactly the queue's owning reference — no extra TAG_INCREF + // is needed here. The previously-installed tag (if any) is + // disabled and released so any in-flight messages still holding + // owning refs to it can complete and free the tag when done. BOCTag *oldtag = (BOCTag *)atomic_exchange_intptr(&qptr->tag, (intptr_t)qtag); - TAG_INCREF(qtag); if (oldtag != NULL) { tag_disable(oldtag); TAG_DECREF(oldtag); @@ -5560,6 +4434,431 @@ static PyObject *_core_cown_pin_pointers(PyObject *module, PyObject *args) { return NULL; } +/// @brief Snapshot the per-worker scheduler counters. +/// @details Returns one dict per worker carrying the @ref +/// boc_sched_stats_t fields, or an empty list when the runtime is +/// down (no workers allocated). Reads are best-effort +/// (memory_order_relaxed): values are monotonic counters, so a torn +/// read can only under-report. +/// @param module The _core module +/// @param Py_UNUSED +/// @return A list of per-worker stats dicts, or NULL on error. +static PyObject *_core_scheduler_stats(PyObject *Py_UNUSED(module), + PyObject *Py_UNUSED(dummy)) { + Py_ssize_t n = boc_sched_worker_count(); + PyObject *result = PyList_New(n); + if (result == NULL) { + return NULL; + } + for (Py_ssize_t i = 0; i < n; ++i) { + boc_sched_stats_t s; + if (boc_sched_stats_snapshot(i, &s) < 0) { + PyErr_SetString(PyExc_RuntimeError, "boc_sched_stats_snapshot failed"); + Py_DECREF(result); + return NULL; + } + PyObject *d = Py_BuildValue( + "{s:n,s:K,s:K,s:K,s:K,s:K,s:K,s:K,s:K,s:K,s:K,s:K,s:K,s:K}", + "worker_index", i, "pushed_local", (unsigned long long)s.pushed_local, + "dispatched_to_pending", (unsigned long long)s.dispatched_to_pending, + "pushed_remote", (unsigned long long)s.pushed_remote, "popped_local", + (unsigned long long)s.popped_local, "popped_via_steal", + (unsigned long long)s.popped_via_steal, "enqueue_cas_retries", + (unsigned long long)s.enqueue_cas_retries, "dequeue_cas_retries", + (unsigned long long)s.dequeue_cas_retries, "batch_resets", + (unsigned long long)s.batch_resets, "steal_attempts", + (unsigned long long)s.steal_attempts, "steal_failures", + (unsigned long long)s.steal_failures, "parked", + (unsigned long long)s.parked, "last_steal_attempt_ns", + (unsigned long long)s.last_steal_attempt_ns, "fairness_arm_fires", + (unsigned long long)s.fairness_arm_fires); + if (d == NULL) { + Py_DECREF(result); + return NULL; + } + PyList_SET_ITEM(result, i, d); // steals ref + } + return result; +} + +/// @brief Snapshot the per-tagged-queue contention counters. +/// @details Returns one dict per assigned BOCQueue (state == +/// BOC_QUEUE_ASSIGNED) carrying the four @c memory_order_relaxed +/// counters bumped by @c boc_enqueue / @c boc_dequeue. Unassigned +/// queues are skipped because their tag is NULL. +/// @param module The _core module +/// @param Py_UNUSED +/// @return A list of per-queue stats dicts, or NULL on error. 
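Both snapshot entry points rely on the same relaxed-counter discipline: the hot paths bump monotonic counters with relaxed `fetch_add`, and the snapshot side samples them with relaxed loads, accepting that individual fields may be read at slightly different instants. A minimal sketch of that discipline, using only the typed-atomics calls that appear elsewhere in this patch (the `demo_*` names are illustrative, not part of the source):

```c
#include <stdint.h>

#include "compat.h"  // boc_atomic_* typed API introduced by this patch

typedef struct {
  boc_atomic_u64_t pushed;  // bumped on every dispatch
  boc_atomic_u64_t popped;  // bumped on every successful pop
} demo_stats_t;

// Hot path: a relaxed increment is enough for a monotonic counter.
static void demo_on_dispatch(demo_stats_t *s) {
  boc_atomic_fetch_add_u64_explicit(&s->pushed, 1, BOC_MO_RELAXED);
}

// Snapshot: relaxed loads. Each counter is individually coherent, but
// the pair may straddle an update; for monotonic counters a stale
// sample can only under-report, which is fine for benchmarking.
static void demo_snapshot(demo_stats_t *s, uint64_t *pushed,
                          uint64_t *popped) {
  *pushed = boc_atomic_load_u64_explicit(&s->pushed, BOC_MO_RELAXED);
  *popped = boc_atomic_load_u64_explicit(&s->popped, BOC_MO_RELAXED);
}
```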
+static PyObject *_core_queue_stats(PyObject *Py_UNUSED(module), + PyObject *Py_UNUSED(dummy)) { + PyObject *result = PyList_New(0); + if (result == NULL) { + return NULL; + } + BOCQueue *qptr = BOC_QUEUES; + for (size_t i = 0; i < BOC_QUEUE_COUNT; ++i, ++qptr) { + int_least64_t state = + atomic_load_explicit(&qptr->state, memory_order_relaxed); + if (state != BOC_QUEUE_ASSIGNED) { + continue; + } + BOCTag *tag = + (BOCTag *)atomic_load_explicit(&qptr->tag, memory_order_relaxed); + PyObject *tag_obj; + if (tag != NULL && tag->str != NULL) { + tag_obj = PyUnicode_FromString(tag->str); + if (tag_obj == NULL) { + Py_DECREF(result); + return NULL; + } + } else { + tag_obj = Py_NewRef(Py_None); + } + uint64_t enq_r = boc_atomic_load_u64_explicit(&qptr->enqueue_cas_retries, + BOC_MO_RELAXED); + uint64_t deq_r = boc_atomic_load_u64_explicit(&qptr->dequeue_cas_retries, + BOC_MO_RELAXED); + uint64_t pushed = + boc_atomic_load_u64_explicit(&qptr->pushed_total, BOC_MO_RELAXED); + uint64_t popped = + boc_atomic_load_u64_explicit(&qptr->popped_total, BOC_MO_RELAXED); + PyObject *d = Py_BuildValue( + "{s:n,s:N,s:K,s:K,s:K,s:K}", "queue_index", (Py_ssize_t)qptr->index, + "tag", tag_obj, // steals ref + "enqueue_cas_retries", (unsigned long long)enq_r, "dequeue_cas_retries", + (unsigned long long)deq_r, "pushed_total", (unsigned long long)pushed, + "popped_total", (unsigned long long)popped); + if (d == NULL) { + Py_DECREF(result); + return NULL; + } + if (PyList_Append(result, d) < 0) { + Py_DECREF(d); + Py_DECREF(result); + return NULL; + } + Py_DECREF(d); + } + return result; +} + +/// @brief Initialise the scheduler runtime for a fresh start cycle. +/// @details Tears down any previous per-worker array, then allocates +/// a new one of the requested size and resets the registration +/// counter. Called by @c behaviors.start() exactly once per +/// `start()`/`wait()`/`start()` cycle, before worker sub-interpreters +/// are spawned. Idempotent in the down state. +/// @param module The _core module +/// @param arg PyLong worker_count (must be >= 0) +/// @return Py_None on success, NULL with an exception on failure. +static PyObject *_core_scheduler_runtime_start(PyObject *Py_UNUSED(module), + PyObject *arg) { + long long n = PyLong_AsLongLong(arg); + if (n == -1 && PyErr_Occurred()) { + return NULL; + } + if (n < 0) { + PyErr_SetString(PyExc_ValueError, + "scheduler_runtime_start: worker_count must be >= 0"); + return NULL; + } + // Idempotent shutdown: safe whether or not a previous cycle ran. + boc_sched_shutdown(); + if (boc_sched_init((Py_ssize_t)n) < 0) { + return NULL; // exception already set + } + + // Allocate one fairness-token BOCBehavior per worker. Tokens + // are zero-initialised so every refcount / cown-array field is the + // safe NULL state, and `is_token = 1` discriminates them at the + // worker-pop site. Allocation lives here (and not in + // `boc_sched_init`) because `sched.c` deliberately treats + // `BOCBehavior` as opaque. + for (Py_ssize_t i = 0; i < (Py_ssize_t)n; ++i) { + BOCBehavior *token = (BOCBehavior *)PyMem_RawCalloc(1, sizeof(BOCBehavior)); + if (token == NULL) { + // Roll back any tokens already installed and tear the runtime + // back down so the caller sees a clean failure (no half-init). 
+ for (Py_ssize_t j = 0; j < i; ++j) { + BOCBehavior *prev = NULL; + boc_bq_node_t *prev_node = boc_sched_get_token_node(j); + if (prev_node != NULL) { + prev = BEHAVIOR_FROM_BQ_NODE(prev_node); + } + boc_sched_set_token_node(j, NULL); + if (prev != NULL) { + PyMem_RawFree(prev); + } + } + boc_sched_shutdown(); + PyErr_NoMemory(); + return NULL; + } + // Mark as token. PyMem_RawCalloc has zeroed everything (NULL + // thunk/result/args/captures/requests, count == rc == 0, + // bq_node.next_in_queue == NULL). The behaviour is never + // reference-counted via BEHAVIOR_INCREF/DECREF and never visits + // the request/cown machinery; it is recycled in place by the + // token re-enqueue path. We give it an `id` of -1 so any + // diagnostic that prints `behavior->id` for a token is + // immediately recognisable. + token->is_token = 1; + token->id = -1; + token->owner_worker_index = (int16_t)i; + if (boc_sched_set_token_node(i, &token->bq_node) < 0) { + // worker_index out of range: only possible if WORKER_COUNT + // changed under us, which the GIL precludes. Defensive. + PyMem_RawFree(token); + boc_sched_shutdown(); + PyErr_SetString(PyExc_RuntimeError, + "scheduler_runtime_start: token install failed"); + return NULL; + } + // Lazy bootstrap (Verona-faithful): we do NOT enqueue the token + // onto the worker's queue here. The worker's + // `should_steal_for_fairness` flag is already initialised to + // true by `boc_sched_init` (mirrors Verona `core.h:23` — + // `should_steal_for_fairness{true}`). The first time the worker + // has a non-empty queue and calls `pop_fast`, the fairness gate + // routes through `pop_slow`, whose arm re-enqueues this token + // from `self->token_work`. From then on the heartbeat is alive + // and self-sustaining: every owner-side fairness arm fire + // re-enqueues the token, and every token consumption (by owner + // or thief) sets the owner's flag back to true via the dispatch + // loop in `_core_scheduler_worker_pop`. + } + + Py_RETURN_NONE; +} + +/// @brief Tear down the scheduler runtime at the end of a start cycle. +/// @details Frees the per-worker array and resets the registration +/// counter. Idempotent. Called by @c behaviors.stop_workers after the +/// worker threads have been joined. +/// @param module The _core module +/// @param Py_UNUSED +/// @return Py_None. +static PyObject *_core_scheduler_runtime_stop(PyObject *Py_UNUSED(module), + PyObject *Py_UNUSED(dummy)) { + // Recover and free per-worker fairness tokens before + // `boc_sched_shutdown` frees the worker array. Each token is a + // bare `BOCBehavior` allocated by `_core_scheduler_runtime_start` + // via PyMem_RawCalloc; it never goes through behavior_free / + // BEHAVIOR_DECREF (zero refcount, no captured cowns). + Py_ssize_t worker_count = boc_sched_worker_count(); + for (Py_ssize_t i = 0; i < worker_count; ++i) { + boc_bq_node_t *node = boc_sched_get_token_node(i); + if (node == NULL) { + continue; + } + BOCBehavior *token = BEHAVIOR_FROM_BQ_NODE(node); + boc_sched_set_token_node(i, NULL); + PyMem_RawFree(token); + } + boc_sched_shutdown(); + Py_RETURN_NONE; +} + +/// @brief Atomically claim a worker slot for the calling thread. +/// @details Wraps @ref boc_sched_worker_register. Returns the +/// assigned slot index (0..worker_count-1) on success. Raises +/// @c RuntimeError if no free slot remains (over-registration: more +/// callers than @c boc_sched_init was given). +/// @param module The _core module +/// @param Py_UNUSED +/// @return PyLong slot index, or NULL with RuntimeError set. 
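The slot-claim primitive itself lives in `sched.c` and is not part of this hunk; the following is only a plausible sketch of how such an atomic claim can be written with the typed-atomics API, not the actual `boc_sched_worker_register` body (`g_next_slot` and `g_worker_count` are invented names):

```c
#include <stdint.h>

#include "compat.h"

static boc_atomic_u64_t g_next_slot;  // next unclaimed worker index
static uint64_t g_worker_count;       // fixed by the runtime-start call

// Each caller atomically takes the next index; callers past the end
// report over-registration. acq_rel is a conservative choice for the
// claim so per-slot initialisation by the winner is ordered behind it.
static int64_t demo_worker_register(void) {
  uint64_t slot =
      boc_atomic_fetch_add_u64_explicit(&g_next_slot, 1, BOC_MO_ACQ_REL);
  if (slot >= g_worker_count) {
    return -1;  // more callers than boc_sched_init was given
  }
  return (int64_t)slot;
}
```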
+static PyObject *_core_scheduler_worker_register(PyObject *Py_UNUSED(module), + PyObject *Py_UNUSED(dummy)) { + Py_ssize_t slot = boc_sched_worker_register(); + if (slot < 0) { + PyErr_SetString( + PyExc_RuntimeError, + "scheduler_worker_register: no free worker slot (over-registration)"); + return NULL; + } + return PyLong_FromSsize_t(slot); +} + +/// @brief Set @c stop_requested on every worker and wake them all. +/// @details Wraps @ref boc_sched_worker_request_stop_all. Idempotent. +/// Production callers: @c behaviors.stop_workers and +/// @c Behaviors.terminator_callback (see @c src/bocpy/behaviors.py). +/// @param module The _core module +/// @param Py_UNUSED +/// @return Py_None. +static PyObject *_core_scheduler_request_stop_all(PyObject *Py_UNUSED(module), + PyObject *Py_UNUSED(dummy)) { + boc_sched_worker_request_stop_all(); + Py_RETURN_NONE; +} + +/// @brief Wait for the next behaviour and return it as a BehaviorCapsule. +/// @details The production consumer entry point. The calling +/// thread must already be registered with @ref boc_sched_worker_register +/// (the worker bootstrap calls that on entry to @c do_work). +/// +/// Hot path: @ref boc_sched_worker_pop_fast — pending or own queue. +/// Drops to @ref boc_sched_worker_pop_slow which parks under the +/// worker's @c cv until @ref boc_sched_dispatch wakes it or +/// @ref boc_sched_worker_request_stop_all flips @c stop_requested. +/// Returns @c None when @c pop_slow returns NULL (stop signal); the +/// worker treats that as the loop-exit condition. +/// +/// **Refcount transfer.** The producer in +/// @c behavior_resolve_one calls @c BEHAVIOR_INCREF before +/// @c boc_sched_dispatch, taking a queue-owned reference. This +/// function consumes that reference and installs it in the freshly +/// allocated @c BehaviorCapsule. The capsule's @c tp_dealloc runs +/// @c BEHAVIOR_DECREF on the worker side, balancing the producer +/// side. Do not @c BEHAVIOR_INCREF here. +/// +/// **GIL.** The slow arm releases the GIL across @c cnd_wait +/// internally (see @ref boc_sched_worker_pop_slow). This wrapper +/// therefore needs no surrounding @c Py_BEGIN_ALLOW_THREADS — the +/// only blocking syscall is wrapped at the C layer. +/// +/// **Allocation failure.** If @c tp_alloc fails after a successful +/// pop, the popped behaviour is leaked (its queue-owned reference +/// is never balanced). This is a defensive path that requires +/// memory exhaustion mid-dispatch; logging-and-returning-None would +/// hide the leak. We surface the @c PyErr_NoMemory and let the +/// worker's exception handler log it; the leak is preferable to a +/// double-free. +/// @param module The _core module +/// @param Py_UNUSED unused arg +/// @return Fresh BehaviorCapsule, or @c None on shutdown. NULL on +/// error with a Python exception set. +static PyObject *_core_scheduler_worker_pop(PyObject *Py_UNUSED(module), + PyObject *Py_UNUSED(dummy)) { + boc_sched_worker_t *self = boc_sched_current_worker(); + if (self == NULL) { + PyErr_SetString(PyExc_RuntimeError, + "scheduler_worker_pop: thread not registered"); + return NULL; + } + // Token-loop. Mirrors Verona `SchedulerThread::run_inner` + // (`schedulerthread.h`), which dequeues a `Work*`, executes its + // closure, and loops back if the closure was the per-Core + // `token_work`. bocpy keeps the loop here (rather than inside + // `boc_sched_worker_pop_*`) so the sched TU stays opaque to + // `BOCBehavior` layout: only this TU knows how to dereference + // `is_token`. 
The token's "thunk" body is the C-side helper
+  // `boc_sched_set_steal_flag(self, true)` — same effect as the
+  // Verona closure at `core.h:28-32`.
+  BOCBehavior *behavior;
+  for (;;) {
+    boc_bq_node_t *n = boc_sched_worker_pop_fast(self);
+    if (n == NULL) {
+      n = boc_sched_worker_pop_slow(self);
+      if (n == NULL) {
+        // pop_slow returns NULL only when stop_requested is set.
+        Py_RETURN_NONE;
+      }
+    }
+    behavior = BEHAVIOR_FROM_BQ_NODE(n);
+    if (!behavior->is_token) {
+      break;
+    }
+    // Token sentinel: set the OWNING worker's fairness flag, not
+    // ours. The token may have been stolen and is now running on
+    // a thief — but the heartbeat must report back to the owner so
+    // the owner's `pop_slow` fairness arm fires next time it has
+    // local work, re-enqueueing the token from the owner's
+    // `self->token_work` slot. Verona achieves the same effect by
+    // capturing the owning core's `this` in `Closure::make`
+    // (`core.h:24-32`); we use an explicit `owner_worker_index`
+    // field on the token because closures are not free in C.
+    //
+    // The token's `bq_node` is dropped here (NOT re-enqueued by
+    // this thread). The owner's slow-path arm is the only place
+    // that ever re-enqueues a token, and it always uses its own
+    // `token_work` slot — so the bq_node is owner-owned and
+    // single-producer for re-enqueue purposes (no cross-thread
+    // double-enqueue risk).
+    boc_sched_worker_t *owner =
+        boc_sched_worker_at(behavior->owner_worker_index);
+    boc_sched_set_steal_flag(owner, true);
+  }
+  PyTypeObject *type = BOC_STATE->behavior_capsule_type;
+  BehaviorCapsuleObject *capsule =
+      (BehaviorCapsuleObject *)type->tp_alloc(type, 0);
+  if (capsule == NULL) {
+    return NULL;
+  }
+  // Transfer the queue-owned reference into the capsule. Do NOT
+  // BEHAVIOR_INCREF: the producer already incref'd before dispatch.
+  capsule->behavior = behavior;
+  return (PyObject *)capsule;
+}
+
+/// @brief Drain every per-worker queue and return the behaviours
+/// as a list of @c BehaviorCapsule objects.
+/// @details Used by @c behaviors.stop_workers after the worker
+/// threads have joined but before @c scheduler_runtime_stop frees
+/// the worker array. Each worker's @c bq_t is repeatedly dequeued
+/// until empty; each popped node is wrapped in a fresh
+/// @c BehaviorCapsule, transferring the queue-owned reference (no
+/// extra @c BEHAVIOR_INCREF). The Python caller then runs
+/// @c release_all on each capsule to unwind MCS chains and drop the
+/// terminator hold the original @c whencall took. Calling this with
+/// the runtime down (@c WORKER_COUNT == 0) returns an empty list.
+///
+/// **Why not in @c boc_sched_shutdown.** Releasing the cown chain
+/// requires Python-level @c release_all (it touches @c BOCRequest
+/// arrays whose freeing routes through @c COWN_DECREF). Doing this
+/// in C without the GIL would also deadlock against any pending
+/// noticeboard mutator. The Python orchestration layer is the right
+/// place to coordinate.
+/// @param module The _core module
+/// @param Py_UNUSED unused arg
+/// @return Fresh @c list[BehaviorCapsule] (possibly empty), or NULL
+/// on allocation failure with a Python exception set.
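The pop loop above and the drain function that follows both recover the owning behaviour from a popped node with `BEHAVIOR_FROM_BQ_NODE`. A self-contained toy of that `offsetof` round-trip (the `toy_*` names are placeholders; only the pointer arithmetic mirrors the macro defined earlier in this file):

```c
#include <stddef.h>

typedef struct {
  void *next_in_queue;  // stand-in for boc_bq_node_t's link field
} toy_node_t;

typedef struct {
  long id;          // hot fields first, as in BOCBehavior ...
  toy_node_t link;  // ... intrusive node embedded at a known offset
} toy_t;

// container_of in portable C11: subtract the embedding offset.
#define TOY_FROM_NODE(n) ((toy_t *)((char *)(n) - offsetof(toy_t, link)))

static toy_t *toy_round_trip(toy_t *t) {
  toy_node_t *n = &t->link;  // producer hands out the embedded node
  return TOY_FROM_NODE(n);   // consumer recovers the owner: equals t
}
```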
+static PyObject *_core_scheduler_drain_all_queues(PyObject *Py_UNUSED(module), + PyObject *Py_UNUSED(dummy)) { + PyObject *out = PyList_New(0); + if (out == NULL) { + return NULL; + } + Py_ssize_t worker_count = boc_sched_worker_count(); + PyTypeObject *type = BOC_STATE->behavior_capsule_type; + for (Py_ssize_t i = 0; i < worker_count; ++i) { + boc_sched_worker_t *w = boc_sched_worker_at(i); + if (w == NULL) { + continue; + } + for (;;) { + boc_bq_node_t *n = boc_wsq_dequeue(w); + if (n == NULL) { + break; + } + BOCBehavior *behavior = BEHAVIOR_FROM_BQ_NODE(n); + if (behavior->is_token) { + // Token sentinels are not reference-counted and own no + // cowns; they live in the per-worker `token_work` slot and + // are freed by `_core_scheduler_runtime_stop`. Skip them + // here so we don't hand a token to the Python release-all + // path (which would dereference NULL request arrays). + continue; + } + BehaviorCapsuleObject *capsule = + (BehaviorCapsuleObject *)type->tp_alloc(type, 0); + if (capsule == NULL) { + // Rebalance the queue-owned reference we just popped before + // bailing — otherwise the behaviour leaks. + BEHAVIOR_DECREF(behavior); + Py_DECREF(out); + return NULL; + } + capsule->behavior = behavior; // ref transferred in + if (PyList_Append(out, (PyObject *)capsule) < 0) { + Py_DECREF(capsule); + Py_DECREF(out); + return NULL; + } + Py_DECREF(capsule); // list owns it now + } + } + return out; +} + static PyMethodDef _core_module_methods[] = { {"send", _core_send, METH_VARARGS, "send($module, tag, contents, /)\n--\n\nSends a message."}, @@ -5575,6 +4874,34 @@ static PyMethodDef _core_module_methods[] = { {"cowns", _core_cowns, METH_NOARGS, NULL}, {"set_tags", _core_set_tags, METH_VARARGS, "set_tags($module, tags, /)\n--\n\nAssigns tags to message queues."}, + {"scheduler_stats", _core_scheduler_stats, METH_NOARGS, + "scheduler_stats($module, /)\n--\n\n" + "Snapshot of per-worker scheduler counters (one dict per worker; " + "empty list when the runtime is down)."}, + {"queue_stats", _core_queue_stats, METH_NOARGS, + "queue_stats($module, /)\n--\n\n" + "Snapshot of per-tagged-queue contention counters."}, + {"scheduler_runtime_start", _core_scheduler_runtime_start, METH_O, + "scheduler_runtime_start($module, worker_count, /)\n--\n\n" + "Allocate the per-worker scheduler array. Called by behaviors.start()."}, + {"scheduler_runtime_stop", _core_scheduler_runtime_stop, METH_NOARGS, + "scheduler_runtime_stop($module, /)\n--\n\n" + "Free the per-worker scheduler array. Called by behaviors.stop_workers."}, + {"scheduler_worker_register", _core_scheduler_worker_register, METH_NOARGS, + "scheduler_worker_register($module, /)\n--\n\n" + "Claim the next free worker slot for the calling thread. " + "Raises RuntimeError on over-registration."}, + {"scheduler_request_stop_all", _core_scheduler_request_stop_all, + METH_NOARGS, + "scheduler_request_stop_all($module, /)\n--\n\n" + "Set stop_requested on every worker and wake them all."}, + {"scheduler_worker_pop", _core_scheduler_worker_pop, METH_NOARGS, + "scheduler_worker_pop($module, /)\n--\n\n" + "Wait for and return the next BehaviorCapsule, or None on shutdown."}, + {"scheduler_drain_all_queues", _core_scheduler_drain_all_queues, + METH_NOARGS, + "scheduler_drain_all_queues($module, /)\n--\n\n" + "Drain every per-worker queue. 
Returns list[BehaviorCapsule]."}, {"_cown_capsule_from_pointer", _cown_capsule_from_pointer, METH_VARARGS, NULL}, {"cown_pin_pointers", _core_cown_pin_pointers, METH_VARARGS, @@ -5651,36 +4978,86 @@ static int _core_module_exec(PyObject *module) { qptr->index = i; qptr->messages = (BOCMessage **)PyMem_RawCalloc(BOC_CAPACITY, sizeof(BOCMessage *)); + if (qptr->messages == NULL) { + // Unwind the queues we already initialised. boc_park_init has + // been called for indices [0, i); any messages buffer they hold + // must be freed. + for (size_t j = 0; j < i; ++j) { + PyMem_RawFree(BOC_QUEUES[j].messages); + BOC_QUEUES[j].messages = NULL; + boc_park_destroy(&BOC_QUEUES[j]); + } + atomic_fetch_sub(&BOC_COUNT, 1); + PyErr_NoMemory(); + return -1; + } memset(qptr->messages, 0, BOC_CAPACITY * sizeof(BOCMessage *)); qptr->head = 0; qptr->tail = 0; qptr->state = BOC_QUEUE_UNASSIGNED; qptr->tag = 0; qptr->waiters = 0; + boc_atomic_store_u64_explicit(&qptr->enqueue_cas_retries, 0, + BOC_MO_RELAXED); + boc_atomic_store_u64_explicit(&qptr->dequeue_cas_retries, 0, + BOC_MO_RELAXED); + boc_atomic_store_u64_explicit(&qptr->pushed_total, 0, BOC_MO_RELAXED); + boc_atomic_store_u64_explicit(&qptr->popped_total, 0, BOC_MO_RELAXED); boc_park_init(qptr); } BOCRecycleQueue *queue_stub = (BOCRecycleQueue *)PyMem_RawMalloc(sizeof(BOCRecycleQueue)); + if (queue_stub == NULL) { + // Unwind every queue. + for (size_t i = 0; i < BOC_QUEUE_COUNT; ++i) { + PyMem_RawFree(BOC_QUEUES[i].messages); + BOC_QUEUES[i].messages = NULL; + boc_park_destroy(&BOC_QUEUES[i]); + } + atomic_fetch_sub(&BOC_COUNT, 1); + PyErr_NoMemory(); + return -1; + } queue_stub->head = 0; queue_stub->tail = NULL; queue_stub->next = 0; atomic_store_intptr(&BOC_RECYCLE_QUEUE_HEAD, (intptr_t)queue_stub); BOC_RECYCLE_QUEUE_TAIL = queue_stub; - // Initialize the noticeboard - memset(&NB, 0, sizeof(NB)); - boc_mtx_init(&NB.mutex); - - // Initialize the notice_sync barrier primitives. - boc_mtx_init(&NB_SYNC_MUTEX); - cnd_init(&NB_SYNC_COND); + // Initialize the noticeboard subsystem (mutex + sync primitives). + // noticeboard_init / terminator_init currently return void; if + // they ever start failing, this site will need to propagate the + // error through `_core_module_exec`. + noticeboard_init(); // Initialize the terminator primitives. // The Pyrona seed (count=1, seeded=1) is set by terminator_reset() // when the runtime starts; here we only initialize the kernel objects. - boc_mtx_init(&TERMINATOR_MUTEX); - cnd_init(&TERMINATOR_COND); + terminator_init(); + + // Initialize the scheduler module with no workers. The + // per-worker array stays unallocated and `_core.scheduler_stats()` + // returns an empty list until `behaviors.start()` calls + // `scheduler_runtime_start` with the real worker count. + if (boc_sched_init(0) < 0) { + // Unwind every globally-allocated subsystem before returning -1 + // so that the BOC_COUNT == 0 invariant ("first interpreter has + // not yet completed module init") is restored. + noticeboard_destroy(); + // terminator currently has no destroy entry point; its kernel + // objects (mutex + cv) are reusable across init/destroy cycles. 
+ PyMem_RawFree((void *)BOC_RECYCLE_QUEUE_TAIL); + BOC_RECYCLE_QUEUE_TAIL = NULL; + atomic_store_intptr(&BOC_RECYCLE_QUEUE_HEAD, 0); + for (size_t i = 0; i < BOC_QUEUE_COUNT; ++i) { + PyMem_RawFree(BOC_QUEUES[i].messages); + BOC_QUEUES[i].messages = NULL; + boc_park_destroy(&BOC_QUEUES[i]); + } + atomic_fetch_sub(&BOC_COUNT, 1); + return -1; + } #ifdef BOC_REF_TRACKING #ifdef _WIN32 @@ -5762,27 +5139,41 @@ static int _core_module_exec(PyObject *module) { static int _core_module_clear(PyObject *module) { PRINTDBG("_core_module_clear\n"); _core_module_state *state = (_core_module_state *)PyModule_GetState(module); + if (state == NULL) { + return 0; + } Py_CLEAR(state->loads); Py_CLEAR(state->dumps); Py_CLEAR(state->pickle); Py_CLEAR(state->cown_capsule_type); Py_CLEAR(state->behavior_capsule_type); - // this needs to be cleared here, as it was allocated on this interpreter. - Py_CLEAR(state->recycle_queue->xidata_to_cowns); + // The recycle_queue is allocated late in module_exec; it may be NULL if + // module_exec returned -1 before reaching BOCRecycleQueue_new(). The + // worker recycle queue's xidata_to_cowns dict is owned by this + // interpreter and must be cleared here so the GC can collect any + // reference cycles anchored through it. + if (state->recycle_queue != NULL) { + Py_CLEAR(state->recycle_queue->xidata_to_cowns); + } // Clear the thread-local snapshot cache so the GC can collect any // reference cycles anchored through the cached dict / proxy. - nb_drop_local_cache(); + noticeboard_drop_local_cache(); return 0; } void _core_module_free(void *module_ptr) { PyObject *module = (PyObject *)module_ptr; _core_module_state *state = (_core_module_state *)PyModule_GetState(module); + if (state == NULL) { + return; + } PRINTDBG("begin boc_free(index=%" PRIdLEAST64 ")\n", state->index); PRINTDBG("Emptying _core recycle queue...\n"); - BOCRecycleQueue_empty(state->recycle_queue, true); + if (state->recycle_queue != NULL) { + BOCRecycleQueue_empty(state->recycle_queue, true); + } _core_module_clear(module); for (size_t i = 0; i < BOC_QUEUE_COUNT; ++i) { @@ -5819,29 +5210,12 @@ void _core_module_free(void *module_ptr) { BOC_RECYCLE_QUEUE_TAIL = NULL; atomic_store_intptr(&BOC_RECYCLE_QUEUE_HEAD, 0); - // Clear the thread-local snapshot cache before freeing entries - Py_CLEAR(NB_SNAPSHOT_CACHE); - - // Collect noticeboard entries to free after releasing the mutex. - XIDATA_T *nb_to_free[NB_MAX_ENTRIES]; - int nb_to_free_count = 0; + // Tear down the noticeboard subsystem (snapshot cache, entries, + // pins, mutex, sync primitives). + noticeboard_destroy(); - mtx_lock(&NB.mutex); - for (int i = 0; i < NB.count; i++) { - if (NB.entries[i].value != NULL) { - nb_to_free[nb_to_free_count++] = NB.entries[i].value; - NB.entries[i].value = NULL; - } - } - NB.count = 0; - mtx_unlock(&NB.mutex); - - for (int i = 0; i < nb_to_free_count; i++) { - XIDATA_FREE(nb_to_free[i]); - } - - // Destroy noticeboard mutex - mtx_destroy(&NB.mutex); + // Tear down the scheduler instrumentation skeleton. 
+ boc_sched_shutdown(); BOC_REF_TRACKING_REPORT(); } @@ -5851,12 +5225,19 @@ void _core_module_free(void *module_ptr) { static int _core_module_traverse(PyObject *module, visitproc visit, void *arg) { _core_module_state *state = (_core_module_state *)PyModule_GetState(module); + if (state == NULL) { + return 0; + } Py_VISIT(state->loads); Py_VISIT(state->dumps); Py_VISIT(state->pickle); Py_VISIT(state->cown_capsule_type); Py_VISIT(state->behavior_capsule_type); - Py_VISIT(state->recycle_queue->xidata_to_cowns); + // recycle_queue is allocated late in module_exec; if exec failed before + // reaching BOCRecycleQueue_new() the field is still NULL. + if (state->recycle_queue != NULL) { + Py_VISIT(state->recycle_queue->xidata_to_cowns); + } return 0; } diff --git a/src/bocpy/_core.pyi b/src/bocpy/_core.pyi new file mode 100644 index 0000000..7c39e6b --- /dev/null +++ b/src/bocpy/_core.pyi @@ -0,0 +1,66 @@ +"""Type stubs for private :mod:`bocpy._core` accessors. + +Public re-exports are stubbed in :mod:`bocpy.__init__`; this file +only covers the private accessors used by the test suite and +internal tooling. +""" + +from typing import Any + + +def scheduler_stats() -> list[dict[str, Any]]: + """Snapshot the per-worker scheduler counters. + + Returns ``[]`` when the scheduler runtime is down (no workers + allocated) -- this includes the window between :func:`bocpy.wait` + returning and the next :func:`bocpy.start` / ``@when`` call. To + capture a snapshot for a session that has just ended, use + :func:`bocpy.wait` with ``stats=True``. + + When the runtime is up, returns a list with one dict per worker, + each carrying the fields ``worker_index``, ``pushed_local``, + ``dispatched_to_pending``, ``pushed_remote``, ``popped_local``, + ``popped_via_steal``, ``enqueue_cas_retries``, + ``dequeue_cas_retries``, ``batch_resets``, ``steal_attempts``, + ``steal_failures``, ``parked``, ``last_steal_attempt_ns``, and + ``fairness_arm_fires``. + + Counter semantics: + + * ``pushed_local`` / ``dispatched_to_pending`` / ``pushed_remote`` + record this worker's *role as producer*: they are bumped when + this worker dispatches a behaviour (locally, into the empty + ``pending`` slot, or onto another worker). They are **not** + bumped when nodes arrive via a thief's + ``boc_wsq_enqueue_spread`` re-distribution -- the global + reconciliation ``Σ pushed_* == Σ popped_*`` holds across all + workers, but the per-worker ``pushed_* − popped_*`` is **not** + a local queue-depth estimate. + * ``parked`` counts cumulative entries to the ``cnd_wait`` park + arm. + * ``last_steal_attempt_ns`` is a monotonic timestamp (ns; zero + if the worker has never attempted a steal) of this worker's + most recent steal attempt. + * ``fairness_arm_fires`` counts the times this worker actually + honoured ``should_steal_for_fairness`` (flag set AND queue + non-empty when ``pop_slow`` checked it). + + Reads are best-effort (``memory_order_relaxed``); the snapshot + may observe individual counters from different points in time. + + :return: A list of per-worker stats dicts. + :rtype: list[dict[str, Any]] + """ + + +def queue_stats() -> list[dict[str, Any]]: + """Snapshot the per-tagged-queue contention counters. + + Returns one dict per assigned ``BOCQueue``. Each dict carries + ``queue_index``, ``tag`` (str or ``None``), ``enqueue_cas_retries``, + ``dequeue_cas_retries``, ``pushed_total``, and ``popped_total``. + Reads are best-effort (``memory_order_relaxed``). + + :return: A list of per-queue stats dicts. 
+    :rtype: list[dict[str, Any]]
+    """
diff --git a/src/bocpy/_internal_test.c b/src/bocpy/_internal_test.c
new file mode 100644
index 0000000..32953e9
--- /dev/null
+++ b/src/bocpy/_internal_test.c
@@ -0,0 +1,73 @@
+/// @file _internal_test.c
+/// @brief Bridge module that aggregates per-domain test helpers under
+/// `bocpy._internal_test`.
+///
+/// Each domain is a separate translation unit (`_internal_test_*.c`)
+/// that exposes a `boc_internal_test_register_<domain>` registrar.
+/// This file owns only the `PyModuleDef` + `PyInit__internal_test`
+/// scaffolding and calls every registrar once at import time.
+///
+/// Domains so far:
+/// - `atomics_*` — typed `boc_atomic_*_explicit` API
+///   (`_internal_test_atomics.c`).
+/// - `bq_*` — Verona-style behaviour MPMC queue
+///   (`_internal_test_bq.c`).
+///
+/// The module deliberately does NOT link against `_core` or `_math`.
+/// It links only the units it tests (`compat.c`, `sched.c`) so the
+/// test surface stays minimal and there is no sub-interpreter
+/// machinery in the way of the test threads.
+
+#define PY_SSIZE_T_CLEAN
+
+#include <Python.h>
+
+extern int boc_internal_test_register_atomics(PyObject *module);
+extern int boc_internal_test_register_bq(PyObject *module);
+extern int boc_internal_test_register_wsq(PyObject *module);
+
+/// @brief Multi-phase init: register the test methods on the module.
+/// @details Single-phase init re-enables the GIL on free-threaded
+/// builds (CPython 3.13t+) because there is no slot to declare GIL
+/// independence. Multi-phase init lets us set @c Py_mod_gil to
+/// @c Py_MOD_GIL_NOT_USED. The harness only manipulates POD test
+/// fixtures (typed atomics under @c _atomics, raw bq nodes under
+/// @c _bq) and does not touch any Python state that would race
+/// without the GIL.
+static int _internal_test_exec(PyObject *m) {
+  if (boc_internal_test_register_atomics(m) < 0) {
+    return -1;
+  }
+  if (boc_internal_test_register_bq(m) < 0) {
+    return -1;
+  }
+  if (boc_internal_test_register_wsq(m) < 0) {
+    return -1;
+  }
+  return 0;
+}
+
+static PyModuleDef_Slot _internal_test_slots[] = {
+    {Py_mod_exec, (void *)_internal_test_exec},
+#if PY_VERSION_HEX >= 0x030C0000
+    {Py_mod_multiple_interpreters, Py_MOD_PER_INTERPRETER_GIL_SUPPORTED},
+#endif
+#if PY_VERSION_HEX >= 0x030D0000
+    {Py_mod_gil, Py_MOD_GIL_NOT_USED},
+#endif
+    {0, NULL},
+};
+
+static struct PyModuleDef moduledef = {
+    PyModuleDef_HEAD_INIT,
+    .m_name = "_internal_test",
+    .m_doc = "Test harness for bocpy internal C primitives "
+             "(typed atomics, MPMC queue, ...).",
+    .m_size = 0,
+    .m_methods = NULL,  // methods are added by registrars in exec slot
+    .m_slots = _internal_test_slots,
+};
+
+PyMODINIT_FUNC PyInit__internal_test(void) {
+  return PyModuleDef_Init(&moduledef);
+}
diff --git a/src/bocpy/_internal_test_atomics.c b/src/bocpy/_internal_test_atomics.c
new file mode 100644
index 0000000..0550285
--- /dev/null
+++ b/src/bocpy/_internal_test_atomics.c
@@ -0,0 +1,423 @@
+/// @file _internal_test_atomics.c
+/// @brief Atomics-domain tests for the `bocpy._internal_test` extension.
+///
+/// Exposes the typed `boc_atomic_*_explicit` API from `compat.h` to
+/// Python so `test/test_compat_atomics.py` can drive the inline
+/// atomic primitives from real Python threads (which give us true
+/// parallelism either via free-threaded CPython or via
+/// `Py_BEGIN_ALLOW_THREADS` on regular CPython). On x86/x64 the test
+/// is a smoke test of the dispatch; on ARM64 it is a weak-memory
+/// correctness test for the acquire/release pair.
+///
+/// Methods are exported under the `atomics_*` prefix on the
+/// `bocpy._internal_test` module via @ref boc_internal_test_register_atomics.
+
+#define PY_SSIZE_T_CLEAN
+
+#include <Python.h>
+#include <stdbool.h>
+#include <stdint.h>
+
+#include "compat.h"
+
+// Single shared block of atomic slots, accessed by every test entry
+// point through a PyCapsule handle. Cacheline-padded (64B tail) to
+// avoid false-sharing between the producer and consumer fields when
+// the test spawns multiple threads.
+typedef struct {
+  boc_atomic_u64_t flag;        // 0 → producer not yet ready
+  uint64_t payload;             // plain (non-atomic); guarded by flag
+  boc_atomic_u64_t counter64;   // fetch_add / CAS contention slot
+  boc_atomic_u32_t counter32;   // 32-bit fetch_add contention slot
+  boc_atomic_bool_t bool_slot;  // bool exchange / cas test
+  boc_atomic_ptr_t ptr_slot;    // ptr exchange / cas test
+  char _padding[64];
+} hs_state_t;
+
+static void hs_destroy(PyObject *cap) {
+  void *p = PyCapsule_GetPointer(cap, "boc_hs_state");
+  PyMem_RawFree(p);
+}
+
+static hs_state_t *hs_get(PyObject *cap) {
+  return (hs_state_t *)PyCapsule_GetPointer(cap, "boc_hs_state");
+}
+
+// ---------------------------------------------------------------------------
+// State setup / inspection.
+// ---------------------------------------------------------------------------
+
+static PyObject *py_make_state(PyObject *Py_UNUSED(self),
+                               PyObject *Py_UNUSED(args)) {
+  hs_state_t *h = (hs_state_t *)PyMem_RawCalloc(1, sizeof(*h));
+  if (h == NULL) {
+    return PyErr_NoMemory();
+  }
+  return PyCapsule_New(h, "boc_hs_state", hs_destroy);
+}
+
+static PyObject *py_reset(PyObject *Py_UNUSED(self), PyObject *cap) {
+  hs_state_t *h = hs_get(cap);
+  if (h == NULL) {
+    return NULL;
+  }
+  boc_atomic_store_u64_explicit(&h->flag, 0, BOC_MO_SEQ_CST);
+  h->payload = 0;
+  boc_atomic_store_u64_explicit(&h->counter64, 0, BOC_MO_SEQ_CST);
+  boc_atomic_store_u32_explicit(&h->counter32, 0, BOC_MO_SEQ_CST);
+  boc_atomic_store_bool_explicit(&h->bool_slot, false, BOC_MO_SEQ_CST);
+  boc_atomic_store_ptr_explicit(&h->ptr_slot, NULL, BOC_MO_SEQ_CST);
+  Py_RETURN_NONE;
+}
+
+static PyObject *py_load_counter64(PyObject *Py_UNUSED(self), PyObject *cap) {
+  hs_state_t *h = hs_get(cap);
+  if (h == NULL) {
+    return NULL;
+  }
+  return PyLong_FromUnsignedLongLong(
+      boc_atomic_load_u64_explicit(&h->counter64, BOC_MO_SEQ_CST));
+}
+
+static PyObject *py_load_counter32(PyObject *Py_UNUSED(self), PyObject *cap) {
+  hs_state_t *h = hs_get(cap);
+  if (h == NULL) {
+    return NULL;
+  }
+  return PyLong_FromUnsignedLong((unsigned long)boc_atomic_load_u32_explicit(
+      &h->counter32, BOC_MO_SEQ_CST));
+}
+
+static PyObject *py_load_bool(PyObject *Py_UNUSED(self), PyObject *cap) {
+  hs_state_t *h = hs_get(cap);
+  if (h == NULL) {
+    return NULL;
+  }
+  bool v = boc_atomic_load_bool_explicit(&h->bool_slot, BOC_MO_SEQ_CST);
+  if (v) {
+    Py_RETURN_TRUE;
+  }
+  Py_RETURN_FALSE;
+}
+
+static PyObject *py_load_ptr(PyObject *Py_UNUSED(self), PyObject *cap) {
+  hs_state_t *h = hs_get(cap);
+  if (h == NULL) {
+    return NULL;
+  }
+  void *v = boc_atomic_load_ptr_explicit(&h->ptr_slot, BOC_MO_SEQ_CST);
+  return PyLong_FromVoidPtr(v);
+}
+
+// ---------------------------------------------------------------------------
+// Acquire / release handshake (the canonical weak-memory test).
+// ---------------------------------------------------------------------------
+
+static PyObject *py_producer(PyObject *Py_UNUSED(self), PyObject *args) {
+  PyObject *cap;
+  unsigned long long payload;
+  if (!PyArg_ParseTuple(args, "OK", &cap, &payload)) {
+    return NULL;
+  }
+  hs_state_t *h = hs_get(cap);
+  if (h == NULL) {
+    return NULL;
+  }
+  Py_BEGIN_ALLOW_THREADS
+  // Plain non-atomic write of the payload, then a release store of
+  // the flag. A consumer that observes flag==1 with an acquire load
+  // MUST see the payload write (acq-rel synchronises-with).
+  h->payload = (uint64_t)payload;
+  boc_atomic_store_u64_explicit(&h->flag, 1, BOC_MO_RELEASE);
+  Py_END_ALLOW_THREADS Py_RETURN_NONE;
+}
+
+static PyObject *py_consumer(PyObject *Py_UNUSED(self), PyObject *cap) {
+  hs_state_t *h = hs_get(cap);
+  if (h == NULL) {
+    return NULL;
+  }
+  uint64_t got;
+  Py_BEGIN_ALLOW_THREADS while (
+      boc_atomic_load_u64_explicit(&h->flag, BOC_MO_ACQUIRE) == 0) {
+    // tight spin; the producer thread is the only writer
+  }
+  got = h->payload;
+  Py_END_ALLOW_THREADS return PyLong_FromUnsignedLongLong(
+      (unsigned long long)got);
+}
+
+// ---------------------------------------------------------------------------
+// Multi-thread fetch_add contention (relaxed counter).
+// ---------------------------------------------------------------------------
+
+static PyObject *py_fetch_add_loop_u64(PyObject *Py_UNUSED(self),
+                                       PyObject *args) {
+  PyObject *cap;
+  Py_ssize_t iters;
+  if (!PyArg_ParseTuple(args, "On", &cap, &iters)) {
+    return NULL;
+  }
+  hs_state_t *h = hs_get(cap);
+  if (h == NULL) {
+    return NULL;
+  }
+  Py_BEGIN_ALLOW_THREADS for (Py_ssize_t i = 0; i < iters; ++i) {
+    boc_atomic_fetch_add_u64_explicit(&h->counter64, 1, BOC_MO_RELAXED);
+  }
+  Py_END_ALLOW_THREADS Py_RETURN_NONE;
+}
+
+static PyObject *py_fetch_add_loop_u32(PyObject *Py_UNUSED(self),
+                                       PyObject *args) {
+  PyObject *cap;
+  Py_ssize_t iters;
+  if (!PyArg_ParseTuple(args, "On", &cap, &iters)) {
+    return NULL;
+  }
+  hs_state_t *h = hs_get(cap);
+  if (h == NULL) {
+    return NULL;
+  }
+  Py_BEGIN_ALLOW_THREADS for (Py_ssize_t i = 0; i < iters; ++i) {
+    boc_atomic_fetch_add_u32_explicit(&h->counter32, 1, BOC_MO_RELAXED);
+  }
+  Py_END_ALLOW_THREADS Py_RETURN_NONE;
+}
+
+// ---------------------------------------------------------------------------
+// Multi-thread CAS contention loop (acq_rel on success, relaxed on failure).
+// ---------------------------------------------------------------------------
+
+static PyObject *py_cas_increment_loop_u64(PyObject *Py_UNUSED(self),
+                                           PyObject *args) {
+  PyObject *cap;
+  Py_ssize_t iters;
+  if (!PyArg_ParseTuple(args, "On", &cap, &iters)) {
+    return NULL;
+  }
+  hs_state_t *h = hs_get(cap);
+  if (h == NULL) {
+    return NULL;
+  }
+  Py_BEGIN_ALLOW_THREADS for (Py_ssize_t i = 0; i < iters; ++i) {
+    uint64_t cur = boc_atomic_load_u64_explicit(&h->counter64, BOC_MO_RELAXED);
+    while (!boc_atomic_compare_exchange_strong_u64_explicit(
+        &h->counter64, &cur, cur + 1, BOC_MO_ACQ_REL, BOC_MO_RELAXED)) {
+      // CAS updates `cur` on failure; loop body is empty.
+    }
+  }
+  Py_END_ALLOW_THREADS Py_RETURN_NONE;
+}
+
+// ---------------------------------------------------------------------------
+// Single-threaded round-trip: every (op, type, order) at least once.
+// ---------------------------------------------------------------------------
+//
+// On Linux the typed API is a thin wrapper around <stdatomic.h>, so this
+// is mostly a "does it compile and link" smoke.
On MSVC it exercises the +// per-order Interlocked* dispatch; on ARM64 MSVC it exercises the +// __ldar*/__stlr* fast paths. + +static int round_trip_u64(void) { + boc_atomic_u64_t slot = 0; + const boc_memory_order_t orders[] = {BOC_MO_RELAXED, BOC_MO_ACQUIRE, + BOC_MO_RELEASE, BOC_MO_ACQ_REL, + BOC_MO_SEQ_CST}; + for (size_t i = 0; i < sizeof(orders) / sizeof(orders[0]); ++i) { + boc_memory_order_t o = orders[i]; + // store/load round-trip. + boc_atomic_store_u64_explicit(&slot, 0x1234567890ABCDEFULL, o); + if (boc_atomic_load_u64_explicit(&slot, o) != 0x1234567890ABCDEFULL) { + return -1; + } + // exchange returns previous, installs new. + uint64_t prev = boc_atomic_exchange_u64_explicit(&slot, 42ULL, o); + if (prev != 0x1234567890ABCDEFULL || + boc_atomic_load_u64_explicit(&slot, o) != 42ULL) { + return -1; + } + // fetch_add / fetch_sub. + if (boc_atomic_fetch_add_u64_explicit(&slot, 8ULL, o) != 42ULL || + boc_atomic_load_u64_explicit(&slot, o) != 50ULL) { + return -1; + } + if (boc_atomic_fetch_sub_u64_explicit(&slot, 5ULL, o) != 50ULL || + boc_atomic_load_u64_explicit(&slot, o) != 45ULL) { + return -1; + } + // CAS success. + uint64_t exp = 45ULL; + if (!boc_atomic_compare_exchange_strong_u64_explicit(&slot, &exp, 99ULL, o, + BOC_MO_RELAXED) || + boc_atomic_load_u64_explicit(&slot, o) != 99ULL) { + return -1; + } + // CAS failure must update `exp` to the current value. + exp = 0ULL; + if (boc_atomic_compare_exchange_strong_u64_explicit(&slot, &exp, 7ULL, o, + BOC_MO_RELAXED) || + exp != 99ULL) { + return -1; + } + } + return 0; +} + +static int round_trip_u32(void) { + boc_atomic_u32_t slot = 0; + const boc_memory_order_t orders[] = {BOC_MO_RELAXED, BOC_MO_ACQUIRE, + BOC_MO_RELEASE, BOC_MO_ACQ_REL, + BOC_MO_SEQ_CST}; + for (size_t i = 0; i < sizeof(orders) / sizeof(orders[0]); ++i) { + boc_memory_order_t o = orders[i]; + boc_atomic_store_u32_explicit(&slot, 0xCAFEBABEU, o); + if (boc_atomic_load_u32_explicit(&slot, o) != 0xCAFEBABEU) { + return -1; + } + uint32_t prev = boc_atomic_exchange_u32_explicit(&slot, 7U, o); + if (prev != 0xCAFEBABEU || boc_atomic_load_u32_explicit(&slot, o) != 7U) { + return -1; + } + if (boc_atomic_fetch_add_u32_explicit(&slot, 3U, o) != 7U || + boc_atomic_load_u32_explicit(&slot, o) != 10U) { + return -1; + } + if (boc_atomic_fetch_sub_u32_explicit(&slot, 4U, o) != 10U || + boc_atomic_load_u32_explicit(&slot, o) != 6U) { + return -1; + } + uint32_t exp = 6U; + if (!boc_atomic_compare_exchange_strong_u32_explicit(&slot, &exp, 99U, o, + BOC_MO_RELAXED) || + boc_atomic_load_u32_explicit(&slot, o) != 99U) { + return -1; + } + exp = 0U; + if (boc_atomic_compare_exchange_strong_u32_explicit(&slot, &exp, 7U, o, + BOC_MO_RELAXED) || + exp != 99U) { + return -1; + } + } + return 0; +} + +static int round_trip_bool(void) { + boc_atomic_bool_t slot = false; + const boc_memory_order_t orders[] = {BOC_MO_RELAXED, BOC_MO_ACQUIRE, + BOC_MO_RELEASE, BOC_MO_ACQ_REL, + BOC_MO_SEQ_CST}; + for (size_t i = 0; i < sizeof(orders) / sizeof(orders[0]); ++i) { + boc_memory_order_t o = orders[i]; + boc_atomic_store_bool_explicit(&slot, true, o); + if (!boc_atomic_load_bool_explicit(&slot, o)) { + return -1; + } + bool prev = boc_atomic_exchange_bool_explicit(&slot, false, o); + if (!prev || boc_atomic_load_bool_explicit(&slot, o)) { + return -1; + } + bool exp = false; + if (!boc_atomic_compare_exchange_strong_bool_explicit(&slot, &exp, true, o, + BOC_MO_RELAXED) || + !boc_atomic_load_bool_explicit(&slot, o)) { + return -1; + } + exp = false; + if 
(boc_atomic_compare_exchange_strong_bool_explicit(&slot, &exp, false, o, + BOC_MO_RELAXED) || + exp != true) { + return -1; + } + } + return 0; +} + +static int round_trip_ptr(void) { + boc_atomic_ptr_t slot = NULL; + int sentinel_a, sentinel_b; + void *a = (void *)&sentinel_a; + void *b = (void *)&sentinel_b; + const boc_memory_order_t orders[] = {BOC_MO_RELAXED, BOC_MO_ACQUIRE, + BOC_MO_RELEASE, BOC_MO_ACQ_REL, + BOC_MO_SEQ_CST}; + for (size_t i = 0; i < sizeof(orders) / sizeof(orders[0]); ++i) { + boc_memory_order_t o = orders[i]; + boc_atomic_store_ptr_explicit(&slot, a, o); + if (boc_atomic_load_ptr_explicit(&slot, o) != a) { + return -1; + } + void *prev = boc_atomic_exchange_ptr_explicit(&slot, b, o); + if (prev != a || boc_atomic_load_ptr_explicit(&slot, o) != b) { + return -1; + } + void *exp = b; + if (!boc_atomic_compare_exchange_strong_ptr_explicit(&slot, &exp, a, o, + BOC_MO_RELAXED) || + boc_atomic_load_ptr_explicit(&slot, o) != a) { + return -1; + } + exp = NULL; + if (boc_atomic_compare_exchange_strong_ptr_explicit(&slot, &exp, b, o, + BOC_MO_RELAXED) || + exp != a) { + return -1; + } + } + return 0; +} + +static PyObject *py_round_trip(PyObject *Py_UNUSED(self), + PyObject *Py_UNUSED(args)) { + if (round_trip_u64() < 0) { + PyErr_SetString(PyExc_AssertionError, "round_trip_u64 failed"); + return NULL; + } + if (round_trip_u32() < 0) { + PyErr_SetString(PyExc_AssertionError, "round_trip_u32 failed"); + return NULL; + } + if (round_trip_bool() < 0) { + PyErr_SetString(PyExc_AssertionError, "round_trip_bool failed"); + return NULL; + } + if (round_trip_ptr() < 0) { + PyErr_SetString(PyExc_AssertionError, "round_trip_ptr failed"); + return NULL; + } + Py_RETURN_NONE; +} + +// --------------------------------------------------------------------------- +// Registrar. +// --------------------------------------------------------------------------- + +static PyMethodDef methods[] = { + {"atomics_make_state", py_make_state, METH_NOARGS, + "Allocate a fresh state slot."}, + {"atomics_reset", py_reset, METH_O, "Reset all slots to zero/null/false."}, + {"atomics_load_counter64", py_load_counter64, METH_O, + "Load the u64 counter."}, + {"atomics_load_counter32", py_load_counter32, METH_O, + "Load the u32 counter."}, + {"atomics_load_bool", py_load_bool, METH_O, "Load the bool slot."}, + {"atomics_load_ptr", py_load_ptr, METH_O, "Load the ptr slot as int."}, + {"atomics_producer", py_producer, METH_VARARGS, + "Write payload, then release-store flag=1."}, + {"atomics_consumer", py_consumer, METH_O, + "Acquire-spin on flag, then read payload."}, + {"atomics_fetch_add_loop_u64", py_fetch_add_loop_u64, METH_VARARGS, + "Relaxed fetch_add(+1) on counter64 in a tight loop."}, + {"atomics_fetch_add_loop_u32", py_fetch_add_loop_u32, METH_VARARGS, + "Relaxed fetch_add(+1) on counter32 in a tight loop."}, + {"atomics_cas_increment_loop_u64", py_cas_increment_loop_u64, METH_VARARGS, + "Acq_rel CAS-increment of counter64 in a tight loop."}, + {"atomics_round_trip", py_round_trip, METH_NOARGS, + "Single-threaded smoke test of every (op, type, order)."}, + {NULL, NULL, 0, NULL}, +}; + +int boc_internal_test_register_atomics(PyObject *module) { + return PyModule_AddFunctions(module, methods); +} diff --git a/src/bocpy/_internal_test_bq.c b/src/bocpy/_internal_test_bq.c new file mode 100644 index 0000000..64246c6 --- /dev/null +++ b/src/bocpy/_internal_test_bq.c @@ -0,0 +1,347 @@ +/// @file _internal_test_bq.c +/// @brief BQ-domain (Verona MPMC behaviour queue) tests for +/// `bocpy._internal_test`. 
+/// +/// Exposes the `boc_bq_*` API from `sched.h` to Python so +/// `test/test_internal_mpmcq.py` can stress the queue from multiple +/// real threads. Methods are registered on the `bocpy._internal_test` +/// module under the `bq_*` prefix. +/// +/// Nodes here are bare `boc_bq_node_t` allocations carrying an +/// integer identity used by tests to verify FIFO ordering and +/// segment chains. Production code uses `BOCBehavior::bq_node` from +/// `_core.c` (verified via `pahole`); the queue itself is layout- +/// agnostic. + +#define PY_SSIZE_T_CLEAN + +#include + +#include +#include +#include + +#include "compat.h" +#include "sched.h" + +// --------------------------------------------------------------------------- +// Node and queue capsule helpers +// --------------------------------------------------------------------------- + +/// @brief Test node: a `boc_bq_node_t` followed by an integer identity. +typedef struct { + boc_bq_node_t node; ///< Link field consumed by `boc_bq_*`. + int64_t id; ///< Caller-supplied identity for FIFO checks. +} bq_test_node_t; + +#define BQ_QUEUE_CAPSULE_NAME "bocpy._internal_test.bq_queue" +#define BQ_NODE_CAPSULE_NAME "bocpy._internal_test.bq_node" + +static void bq_queue_capsule_destructor(PyObject *capsule) { + boc_bq_t *q = + (boc_bq_t *)PyCapsule_GetPointer(capsule, BQ_QUEUE_CAPSULE_NAME); + if (q != NULL) { + // Drain any leftover nodes so destroy_assert_empty does not abort + // on a leaked test queue. We do NOT free the nodes here; the + // Python side owns them via their own capsules. + boc_bq_node_t *n; + while ((n = boc_bq_dequeue(q)) != NULL) { + (void)n; + } + boc_bq_destroy_assert_empty(q); + // Raw allocator: bq queues exist precisely to be crossed between + // sub-interpreters in production (per-worker queues), so the test + // harness uses the same process-global allocator to avoid masking + // a cross-interpreter free bug behind a same-interpreter test. + PyMem_RawFree(q); + } +} + +static void bq_node_capsule_destructor(PyObject *capsule) { + bq_test_node_t *n = + (bq_test_node_t *)PyCapsule_GetPointer(capsule, BQ_NODE_CAPSULE_NAME); + if (n != NULL) { + // Raw allocator: see bq_queue_capsule_destructor above. 
+    PyMem_RawFree(n);
+  }
+}
+
+static boc_bq_t *bq_queue_from_capsule(PyObject *capsule) {
+  return (boc_bq_t *)PyCapsule_GetPointer(capsule, BQ_QUEUE_CAPSULE_NAME);
+}
+
+static bq_test_node_t *bq_node_from_capsule(PyObject *capsule) {
+  return (bq_test_node_t *)PyCapsule_GetPointer(capsule, BQ_NODE_CAPSULE_NAME);
+}
+
+// ---------------------------------------------------------------------------
+// Methods
+// ---------------------------------------------------------------------------
+
+static PyObject *bq_make_queue(PyObject *Py_UNUSED(self),
+                               PyObject *Py_UNUSED(args)) {
+  boc_bq_t *q = PyMem_RawMalloc(sizeof(boc_bq_t));
+  if (q == NULL) {
+    return PyErr_NoMemory();
+  }
+  boc_bq_init(q);
+  PyObject *capsule =
+      PyCapsule_New(q, BQ_QUEUE_CAPSULE_NAME, bq_queue_capsule_destructor);
+  if (capsule == NULL) {
+    PyMem_RawFree(q);
+    return NULL;
+  }
+  return capsule;
+}
+
+static PyObject *bq_make_node(PyObject *Py_UNUSED(self), PyObject *args) {
+  long long id;
+  if (!PyArg_ParseTuple(args, "L:bq_make_node", &id)) {
+    return NULL;
+  }
+  bq_test_node_t *n = PyMem_RawMalloc(sizeof(bq_test_node_t));
+  if (n == NULL) {
+    return PyErr_NoMemory();
+  }
+  boc_atomic_store_ptr_explicit(&n->node.next_in_queue, NULL, BOC_MO_RELAXED);
+  n->id = (int64_t)id;
+  PyObject *capsule =
+      PyCapsule_New(n, BQ_NODE_CAPSULE_NAME, bq_node_capsule_destructor);
+  if (capsule == NULL) {
+    PyMem_RawFree(n);
+    return NULL;
+  }
+  return capsule;
+}
+
+static PyObject *bq_node_id(PyObject *Py_UNUSED(self), PyObject *args) {
+  PyObject *cap;
+  if (!PyArg_ParseTuple(args, "O:bq_node_id", &cap)) {
+    return NULL;
+  }
+  bq_test_node_t *n = bq_node_from_capsule(cap);
+  if (n == NULL) {
+    return NULL;
+  }
+  return PyLong_FromLongLong((long long)n->id);
+}
+
+/// @brief Read back the raw bq_node pointer of a node capsule.
+/// @details Returns @c &node->node (the embedded @c boc_bq_node_t)
+/// as an integer. Used by the dispatch test to compare pointer
+/// identities against the integer returned by
+/// @c _core.scheduler_pop_fast.
+static PyObject *bq_node_ptr(PyObject *Py_UNUSED(self), PyObject *args) {
+  PyObject *cap;
+  if (!PyArg_ParseTuple(args, "O:bq_node_ptr", &cap)) {
+    return NULL;
+  }
+  bq_test_node_t *n = bq_node_from_capsule(cap);
+  if (n == NULL) {
+    return NULL;
+  }
+  // bq_test_node_t puts `node` first, so (void *)n == (void *)&n->node,
+  // but be explicit for clarity and to keep the test invariant readable.
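+  // (C11 6.7.2.1p15: a pointer to a structure, suitably converted,
+  // points to its initial member, so returning (boc_bq_node_t *)n
+  // here would be equivalent.)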
+ return PyLong_FromVoidPtr((void *)&n->node); +} + +static PyObject *bq_enqueue(PyObject *Py_UNUSED(self), PyObject *args) { + PyObject *qcap, *ncap; + if (!PyArg_ParseTuple(args, "OO:bq_enqueue", &qcap, &ncap)) { + return NULL; + } + boc_bq_t *q = bq_queue_from_capsule(qcap); + bq_test_node_t *n = bq_node_from_capsule(ncap); + if (q == NULL || n == NULL) { + return NULL; + } + Py_BEGIN_ALLOW_THREADS boc_bq_enqueue(q, &n->node); + Py_END_ALLOW_THREADS Py_RETURN_NONE; +} + +static PyObject *bq_enqueue_front(PyObject *Py_UNUSED(self), PyObject *args) { + PyObject *qcap, *ncap; + if (!PyArg_ParseTuple(args, "OO:bq_enqueue_front", &qcap, &ncap)) { + return NULL; + } + boc_bq_t *q = bq_queue_from_capsule(qcap); + bq_test_node_t *n = bq_node_from_capsule(ncap); + if (q == NULL || n == NULL) { + return NULL; + } + Py_BEGIN_ALLOW_THREADS boc_bq_enqueue_front(q, &n->node); + Py_END_ALLOW_THREADS Py_RETURN_NONE; +} + +static PyObject *bq_dequeue(PyObject *Py_UNUSED(self), PyObject *args) { + PyObject *qcap; + if (!PyArg_ParseTuple(args, "O:bq_dequeue", &qcap)) { + return NULL; + } + boc_bq_t *q = bq_queue_from_capsule(qcap); + if (q == NULL) { + return NULL; + } + boc_bq_node_t *raw; + Py_BEGIN_ALLOW_THREADS raw = boc_bq_dequeue(q); + Py_END_ALLOW_THREADS if (raw == NULL) { Py_RETURN_NONE; } + // Recover the embedding test-node and return its id. Tests don't + // need the original capsule object back; identity is the contract. + bq_test_node_t *n = (bq_test_node_t *)raw; + return PyLong_FromLongLong((long long)n->id); +} + +static PyObject *bq_dequeue_all(PyObject *Py_UNUSED(self), PyObject *args) { + PyObject *qcap; + if (!PyArg_ParseTuple(args, "O:bq_dequeue_all", &qcap)) { + return NULL; + } + boc_bq_t *q = bq_queue_from_capsule(qcap); + if (q == NULL) { + return NULL; + } + boc_bq_segment_t seg; + Py_BEGIN_ALLOW_THREADS seg = boc_bq_dequeue_all(q); + Py_END_ALLOW_THREADS + + PyObject *list = PyList_New(0); + if (list == NULL) { + return NULL; + } + if (seg.start == NULL) { + return list; + } + // Walk the segment via segment_take_one. take_one returns NULL for + // three reasons (mpmcq.h:67-89, also documented at + // sched.c::boc_sched_steal): + // 1. fully empty (impossible here — guarded above), + // 2. singleton segment (end == &start->next_in_queue) — append + // start as the tail and return, + // 3. broken link: producer P has CASed itself onto the queue + // tail (back.exchange) but has not yet completed the + // "publish next pointer" store. seg.start->next_in_queue + // reads as NULL, but the segment is NOT singleton — there + // is at least one more node the producer is mid-publish. + // + // Verona's WorkStealingQueue::steal handles case 3 by spreading + // the partial segment back across its multi-N WSQ. The bocpy + // production caller (boc_sched_steal) handles it by splicing the + // partial segment onto self->q, deferring the missing tail to a + // subsequent dequeue once the producer's store lands. + // + // For a test helper there is no other queue to spread/splice + // onto, AND the test contract is "every enqueued item is observed + // exactly once". The pragmatic answer is to BUSY-SPIN on the + // broken next pointer until the producer's store becomes visible. + // The producer is mid-call (between `back.exchange` and + // `b->store(seg.start, release)` — three instructions wide), so + // the spin is bounded by producer scheduling latency. 
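+  //
+  // The window, sketched (names follow verona-rt's mpmcq; the bocpy
+  // port may differ in detail):
+  //
+  //   n->next_in_queue = NULL;               // 1. init
+  //   prev = exchange(&q->back, n, acq_rel); // 2. claim the tail
+  //   store(&prev->next_in_queue, n, rel);   // 3. publish the link
+  //
+  // A consumer arriving between 2 and 3 observes the broken link.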
Without + // this spin the previous implementation silently dropped the + // entire post-broken-link tail, manifesting as the + // `[8-100000]` stress test losing 1-227 items per run. + for (;;) { + boc_bq_node_t *taken = boc_bq_segment_take_one(&seg); + if (taken != NULL) { + bq_test_node_t *n = (bq_test_node_t *)taken; + PyObject *id = PyLong_FromLongLong((long long)n->id); + if (id == NULL || PyList_Append(list, id) < 0) { + Py_XDECREF(id); + Py_DECREF(list); + return NULL; + } + Py_DECREF(id); + continue; + } + // take_one returned NULL. Distinguish singleton from broken-link + // (case 1 is impossible; we guarded seg.start != NULL above and + // each take_one advances seg.start to a known-non-NULL node). + if (seg.end == &seg.start->next_in_queue) { + // Singleton tail — done. + break; + } + // Broken-link case: spin until the producer publishes. The wait + // is bounded by producer scheduling latency; under TSan or + // heavy oversubscription it could be milliseconds, but it is + // never unbounded — the producer is mid-call by construction. + // Drop the GIL across the spin so other Python threads (e.g. + // the other consumer in the stress test) can make progress. + Py_BEGIN_ALLOW_THREADS while ( + boc_atomic_load_ptr_explicit(&seg.start->next_in_queue, + BOC_MO_ACQUIRE) == NULL) { + // Compiler/CPU hint: tight spin on a single cacheline. No + // platform-specific PAUSE intrinsic here — the spin is short + // and the cost is dwarfed by GIL re-acquire. + } + Py_END_ALLOW_THREADS + // Producer's store is now visible; loop and let take_one walk it. + } + // Append the tail node (seg.start now points at it; its + // next_in_queue is NULL by segment-end invariant). + bq_test_node_t *tail = (bq_test_node_t *)seg.start; + PyObject *tail_id = PyLong_FromLongLong((long long)tail->id); + if (tail_id == NULL || PyList_Append(list, tail_id) < 0) { + Py_XDECREF(tail_id); + Py_DECREF(list); + return NULL; + } + Py_DECREF(tail_id); + return list; +} + +static PyObject *bq_is_empty(PyObject *Py_UNUSED(self), PyObject *args) { + PyObject *qcap; + if (!PyArg_ParseTuple(args, "O:bq_is_empty", &qcap)) { + return NULL; + } + boc_bq_t *q = bq_queue_from_capsule(qcap); + if (q == NULL) { + return NULL; + } + if (boc_bq_is_empty(q)) { + Py_RETURN_TRUE; + } + Py_RETURN_FALSE; +} + +// --------------------------------------------------------------------------- +// Method table and registrar +// --------------------------------------------------------------------------- + +static PyMethodDef bq_methods[] = { + {"bq_make_queue", bq_make_queue, METH_NOARGS, + "Create an empty MPMC behaviour queue. Returns a capsule."}, + {"bq_make_node", bq_make_node, METH_VARARGS, + "bq_make_node(id) -> capsule. Allocate a test node with the " + "given integer identity."}, + {"bq_node_id", bq_node_id, METH_VARARGS, + "bq_node_id(node) -> int. Read back the node's identity."}, + {"bq_node_ptr", bq_node_ptr, METH_VARARGS, + "bq_node_ptr(node) -> int. Raw boc_bq_node_t* as an integer " + "(for pointer-identity comparisons against scheduler_pop_fast)."}, + {"bq_enqueue", bq_enqueue, METH_VARARGS, + "bq_enqueue(q, node). Append a node to the queue."}, + {"bq_enqueue_front", bq_enqueue_front, METH_VARARGS, + "bq_enqueue_front(q, node). Push a node onto the front of the queue."}, + {"bq_dequeue", bq_dequeue, METH_VARARGS, + "bq_dequeue(q) -> id or None. Pop one node, returning its identity."}, + {"bq_dequeue_all", bq_dequeue_all, METH_VARARGS, + "bq_dequeue_all(q) -> list[int]. 
Pop every currently-enqueued " + "node in FIFO order."}, + {"bq_is_empty", bq_is_empty, METH_VARARGS, + "bq_is_empty(q) -> bool. True iff the queue is currently empty."}, + {NULL, NULL, 0, NULL}, +}; + +int boc_internal_test_register_bq(PyObject *module) { + for (PyMethodDef *def = bq_methods; def->ml_name != NULL; ++def) { + PyObject *fn = PyCFunction_New(def, NULL); + if (fn == NULL) { + return -1; + } + if (PyModule_AddObject(module, def->ml_name, fn) < 0) { + Py_DECREF(fn); + return -1; + } + } + return 0; +} diff --git a/src/bocpy/_internal_test_wsq.c b/src/bocpy/_internal_test_wsq.c new file mode 100644 index 0000000..e6576d4 --- /dev/null +++ b/src/bocpy/_internal_test_wsq.c @@ -0,0 +1,346 @@ +/// @file _internal_test_wsq.c +/// @brief WSQ-domain (work-stealing queue cursor + spread) tests for +/// `bocpy._internal_test`. +/// +/// Exposes the inline `boc_wsq_*` helpers from `sched.h` so +/// `test/test_internal_wsq.py` can verify the cursor-wrap arithmetic +/// and the `enqueue_spread` distribution invariant directly, without +/// going through the full scheduler runtime. +/// +/// Only `boc_wsq_pre_inc`, `boc_wsq_post_dec`, `boc_wsq_enqueue`, and +/// `boc_wsq_enqueue_spread` are exercised here; the dispatch / steal +/// integration is covered by the existing +/// `test_scheduler_steal.py` / `test_scheduler_integration.py` suites +/// once the wiring is live. +/// +/// Worker fixtures here are bare `boc_sched_worker_t` allocations +/// initialised by `boc_bq_init` per sub-queue and zeroed cursors — +/// the rest of the worker struct (mutex, cv, ring link) is unused +/// and remains zero. This is sound because the WSQ helpers touch +/// only `q[]` and the three cursors. + +#define PY_SSIZE_T_CLEAN + +#include + +#include +#include +#include +#include + +#include "compat.h" +#include "sched.h" + +// --------------------------------------------------------------------------- +// Worker fixture capsule +// --------------------------------------------------------------------------- + +#define WSQ_WORKER_CAPSULE_NAME "bocpy._internal_test.wsq_worker" +#define WSQ_NODE_CAPSULE_NAME "bocpy._internal_test.wsq_node" + +/// @brief Test node carrying an integer identity for FIFO checks. +typedef struct { + boc_bq_node_t node; + int64_t id; +} wsq_test_node_t; + +static void wsq_worker_capsule_destructor(PyObject *capsule) { + boc_sched_worker_t *w = (boc_sched_worker_t *)PyCapsule_GetPointer( + capsule, WSQ_WORKER_CAPSULE_NAME); + if (w == NULL) { + return; + } + // Drain every sub-queue so destroy_assert_empty does not abort if + // a test left items behind. We do NOT free the test nodes here — + // they are owned by the Python side via their own capsules. 
+ for (size_t i = 0; i < (size_t)BOC_WSQ_N; ++i) { + while (boc_bq_dequeue(&w->q[i]) != NULL) { + // discard + } + boc_bq_destroy_assert_empty(&w->q[i]); + } + PyMem_RawFree(w); +} + +static void wsq_node_capsule_destructor(PyObject *capsule) { + wsq_test_node_t *n = + (wsq_test_node_t *)PyCapsule_GetPointer(capsule, WSQ_NODE_CAPSULE_NAME); + if (n != NULL) { + PyMem_RawFree(n); + } +} + +static boc_sched_worker_t *wsq_worker_from_capsule(PyObject *capsule) { + return (boc_sched_worker_t *)PyCapsule_GetPointer(capsule, + WSQ_WORKER_CAPSULE_NAME); +} + +// --------------------------------------------------------------------------- +// Methods +// --------------------------------------------------------------------------- + +static PyObject *wsq_n(PyObject *Py_UNUSED(self), PyObject *Py_UNUSED(args)) { + return PyLong_FromSize_t((size_t)BOC_WSQ_N); +} + +static PyObject *wsq_make_worker(PyObject *Py_UNUSED(self), + PyObject *Py_UNUSED(args)) { + // Calloc so all unused worker fields (mutex, cv, ring link, stats, + // owner_interp_id, ...) are zero. The WSQ helpers only touch q[] + // and the three cursors, all of which we re-init explicitly. + boc_sched_worker_t *w = PyMem_RawCalloc(1, sizeof(boc_sched_worker_t)); + if (w == NULL) { + return PyErr_NoMemory(); + } + for (size_t i = 0; i < (size_t)BOC_WSQ_N; ++i) { + boc_bq_init(&w->q[i]); + } + w->enqueue_index.idx = 0; + w->dequeue_index.idx = 0; + w->steal_index.idx = 0; + PyObject *capsule = + PyCapsule_New(w, WSQ_WORKER_CAPSULE_NAME, wsq_worker_capsule_destructor); + if (capsule == NULL) { + for (size_t i = 0; i < (size_t)BOC_WSQ_N; ++i) { + boc_bq_destroy_assert_empty(&w->q[i]); + } + PyMem_RawFree(w); + return NULL; + } + return capsule; +} + +/// @brief Run @p k pre-increments on a fresh cursor and return the +/// per-index count as a list of length @c BOC_WSQ_N. +/// @details Pure cursor arithmetic; no worker / queue involvement. +/// Verifies @ref boc_wsq_pre_inc cycles indices uniformly. +static PyObject *wsq_pre_inc_histogram(PyObject *Py_UNUSED(self), + PyObject *args) { + Py_ssize_t k; + if (!PyArg_ParseTuple(args, "n:wsq_pre_inc_histogram", &k)) { + return NULL; + } + if (k < 0) { + PyErr_SetString(PyExc_ValueError, "k must be non-negative"); + return NULL; + } + size_t counts[BOC_WSQ_N]; + memset(counts, 0, sizeof(counts)); + boc_wsq_cursor_t c = {0}; + for (Py_ssize_t i = 0; i < k; ++i) { + size_t idx = boc_wsq_pre_inc(&c); + counts[idx] += 1u; + } + PyObject *out = PyList_New((Py_ssize_t)BOC_WSQ_N); + if (out == NULL) { + return NULL; + } + for (size_t i = 0; i < (size_t)BOC_WSQ_N; ++i) { + PyObject *v = PyLong_FromSize_t(counts[i]); + if (v == NULL) { + Py_DECREF(out); + return NULL; + } + PyList_SET_ITEM(out, (Py_ssize_t)i, v); + } + return out; +} + +/// @brief Run @p k post-decrements on a fresh cursor and return the +/// sequence of returned indices. 
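+/// @par Example (hypothetical, with @c t bound to the imported
+/// @c bocpy._internal_test module; assumes the cursor cycles all
+/// @c BOC_WSQ_N indices uniformly, mirroring @ref boc_wsq_pre_inc):
+/// @code{.py}
+///   seq = t.wsq_post_dec_sequence(2 * t.wsq_n())
+///   assert sorted(set(seq)) == list(range(t.wsq_n()))
+/// @endcode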
+static PyObject *wsq_post_dec_sequence(PyObject *Py_UNUSED(self), + PyObject *args) { + Py_ssize_t k; + if (!PyArg_ParseTuple(args, "n:wsq_post_dec_sequence", &k)) { + return NULL; + } + if (k < 0) { + PyErr_SetString(PyExc_ValueError, "k must be non-negative"); + return NULL; + } + PyObject *out = PyList_New(k); + if (out == NULL) { + return NULL; + } + boc_wsq_cursor_t c = {0}; + for (Py_ssize_t i = 0; i < k; ++i) { + size_t r = boc_wsq_post_dec(&c); + PyObject *v = PyLong_FromSize_t(r); + if (v == NULL) { + Py_DECREF(out); + return NULL; + } + PyList_SET_ITEM(out, i, v); + } + return out; +} + +/// @brief Push @p k freshly-allocated nodes via @ref boc_wsq_enqueue +/// on @p worker, then drain each sub-queue in order and return +/// a list of length @c BOC_WSQ_N giving the count per +/// sub-queue. +/// @details Verifies that single-node enqueues round-robin across +/// the N sub-queues. Each pushed node carries its push-order id; the +/// returned value is `[count[0], count[1], ..., count[N-1]]` so the +/// caller can assert uniformity. Drained nodes are freed. +static PyObject *wsq_enqueue_drain_counts(PyObject *Py_UNUSED(self), + PyObject *args) { + PyObject *worker_capsule; + Py_ssize_t k; + if (!PyArg_ParseTuple(args, "On:wsq_enqueue_drain_counts", &worker_capsule, + &k)) { + return NULL; + } + boc_sched_worker_t *w = wsq_worker_from_capsule(worker_capsule); + if (w == NULL) { + return NULL; + } + if (k < 0) { + PyErr_SetString(PyExc_ValueError, "k must be non-negative"); + return NULL; + } + for (Py_ssize_t i = 0; i < k; ++i) { + wsq_test_node_t *n = PyMem_RawCalloc(1, sizeof(*n)); + if (n == NULL) { + return PyErr_NoMemory(); + } + n->id = (int64_t)i; + boc_wsq_enqueue(w, &n->node); + } + PyObject *out = PyList_New((Py_ssize_t)BOC_WSQ_N); + if (out == NULL) { + return NULL; + } + for (size_t i = 0; i < (size_t)BOC_WSQ_N; ++i) { + size_t count = 0; + boc_bq_node_t *raw; + while ((raw = boc_bq_dequeue(&w->q[i])) != NULL) { + wsq_test_node_t *n = (wsq_test_node_t *)raw; + PyMem_RawFree(n); + count += 1u; + } + PyObject *v = PyLong_FromSize_t(count); + if (v == NULL) { + Py_DECREF(out); + return NULL; + } + PyList_SET_ITEM(out, (Py_ssize_t)i, v); + } + return out; +} + +/// @brief Build a length-@p L pre-linked segment (no queue +/// involved), call @ref boc_wsq_enqueue_spread on @p worker, +/// then drain each sub-queue and return per-sub-queue counts. +/// @details The segment is constructed by hand: nodes 0..L-1 with +/// `next_in_queue` pre-linked head-to-tail, and the segment's `end` +/// pointing at the tail node's `next_in_queue` slot. This mirrors +/// what `boc_bq_dequeue_all` would have produced for a freshly- +/// stolen victim queue. After spread, every node should have been +/// distributed across `worker`'s sub-queues; the returned count list +/// must sum to @p L. +static PyObject *wsq_spread_segment_counts(PyObject *Py_UNUSED(self), + PyObject *args) { + PyObject *worker_capsule; + Py_ssize_t length; + if (!PyArg_ParseTuple(args, "On:wsq_spread_segment_counts", &worker_capsule, + &length)) { + return NULL; + } + boc_sched_worker_t *w = wsq_worker_from_capsule(worker_capsule); + if (w == NULL) { + return NULL; + } + if (length <= 0) { + PyErr_SetString(PyExc_ValueError, "length must be positive"); + return NULL; + } + // Allocate L nodes and link them head-to-tail. The link payload + // stored in `next_in_queue` is `boc_bq_node_t *`; we use plain + // stores via the typed atomic helper to construct the segment. 
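+  //
+  // Resulting shape for length == 3:
+  //
+  //   seg.start -> n0 -> n1 -> n2 -> NULL
+  //   seg.end   == &n2->node.next_in_queue
+  //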
+ wsq_test_node_t **nodes = PyMem_RawCalloc((size_t)length, sizeof(*nodes)); + if (nodes == NULL) { + return PyErr_NoMemory(); + } + for (Py_ssize_t i = 0; i < length; ++i) { + nodes[i] = PyMem_RawCalloc(1, sizeof(wsq_test_node_t)); + if (nodes[i] == NULL) { + for (Py_ssize_t j = 0; j < i; ++j) { + PyMem_RawFree(nodes[j]); + } + PyMem_RawFree(nodes); + return PyErr_NoMemory(); + } + nodes[i]->id = (int64_t)i; + } + // Link 0->1->...->L-1; tail's next stays NULL. Relaxed stores + // are fine — the segment is private to this thread until we hand + // it to enqueue_spread, which uses the queue's release/acquire + // protocol on its own. + for (Py_ssize_t i = 0; i < length - 1; ++i) { + boc_atomic_store_ptr_explicit(&nodes[i]->node.next_in_queue, + &nodes[i + 1]->node, BOC_MO_RELAXED); + } + boc_atomic_store_ptr_explicit(&nodes[length - 1]->node.next_in_queue, NULL, + BOC_MO_RELAXED); + boc_bq_segment_t seg; + seg.start = &nodes[0]->node; + seg.end = &nodes[length - 1]->node.next_in_queue; + PyMem_RawFree(nodes); + + boc_wsq_enqueue_spread(w, seg); + + PyObject *out = PyList_New((Py_ssize_t)BOC_WSQ_N); + if (out == NULL) { + return NULL; + } + for (size_t i = 0; i < (size_t)BOC_WSQ_N; ++i) { + size_t count = 0; + boc_bq_node_t *raw; + while ((raw = boc_bq_dequeue(&w->q[i])) != NULL) { + wsq_test_node_t *n = (wsq_test_node_t *)raw; + PyMem_RawFree(n); + count += 1u; + } + PyObject *v = PyLong_FromSize_t(count); + if (v == NULL) { + Py_DECREF(out); + return NULL; + } + PyList_SET_ITEM(out, (Py_ssize_t)i, v); + } + return out; +} + +// --------------------------------------------------------------------------- +// Registrar +// --------------------------------------------------------------------------- + +static PyMethodDef wsq_methods[] = { + {"wsq_n", wsq_n, METH_NOARGS, + "Return the compile-time BOC_WSQ_N constant."}, + {"wsq_make_worker", wsq_make_worker, METH_NOARGS, + "Allocate and initialise a fresh boc_sched_worker_t fixture; " + "returns a capsule. 
The fixture's mutex/cv/ring fields are zero " + "(unused by WSQ helpers)."}, + {"wsq_pre_inc_histogram", wsq_pre_inc_histogram, METH_VARARGS, + "Run k pre-increments on a fresh cursor; return a length-N list of " + "per-index counts."}, + {"wsq_post_dec_sequence", wsq_post_dec_sequence, METH_VARARGS, + "Run k post-decrements on a fresh cursor; return the sequence of " + "returned indices as a list of length k."}, + {"wsq_enqueue_drain_counts", wsq_enqueue_drain_counts, METH_VARARGS, + "Push k nodes via boc_wsq_enqueue, drain every sub-queue, return " + "per-sub-queue counts."}, + {"wsq_spread_segment_counts", wsq_spread_segment_counts, METH_VARARGS, + "Build a length-L pre-linked segment, call boc_wsq_enqueue_spread, " + "drain every sub-queue, return per-sub-queue counts."}, + {NULL, NULL, 0, NULL}, +}; + +int boc_internal_test_register_wsq(PyObject *module) { + if (PyModule_AddFunctions(module, wsq_methods) < 0) { + return -1; + } + return 0; +} diff --git a/src/bocpy/_math.c b/src/bocpy/_math.c index b8607be..058b105 100644 --- a/src/bocpy/_math.c +++ b/src/bocpy/_math.c @@ -7,108 +7,11 @@ #include #include -#if PY_VERSION_HEX >= 0x030D0000 -#define Py_BUILD_CORE -#include -#endif - -#ifdef _WIN32 -#define WIN32_LEAN_AND_MEAN -#include -typedef volatile int_least64_t atomic_int_least64_t; - -int_least64_t atomic_fetch_add(atomic_int_least64_t *ptr, int_least64_t value) { - return InterlockedExchangeAdd64(ptr, value); -} - -bool atomic_compare_exchange_strong(atomic_int_least64_t *ptr, - atomic_int_least64_t *expected, - int_least64_t desired) { - int_least64_t prev; - prev = InterlockedCompareExchange64(ptr, desired, *expected); - if (prev == *expected) { - return true; - } - - *expected = prev; - return false; -} +#include "compat.h" +#include "xidata.h" -int_least64_t atomic_load(atomic_int_least64_t *ptr) { return *ptr; } - -int_least64_t atomic_exchange(atomic_int_least64_t *ptr, int_least64_t value) { - return InterlockedExchange64(ptr, value); -} - -void atomic_store(atomic_int_least64_t *ptr, int_least64_t value) { - *ptr = value; -} - -#define thread_local __declspec(thread) - -#else +#ifndef _WIN32 #include -#include -#endif - -#if defined __APPLE__ -#define thrd_sleep nanosleep -#define thread_local _Thread_local -#elif defined _WIN32 -#else -#include -#endif - -#if PY_VERSION_HEX >= 0x030E0000 // 3.14 - -#define XIDATA_INIT _PyXIData_Init -#define XIDATA_REGISTERCLASS(type, cb) \ - _PyXIData_RegisterClass(PyThreadState_GET(), (type), \ - (_PyXIData_getdata_t){.basic = (cb)}) -#define XIDATA_T _PyXIData_t - -#elif PY_VERSION_HEX >= 0x030D0000 // 3.13 - -#define XIDATA_INIT _PyCrossInterpreterData_Init -#define XIDATA_REGISTERCLASS(type, cb) \ - _PyCrossInterpreterData_RegisterClass((type), (crossinterpdatafunc)(cb)) -#define XIDATA_T _PyCrossInterpreterData - -#elif PY_VERSION_HEX >= 0x030C0000 // 3.12 - -#define XIDATA_INIT _PyCrossInterpreterData_Init -#define XIDATA_REGISTERCLASS(type, cb) \ - _PyCrossInterpreterData_RegisterClass((type), (crossinterpdatafunc)(cb)) -#define XIDATA_T _PyCrossInterpreterData - -#else - -#define BOC_NO_MULTIGIL - -#define XIDATA_REGISTERCLASS(type, cb) \ - _PyCrossInterpreterData_RegisterClass((type), (crossinterpdatafunc)(cb)) -#define XIDATA_T _PyCrossInterpreterData - -static void xidata_init(XIDATA_T *data, PyInterpreterState *interp, - void *shared, PyObject *obj, - PyObject *(*new_object)(_PyCrossInterpreterData *)) { - assert(data->data == NULL); - assert(data->obj == NULL); - *data = (_PyCrossInterpreterData){0}; - data->interp = -1; - 
- assert(data != NULL); - assert(new_object != NULL); - data->data = shared; - if (obj != NULL) { - assert(interp != NULL); - data->obj = Py_NewRef(obj); - } - data->interp = (interp != NULL) ? PyInterpreterState_GetID(interp) : -1; - data->new_object = new_object; -} -#define XIDATA_INIT xidata_init - #endif /// @brief Convenience method to obtain the interpreter ID diff --git a/src/bocpy/behaviors.py b/src/bocpy/behaviors.py index f28b4b9..16d91a6 100644 --- a/src/bocpy/behaviors.py +++ b/src/bocpy/behaviors.py @@ -14,9 +14,7 @@ import inspect import logging import os -import shutil import sys -import tempfile from textwrap import dedent import threading import time @@ -71,7 +69,7 @@ class Cown(Generic[T]): def __init__(self, value: T): """Create a cown.""" - logging.debug(f"initialising Cown with value: {value}") + logging.debug("initialising Cown with value: %r", value) if isinstance(value, _core.CownCapsule): self.impl = value else: @@ -153,20 +151,14 @@ def __repr__(self) -> str: class Behaviors: """Coordinator that starts workers and schedules behaviors.""" - def __init__(self, num_workers: Optional[int], export_dir: Optional[str]): + def __init__(self, num_workers: Optional[int]): """Creates a new Behaviors runtime. :param num_workers: The number of worker interpreters to start. If None, defaults to the number of available cores minus one. :type num_workers: Optional[int] - :param export_dir: The directory to which the target module will be - exported for worker import. If None, a temporary directory will - be created and removed on shutdown. - :type export_dir: Optional[str] """ self.num_workers = WORKER_COUNT if num_workers is None else num_workers - self.export_dir = export_dir - self.export_tmp = export_dir is None self.worker_script = None self.classes = set() self.worker_threads = [] @@ -179,14 +171,35 @@ def __init__(self, num_workers: Optional[int], export_dir: Optional[str]): self.noticeboard = None self._noticeboard_start_error: Optional[BaseException] = None # Set to True by stop() once worker shutdown, noticeboard - # tear-down, and tempdir cleanup have all completed. The - # warned-stop / drain-error raise from stop() happens *after* - # this flips, so wait()/__exit__ can use the flag to - # distinguish "stop() raised but the runtime is dead -- clear - # the global handle" from "stop() raised mid-teardown and the - # runtime is still alive -- retain the handle so the caller - # can retry stop()". + # tear-down, and the C-level noticeboard slot release have + # all completed. The warned-stop / drain-error raise from + # stop() happens *after* this flips, so wait()/__exit__ can + # use the flag to distinguish "stop() raised but the runtime + # is dead -- clear the global handle" from "stop() raised + # mid-teardown and the runtime is still alive -- retain the + # handle so the caller can retry stop()". self._teardown_complete = False + # Populated by stop_workers() with any release_all() failures + # observed during the per-task-queue orphan drain. stop() + # consumes the list and clears it; on a clean stop this stays + # empty. + self._stop_drain_errors: list[BaseException] = [] + # Set True when stop_workers() has run to completion (whether + # from the clean path or the noticeboard-timeout branch). A + # subsequent stop() retry must NOT re-invoke stop_workers -- + # the worker pool is gone and `_core.scheduler_request_stop_all` + # would block forever waiting for shutdown replies that never + # come. 
The retry path skips straight to the noticeboard + # cleanup that the prior attempt could not complete. + self._workers_stopped = False + # Per-worker scheduler_stats() snapshot captured at the moment + # workers have replied "shutdown" but BEFORE + # `_core.scheduler_runtime_stop()` frees the per-worker array. + # Surfaced to the caller via `wait(stats=True)`. ``None`` means + # no snapshot was captured (e.g. start_workers failed before any + # worker registered, or stop_workers raised before reaching the + # capture point). + self._final_stats: Optional[list[dict]] = None self.final_cowns: tuple[Cown, ...] = () self.bid = 0 @@ -250,13 +263,13 @@ def stop_workers(self): for name in list(frame.f_globals): val = frame.f_globals[name] if isinstance(val, Cown) or isinstance(val, _core.CownCapsule): - self.logger.debug(f"acquiring {name}") + self.logger.debug("acquiring %s", name) val.acquire() for name in list(frame.f_locals): val = frame.f_locals[name] if isinstance(val, Cown) or isinstance(val, _core.CownCapsule): - self.logger.debug(f"acquiring {name}") + self.logger.debug("acquiring %s", name) val.acquire() frame = frame.f_back @@ -265,17 +278,78 @@ def stop_workers(self): cown.acquire() self.logger.debug("stopping workers") - for _ in range(self.num_workers): - _core.send("boc_worker", "shutdown") - - for _ in range(self.num_workers): - _, contents = _core.receive("boc_behavior") - assert contents == "shutdown" + # Single C-level fan-out: flips stop_requested on every + # worker and signals each cv. Each worker observes the + # flag inside scheduler_worker_pop, exits its do_work loop, + # and sends "shutdown" back on boc_behavior. + # + # Once `scheduler_request_stop_all()` has been called the + # worker pool is committed to shutting down: re-entering this + # function on a retry would issue a second fan-out and then + # block forever in `receive("boc_behavior")` waiting for + # shutdown replies from workers that have already replied (or + # exited). Wrap everything past the fan-out in try/finally + # that pins `_workers_stopped = True` so any exception from + # the handshake, teardown, drain, or runtime_stop still + # routes a subsequent stop() down the retry-only branch. + # + # The retry-only branch in `stop()` does NOT itself call + # `scheduler_runtime_stop`, so we must guarantee it runs here + # even when the handshake / teardown / drain above raised -- + # otherwise the per-worker `WORKERS` array leaks until the + # next `start()`. The C-side stop is idempotent (covered by + # `test_scheduler_runtime_stop_is_idempotent`), so running it + # unconditionally inside `finally` is safe. + _core.scheduler_request_stop_all() + try: + for _ in range(self.num_workers): + _, contents = _core.receive("boc_behavior") + assert contents == "shutdown" - for _ in range(self.num_workers): - _core.send("boc_cleanup", True) + for _ in range(self.num_workers): + _core.send("boc_cleanup", True) - self.teardown_workers() + self.teardown_workers() + # Drain any behaviours that were dispatched but never + # consumed (warned path of stop(), or any race where a + # late behaviour landed in a per-task queue between + # request_stop_all and the worker's pop_slow returning + # NULL). MUST run BEFORE scheduler_runtime_stop, which + # frees the worker array and the per-task queues with it. + # release_all on a drained behaviour may dispatch its + # successor; loop until the queues stay empty. 
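+            #
+            # Teardown order, condensed:
+            #   request_stop_all -> "shutdown" handshake
+            #   -> boc_cleanup fan-out -> teardown_workers()
+            #   -> orphan drain (next line) -> stats snapshot
+            #   -> scheduler_runtime_stop() (the last two in the
+            #   finally below).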
+ self._stop_drain_errors = self._drain_orphan_behaviors() + finally: + try: + # Snapshot the per-worker scheduler counters before + # the per-worker array is freed. Workers have already + # replied "shutdown" and exited their do_work loops, + # so their counters are stable. Surfaced to the + # caller via `wait(stats=True)`. Best-effort: any + # failure here must not block teardown. + try: + self._final_stats = _core.scheduler_stats() + except Exception as snap_ex: + self.logger.warning( + "stop_workers(): failed to snapshot scheduler_stats: %r", + snap_ex, + ) + self._final_stats = None + # Free the per-worker scheduler array now that no + # worker thread can observe it. Paired with the + # `scheduler_runtime_start` call in `start()`. Run + # inside the outer `finally` so the WORKERS array is + # reclaimed even when an earlier step raised -- + # without this the retry-only branch in `stop()` + # would never reach this call site. + _core.scheduler_runtime_stop() + finally: + # Mark workers as stopped so a retried stop() (after + # the noticeboard-timeout branch raises, or after a + # failure anywhere in the handshake/teardown/drain + # above) does not try to shut down a worker pool that + # is already gone. + self._workers_stopped = True self.logger.debug("workers stopped") def start_noticeboard(self): @@ -416,97 +490,171 @@ def start(self, module: Optional[tuple[str, str]] = None): export = export_module_from_file(module[1]) module_name = f"{module[0]}" - if self.export_dir is None: - self.export_dir = tempfile.mkdtemp() - self.export_tmp = True + # Defence in depth: the transpiler emits identifier-shaped + # names, but `module_name` is interpolated into worker + # bootstrap source -- reject anything that is not a valid + # dotted Python module path at the boundary so a hostile or + # malformed name cannot reach the `repr()`-protected + # interpolation below. Dotted names (``pkg.sub.mod``) are + # accepted because users may invoke bocpy from a + # package-qualified module; each dotted component must + # itself be a valid identifier. ``__main__`` falls through + # naturally because ``"__main__".isidentifier()`` is True + # and ``"__main__".split(".") == ["__main__"]``. + if not all(part.isidentifier() for part in module_name.split(".")): + raise ValueError( + f"module_name must be a dotted Python module path; " + f"got {module_name!r}" + ) self.behavior_lookup = export.behaviors - path = os.path.join(self.export_dir, f"{module_name}.py") - with open(path, "w", encoding="utf-8") as file: - file.write(export.code) + + # Embed the transpiled source as a Python string literal + # (via ``repr()``) into the worker bootstrap. Each worker + # compiles and exec's the literal into a fresh + # ``types.ModuleType``; no file is written to disk. The + # synthetic filename ```` is registered with + # ``linecache`` so tracebacks still surface the transpiled + # source line. Every interpolated occurrence of the module + # name uses ``repr(module_name)`` so quote / backslash / + # non-ASCII content cannot break out of the string literal + # (the prior path interpolated ``module_name`` raw via + # f-string into ``r"..."``). 
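+        #
+        # The load pattern, reduced to a standalone sketch with
+        # hypothetical names (the real bootstrap below also handles
+        # the __main__ aliasing):
+        #
+        #   import linecache, sys, types
+        #   src = "X = 1\n"
+        #   mod = types.ModuleType("demo")
+        #   fname = mod.__file__ = "<demo>"
+        #   linecache.cache[fname] = (
+        #       len(src), None, src.splitlines(keepends=True), fname)
+        #   exec(compile(src, fname, "exec"), mod.__dict__)
+        #   sys.modules["demo"] = mod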
+ src_literal = repr(export.code) + bocmain_alias = "__bocmain__" if module_name == "__main__" else module_name + sysmod_key = repr(bocmain_alias) + linecache_key = repr(f"") main_start = worker_script.find(WORKER_MAIN_END) + bootstrap = [ + "import linecache", + "import types", + f"_bocpy_src = {src_literal}", + f"_bocpy_mod = types.ModuleType({sysmod_key})", + f"_bocpy_mod.__file__ = {linecache_key}", + ( + "linecache.cache[" + f"{linecache_key}" + "] = (len(_bocpy_src), None, " + "_bocpy_src.splitlines(keepends=True), " + f"{linecache_key})" + ), + ( + "exec(compile(_bocpy_src, " + f"{linecache_key}, 'exec'), _bocpy_mod.__dict__)" + ), + f"sys.modules[{sysmod_key}] = _bocpy_mod", + "boc_export = _bocpy_mod", + ] + if module_name == "__main__": - lines = [f'load_boc_module("__bocmain__", r"{path}")', 'boc_export = sys.modules["__bocmain__"]'] sys.modules["__bocmain__"] = sys.modules["__main__"] for cls in export.classes: - lines.append(f'\n\nclass {cls}(sys.modules["__bocmain__"].{cls}):') - lines.append(" pass") - else: - lines = [f'load_boc_module("{module_name}", r"{path}")', f'boc_export = sys.modules["{module_name}"]'] - - lines.append("") - - self.worker_script = worker_script[:main_start] + "\n".join(lines) + worker_script[main_start:] - - set_tags(["boc_behavior", "boc_worker", "boc_cleanup", "boc_noticeboard"]) - # Bring up workers and the noticeboard thread first. We seed - # the C-level terminator only after both succeed so a failure - # in start_noticeboard (or anywhere between here and the - # terminator_reset below) leaves the terminator in its - # post-stop() quiescent state (count=0, seeded=0) and the - # next start() can proceed cleanly without a drift diagnostic - # firing. On a partial-startup failure we also tear the - # workers back down so the subsequent start() is not blocked - # by stale shutdown handshakes or dangling sub-interpreters. - self.start_workers() + bootstrap.append(f'\n\nclass {cls}(sys.modules["__bocmain__"].{cls}):') + bootstrap.append(" pass") + + bootstrap.append("") + + self.worker_script = ( + worker_script[:main_start] + + "\n".join(bootstrap) + + worker_script[main_start:] + ) + + set_tags(["boc_behavior", "boc_cleanup", "boc_noticeboard"]) + # Allocate the per-worker scheduler array before spawning any + # workers so each worker's first action (registering its slot) + # has a non-empty WORKERS array to claim from. Mirrored by + # `_core.scheduler_runtime_stop()` in `stop_workers()` after + # the workers are joined, and by every abort path below so + # the C-side WORKERS array is reclaimed and the next + # `start()` does not observe stale per-task queues. + _core.scheduler_runtime_start(self.num_workers) try: - self.start_noticeboard() + # Bring up workers and the noticeboard thread first. We seed + # the C-level terminator only after both succeed so a failure + # in start_noticeboard (or anywhere between here and the + # terminator_reset below) leaves the terminator in its + # post-stop() quiescent state (count=0, seeded=0) and the + # next start() can proceed cleanly without a drift diagnostic + # firing. On a partial-startup failure we also tear the + # workers back down so the subsequent start() is not blocked + # by stale shutdown handshakes or dangling sub-interpreters. 
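+            #
+            # Startup order, condensed (the runtime_start call sits
+            # just above this try):
+            #   scheduler_runtime_start -> start_workers
+            #   -> start_noticeboard -> terminator_reset (arm)
+            # with each failure path unwinding whatever came before it.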
+ self.start_workers() + try: + self.start_noticeboard() + except BaseException: + # Close the terminator first so any sibling thread that + # somehow races a whencall during the abort window is + # refused at terminator_inc rather than slipping a real + # behavior into a per-task queue between our scheduler + # stop request and the worker shutdown handshake. + # TERMINATOR_CLOSED is 0 on the very first start() of + # the process and 1 after any prior stop()/abort; + # either way, set it to 1 explicitly. terminator_close() + # is idempotent. + _core.terminator_close() + self._abort_workers() + raise + + # Arm the C-level terminator (count=1 seed, closed=0, seeded=1). + # reset() returns the prior (count, seeded) so we can detect a + # previous run that died without reaching its reconciliation + # point (KeyboardInterrupt, stop() that raised, etc.). We refuse + # to start on drift rather than silently clobbering whatever + # state was left behind -- the previous run is still leaking + # behaviors or cowns and starting fresh would mask the bug. + prior_count, prior_seeded = _core.terminator_reset() + if prior_count != 0 or prior_seeded != 0: + # We just armed the terminator (count=1, seeded=1, closed=0). + # Close it FIRST so any sibling thread that races a + # whencall during the abort window is refused before + # touching the half-shut-down pool. Then drop our own + # seed via terminator_seed_dec so the next start() sees + # (count=0, seeded=0) instead of re-firing the same + # drift diagnostic forever. Finally tear down workers + # and the noticeboard so the next start() can re-spawn + # without colliding with the orphans. + _core.terminator_close() + _core.terminator_seed_dec() + self._abort_noticeboard() + self._abort_workers() + raise RuntimeError( + "terminator drift carried over from a previous run " + f"(prior_count={prior_count}, prior_seeded={prior_seeded}). " + "This indicates a leaked whencall, a stop() that raised " + "before reconciliation, or an interrupted teardown. " + "Resolve the earlier failure before starting again." + ) except BaseException: - # Close the terminator first so any sibling thread that - # somehow races a whencall during the abort window is - # refused at terminator_inc rather than slipping a real - # behavior into boc_worker between our shutdown sentinels. - # TERMINATOR_CLOSED is 0 on the very first start() of the - # process and 1 after any prior stop()/abort; either way, - # set it to 1 explicitly. terminator_close() is idempotent. - _core.terminator_close() - self._abort_workers() + # Defence in depth: if any abort path above failed to call + # `_core.scheduler_runtime_stop` (or if `start_workers` + # raised before reaching the inner try), free the C-side + # WORKERS array here. `scheduler_runtime_stop` is + # idempotent — calling it twice on a successful abort is + # a no-op on the second call. + try: + _core.scheduler_runtime_stop() + except Exception as ex: + self.logger.exception(ex) + # Drop the __bocmain__ alias if we installed one, so a + # follow-up start() observes a clean sys.modules. Same + # rationale as in the successful stop() path. + sys.modules.pop("__bocmain__", None) raise - # Arm the C-level terminator (count=1 seed, closed=0, seeded=1). - # reset() returns the prior (count, seeded) so we can detect a - # previous run that died without reaching its reconciliation - # point (KeyboardInterrupt, stop() that raised, etc.). 
We refuse - # to start on drift rather than silently clobbering whatever - # state was left behind -- the previous run is still leaking - # behaviors or cowns and starting fresh would mask the bug. - prior_count, prior_seeded = _core.terminator_reset() - if prior_count != 0 or prior_seeded != 0: - # We just armed the terminator (count=1, seeded=1, closed=0). - # Close it FIRST so any sibling thread that races a - # whencall during the abort window is refused before - # touching the half-shut-down pool. Then drop our own - # seed via terminator_seed_dec so the next start() sees - # (count=0, seeded=0) instead of re-firing the same - # drift diagnostic forever. Finally tear down workers - # and the noticeboard so the next start() can re-spawn - # without colliding with the orphans. - _core.terminator_close() - _core.terminator_seed_dec() - self._abort_noticeboard() - self._abort_workers() - raise RuntimeError( - "terminator drift carried over from a previous run " - f"(prior_count={prior_count}, prior_seeded={prior_seeded}). " - "This indicates a leaked whencall, a stop() that raised " - "before reconciliation, or an interrupted teardown. " - "Resolve the earlier failure before starting again." - ) - def _abort_workers(self): """Tear down the worker pool after a partial-startup failure. - Sends the same ``("boc_worker", "shutdown")`` / cleanup + Issues the same ``scheduler_request_stop_all`` + cleanup handshake as :py:meth:`stop_workers` but without the cown round-up, which is unsafe before the runtime is fully alive. Used only on the error path of :py:meth:`start`; on the normal path :py:meth:`stop_workers` performs the equivalent work. """ self.logger.debug("aborting workers after failed startup") - for _ in range(self.num_workers): - _core.send("boc_worker", "shutdown") + _core.scheduler_request_stop_all() for _ in range(self.num_workers): try: _, contents = _core.receive("boc_behavior") @@ -544,12 +692,26 @@ def stop(self, timeout: Optional[float] = None): """Quiesce all behaviors and tear the runtime down. :param timeout: Upper bound on the **quiescence** and - **noticeboard-drain** phases (steps 1, 2, and 4 below). The - worker shutdown handshake (step 5), orphan-behavior drain, - and tempdir cleanup that follow run to completion regardless; - ``timeout`` does not bound total ``stop()`` runtime. ``None`` - means wait forever for quiescence. + **noticeboard-drain** phases (steps 1, 2, and 4 below). + The worker shutdown handshake (step 5) and orphan-behavior + drain that follow run to completion regardless; + ``timeout`` does not bound total ``stop()`` runtime. + ``None`` means wait forever for quiescence. Values above + ``1e9`` seconds (~31.7 years) are clamped to wait-forever + to avoid platform ``time_t`` / ``DWORD`` overflow inside + the underlying condition-variable wait. :type timeout: Optional[float] + :raises RuntimeError: If the noticeboard thread does not exit + before the timeout (or, on a retry call, is still alive). + The first failure carries the message prefix + ``"noticeboard thread did not shut down within timeout=..."``; + subsequent retry failures carry + ``"noticeboard thread still pinned on retry ..."``. + Workers and the orphan-behavior drain have already + completed by the time either is raised, so the runtime + is intentionally left re-drivable: callers may retry + ``stop()`` / ``wait()`` once the in-flight noticeboard + mutation finishes. With no central scheduler thread, ``stop()`` drives the C terminator directly. 
The sequence is: @@ -564,12 +726,17 @@ def stop(self, timeout: Optional[float] = None): 4. Tear down the noticeboard thread (it must have drained any in-flight messages from the last behaviors before the single-writer slot is released). - 5. Stop workers and clean up the export tempdir. + 5. Stop workers and release the C-level noticeboard slot. After ``terminator_wait`` returns we assert ``terminator_count == 0 and terminator_seeded == 0``; any non-zero value indicates a bookkeeping bug (a missed decrement, or a scheduling-after- wait that slipped past ``terminator_close``). + + The retry path is internally gated on ``_workers_stopped`` so + the worker pool is not torn down twice; a second ``stop()`` + after a noticeboard-timeout abort retries only the + noticeboard drain. """ # Take down the seed and wait for quiescence. Both # are idempotent so a second stop() / wait() is a no-op. @@ -588,70 +755,123 @@ def _remaining(): return None return max(0.0, deadline - time.monotonic()) - _core.terminator_seed_dec() - _core.terminator_wait(_remaining()) - - # Post-wait reconciliation. If wait() timed out the count is - # still > 0 -- skip the assertion in that case so a partial - # teardown does not mask the underlying timeout. - c_count = _core.terminator_count() - c_seeded = _core.terminator_seeded() - quiesced = (c_count == 0 and c_seeded == 0) - # Close the terminator unconditionally before any further drain - # work. On the clean path this is the documented refusal point; - # on the warned path it MUST happen before _drain_orphan_behaviors - # so a late whencall caller cannot slip a fresh BehaviorCapsule - # into boc_worker between the drain's last receive() and the - # cleanup that follows. terminator_close() is idempotent. - _core.terminator_close() - if not quiesced: - self.logger.warning( - "stop(): terminator did not reach quiescence " - f"(count={c_count}, seeded={c_seeded}). " - "This typically means stop() was invoked with a timeout " - "that elapsed while behaviors were still in flight." - ) + # Idempotent retry: if a prior stop() reached the + # noticeboard-timeout branch, it already drove the + # terminator to quiescence and shut the workers down. + # Re-running ``stop_workers`` would block forever in + # ``scheduler_request_stop_all`` waiting for shutdown + # replies from a worker pool that is gone. Skip straight + # to the noticeboard cleanup the prior attempt could not + # complete. + if not self._workers_stopped: + _core.terminator_seed_dec() + _core.terminator_wait(_remaining()) + + # Post-wait reconciliation. If wait() timed out the count is + # still > 0 -- skip the assertion in that case so a partial + # teardown does not mask the underlying timeout. + c_count = _core.terminator_count() + c_seeded = _core.terminator_seeded() + quiesced = (c_count == 0 and c_seeded == 0) + # Close the terminator unconditionally before any further drain + # work. On the clean path this is the documented refusal point; + # on the warned path it MUST happen before stop_workers's + # orphan drain so a late whencall caller cannot slip a fresh + # behavior into a per-task queue between the drain pass and + # scheduler_runtime_stop. terminator_close() is idempotent. + _core.terminator_close() + if not quiesced: + self.logger.warning( + "stop(): terminator did not reach quiescence " + f"(count={c_count}, seeded={c_seeded}). " + "This typically means stop() was invoked with a timeout " + "that elapsed while behaviors were still in flight." + ) - # Drain the noticeboard thread. 
- _core.send("boc_noticeboard", "shutdown") - self.noticeboard.join(_remaining()) - if self.noticeboard.is_alive(): - # join() timed out. Do not proceed to stop_workers / cleanup: - # the noticeboard thread still owns the single-writer slot - # and may be holding NB_MUTEX while processing an in-flight - # mutation. Tearing workers down under it would be racy. - raise RuntimeError( - "stop(): noticeboard thread did not shut down within " - f"timeout={timeout!r}. The runtime is left running so " - "the leak can be diagnosed; a later stop() call may " - "succeed once the in-flight mutation completes." - ) - # Shut workers down and reset noticeboard ownership. - self.stop_workers() - # Defensive drain: if stop() entered the "terminator did not - # quiesce" branch above (or any late whencall slipped in - # between terminator_close and the worker shutdown messages), - # behaviors may still sit in boc_worker with their MCS links - # pinned. Release them inline so we do not leak cowns on a - # warned-only stop, and drop the terminator holds the whencall - # callers took. With a clean stop this is a no-op. - drain_errors = self._drain_orphan_behaviors() + # Drain the noticeboard thread. + _core.send("boc_noticeboard", "shutdown") + self.noticeboard.join(_remaining()) + if self.noticeboard.is_alive(): + # join() timed out. The noticeboard thread still owns the + # single-writer slot and may be holding NB_MUTEX while + # processing an in-flight mutation. We do not call + # `clear_noticeboard_thread` / `noticeboard_clear` (those + # would race with the live thread), but we MUST still drain + # orphan behaviors so the C-side terminator_count returns + # to 0 — otherwise a caller-supplied finite timeout that + # fires here permanently strands every behavior currently + # parked in a per-task queue. Worker shutdown itself does + # not touch NB_MUTEX, so it is safe under a wedged + # noticeboard thread. + try: + self.stop_workers() + except Exception as drain_ex: + # Surface drain failures via logging; the outer + # RuntimeError below remains the primary failure + # signal because the noticeboard timeout is what got + # us into this branch. + self.logger.exception(drain_ex) + # Reset the drain errors list so a subsequent stop() does + # not double-report; the drain has already happened. + self._stop_drain_errors = [] + raise RuntimeError( + "stop(): noticeboard thread did not shut down within " + f"timeout={timeout!r}. Workers were shut down and " + "orphan behaviors drained, but the noticeboard slot " + "is still pinned; a later stop() call may complete " + "the cleanup once the in-flight mutation finishes." + ) + # Shut workers down and reset noticeboard ownership. + # stop_workers() now owns the orphan-drain (must happen before + # the per-task queues are freed); it stashes any release_all + # exceptions on `self._stop_drain_errors` for stop() to re-raise. + self.stop_workers() + drain_errors = self._stop_drain_errors + self._stop_drain_errors = [] + else: + # Retry path: workers are already gone. Re-attempt the + # noticeboard drain that timed out previously. ``join()`` + # without a timeout waits forever -- by this point the + # in-flight noticeboard fn must have finished or the + # caller is no closer to making progress than they were + # before. We surface the join via a remaining-budget + # join so a caller-supplied timeout still bounds the + # retry. The ``is_alive()`` check below is best-effort: + # if the thread has already exited it skips the + # redundant sentinel send. 
There is a residual TOCTOU + # window (alive at check, exits before the send lands) + # in which a stale sentinel can linger in the + # ``boc_noticeboard`` queue, but correctness rests on + # ``Behaviors.start_runtime`` calling ``set_tags(["...", + # "boc_noticeboard"])`` on the next ``start()``, which + # clears the queue per the public ``set_tags`` contract. + # The guard reduces the frequency of the stale-sentinel + # case but is not itself the correctness fence. + if self.noticeboard.is_alive(): + _core.send("boc_noticeboard", "shutdown") + self.noticeboard.join(_remaining()) + if self.noticeboard.is_alive(): + # Still pinned. Re-raise the same diagnostic so the + # caller can keep retrying. ``_workers_stopped`` is + # unchanged so a subsequent retry stays on this path. + raise RuntimeError( + "stop(): noticeboard thread still pinned on retry " + f"(timeout={timeout!r}). The in-flight mutation " + "has not finished; retry once it has." + ) + drain_errors = [] _core.clear_noticeboard_thread() _core.noticeboard_clear() # Teardown is complete: workers are joined, the noticeboard - # thread has exited, and the C-level slot is released. The - # tempdir cleanup that follows is bookkeeping; if it raises - # the runtime is still gone and wait()/__exit__ should null - # the global BEHAVIORS handle so the next @when starts fresh - # rather than retrying stop() on a dead instance. + # thread has exited, and the C-level slot is released. + # The transpiled module is exec'd in-memory in each worker, + # so there is no on-disk artifact to clean up. self._teardown_complete = True - if os.path.exists(self.export_dir) and self.export_tmp: - try: - shutil.rmtree(self.export_dir) - except Exception as ex: - # An orphan tempdir is annoying but not fatal: log and - # continue so the caller observes a normal stop(). - self.logger.exception(ex) + # Drop the __bocmain__ alias we installed in start() so a + # subsequent bocpy.start() observes a clean sys.modules + # (and so the main module isn't pinned in sys.modules under + # an alias after the runtime has shut down). + sys.modules.pop("__bocmain__", None) if drain_errors: # Surface the first failure so the caller sees the leak at # the failure site rather than later as a mysterious @@ -664,18 +884,31 @@ def _remaining(): ) from drain_errors[0] def _drain_orphan_behaviors(self): - """Release any BehaviorCapsules left on ``boc_worker`` post-shutdown. - - Called after :py:meth:`stop_workers`. Each orphan has had its - cowns scheduled (MCS links established) but never acquired by - a worker. ``release_all`` walks the MCS queues, hands off to any - waiting successors, and frees the request array; ``terminator_dec`` - drops the hold the ``whencall`` caller took before - ``behavior_schedule``. The result Cown of each dropped behavior - is *not* mutated here: it has already been released (owner - ``NO_OWNER``, ``value`` is ``NULL``, ``xidata`` is set), and - writing into ``value`` would put it in a state ``cown_acquire`` - cannot recover from on a subsequent runtime restart. + """Release any BehaviorCapsules left in per-worker queues post-shutdown. + + Called from :py:meth:`stop_workers` after the worker threads + have joined but BEFORE :py:func:`_core.scheduler_runtime_stop` + frees the per-worker queues. Each orphan has had its cowns + scheduled (MCS links established) but never acquired by a + worker. 
``release_all`` walks the MCS queues, hands off to any + waiting successors, and frees the request array; + ``terminator_dec`` drops the hold the ``whencall`` caller took + before ``behavior_schedule``. + + Before ``release_all`` runs, ``set_drop_exception`` marks the + result Cown with a :class:`RuntimeError` so a caller awaiting + ``cown.value`` / ``cown.exception`` after :py:meth:`stop` sees + a diagnostic instead of a permanent ``None``. Mirrors the + worker exception path (:py:func:`worker.run_behavior`): + ``acquire`` → ``set_exception`` → ``release``, condensed into + one C call (`_core.c::BehaviorCapsule_set_drop_exception`). + + ``release_all`` may dispatch a successor into the per-task + queues (the off-worker arm of ``boc_sched_dispatch`` runs + because the calling thread is the main thread, not a worker). + That successor will not be consumed -- workers are gone -- + so the loop drains again until + ``scheduler_drain_all_queues`` returns an empty list. :returns: A list of exceptions captured from ``release_all`` failures, or ``[]`` on a clean @@ -684,29 +917,59 @@ def _drain_orphan_behaviors(self): mysterious deadlock on the affected cowns. """ errors = [] + # KeyboardInterrupt / SystemExit raised mid-drain must not + # abort the drain partway -- the orphaned behaviors would + # leak their MCS chains and terminator holds, so the next + # start() would diagnose terminator drift forever. Capture + # them, finish the drain, and re-raise the first after the + # loop returns clean. + deferred_base_exc = None while True: - msg = _core.receive("boc_worker", timeout=0) - if msg[0] == _core.TIMEOUT: + capsules = _core.scheduler_drain_all_queues() + if not capsules: + if deferred_base_exc is not None: + raise deferred_base_exc return errors - payload = msg[1] - if isinstance(payload, _core.BehaviorCapsule): + for payload in capsules: self.logger.warning( "behavior dropped during stop(); the runtime was " "torn down before this behavior could acquire its cowns" ) + # Surface the drop to anyone awaiting the result Cown. + # Best-effort: failures here only degrade UX (the user + # sees None instead of a diagnostic), so log and + # continue with release_all so MCS chains still + # unwind. + try: + payload.set_drop_exception(RuntimeError( + "behavior dropped during stop(); the runtime " + "was torn down before this behavior could " + "acquire its cowns" + )) + except Exception as ex: + self.logger.exception(ex) + except (KeyboardInterrupt, SystemExit) as ex: + self.logger.exception(ex) + if deferred_base_exc is None: + deferred_base_exc = ex try: payload.release_all() except Exception as ex: self.logger.exception(ex) errors.append(ex) + except (KeyboardInterrupt, SystemExit) as ex: + self.logger.exception(ex) + errors.append(ex) + if deferred_base_exc is None: + deferred_base_exc = ex try: _core.terminator_dec() except Exception as ex: self.logger.exception(ex) - # Non-capsule payloads (e.g. a stray "shutdown") are silently - # ignored. Worker shutdowns balance 1:1 with workers, so a - # stray sentinel here would already indicate a bug elsewhere; - # the loop body just falls through to the next receive(). 
+ except (KeyboardInterrupt, SystemExit) as ex: + self.logger.exception(ex) + if deferred_base_exc is None: + deferred_base_exc = ex def __exit__(self, exc_type, exc_value, traceback): """Ensure stop is called on context exit.""" @@ -737,7 +1000,10 @@ def whencall(thunk: str, args: list[Union[Cown, list[Cown]]], captures: list[Any group_id += 1 behavior = _core.BehaviorCapsule(thunk, result.impl, cowns, captures) - logging.debug(f"whencall:behavior=Behavior(thunk={thunk}, result={result}, args={args}, captures={captures})") + logging.debug( + "whencall:behavior=Behavior(thunk=%s, result=%r, args=%r, captures=%r)", + thunk, result, args, captures, + ) # Caller threads run the entire 2PL inline. Register with the # C terminator first so a concurrent stop()/terminator_close() will # refuse the schedule rather than racing teardown. Once the @@ -764,7 +1030,6 @@ def get_caller_module(): def start(worker_count: Optional[int] = None, - export_dir: Optional[str] = None, module: Optional[tuple[str, str]] = None): """Start the behavior runtime: worker pool plus noticeboard thread. @@ -774,10 +1039,6 @@ def start(worker_count: Optional[int] = None, :param worker_count: The number of worker interpreters to start. If None, defaults to the number of available cores minus one. :type worker_count: Optional[int] - :param export_dir: The directory to which the target module will be - exported for worker import. If None, a temporary directory will - be created and removed on shutdown. - :type export_dir: Optional[str] :param module: A tuple of the target module name and file path to export for worker import. If None, the caller's module will be used. :type module: Optional[tuple[str, str]] @@ -794,7 +1055,7 @@ def start(worker_count: Optional[int] = None, if module is None: module = get_caller_module() - BEHAVIORS = Behaviors(worker_count, export_dir) + BEHAVIORS = Behaviors(worker_count) try: BEHAVIORS.start(module) except BaseException: @@ -834,7 +1095,7 @@ def when_factory(func): print(BEHAVIORS.behavior_lookup) return None - logging.debug(f"when:behavior={binfo}") + logging.debug("when:behavior=%s", binfo) captures = [] for name in binfo.captures: frame = when_frame @@ -866,24 +1127,43 @@ def when_factory(func): return when_factory -def wait(timeout: Optional[float] = None): - """Block until all behaviors complete, with optional timeout.""" +def wait(timeout: Optional[float] = None, *, stats: bool = False): + """Block until all behaviors complete, with optional timeout. + + When ``stats=True``, returns the per-worker + :func:`_core.scheduler_stats` snapshot captured at shutdown + (after all behaviors have run, before the per-worker array is + freed). When ``stats=False`` (the default), returns ``None``. + Returns ``[]`` if the runtime was never started or the snapshot + could not be captured. + """ global BEHAVIORS if BEHAVIORS: # Clear BEHAVIORS only if stop() drove the runtime all the # way through teardown (workers joined, noticeboard exited, - # tempdir removed). On stop()'s noticeboard-join-timeout path - # the runtime is intentionally left running so the caller can - # diagnose the leak and retry; nulling the global handle - # there would strand the live workers / noticeboard thread - # with no Python-side reference. + # C-level noticeboard slot released). 
On stop()'s
+        # noticeboard-join-timeout path the runtime is intentionally
+        # left running so the caller can diagnose the leak and
+        # retry; nulling the global handle there would strand the
+        # live workers / noticeboard thread with no Python-side
+        # reference.
         try:
             BEHAVIORS.stop(timeout)
         except BaseException:
             if BEHAVIORS._teardown_complete:
+                snapshot = BEHAVIORS._final_stats
                 BEHAVIORS = None
+                if stats:
+                    return snapshot if snapshot is not None else []
             raise
+        snapshot = BEHAVIORS._final_stats
         BEHAVIORS = None
+        if stats:
+            return snapshot if snapshot is not None else []
+        return None
+    if stats:
+        return []
+    return None
 
 
 def _validate_noticeboard_key(key: str) -> None:
diff --git a/src/bocpy/compat.c b/src/bocpy/compat.c
new file mode 100644
index 0000000..92971ba
--- /dev/null
+++ b/src/bocpy/compat.c
@@ -0,0 +1,103 @@
+/// @file compat.c
+/// @brief Out-of-line definitions for the cross-platform shims declared in
+/// `compat.h`.
+///
+/// On POSIX the C11 `<stdatomic.h>` machinery is fully header-only, so this
+/// translation unit is essentially empty there. On MSVC the `atomic_*`
+/// functions on `int_least64_t` are kept as out-of-line definitions
+/// (linked into `_core.o` and `_math.o` from `compat.o`).
+
+#include "compat.h"
+
+#ifdef _WIN32
+
+int_least64_t atomic_fetch_add(atomic_int_least64_t *ptr, int_least64_t value) {
+  return InterlockedExchangeAdd64(ptr, value);
+}
+
+int_least64_t atomic_fetch_sub(atomic_int_least64_t *ptr, int_least64_t value) {
+  return InterlockedExchangeAdd64(ptr, -value);
+}
+
+bool atomic_compare_exchange_strong(atomic_int_least64_t *ptr,
+                                    atomic_int_least64_t *expected,
+                                    int_least64_t desired) {
+  int_least64_t prev;
+  prev = InterlockedCompareExchange64(ptr, desired, *expected);
+  if (prev == *expected) {
+    return true;
+  }
+
+  *expected = prev;
+  return false;
+}
+
+int_least64_t atomic_load(atomic_int_least64_t *ptr) { return *ptr; }
+
+int_least64_t atomic_exchange(atomic_int_least64_t *ptr, int_least64_t value) {
+  return InterlockedExchange64(ptr, value);
+}
+
+void atomic_store(atomic_int_least64_t *ptr, int_least64_t value) {
+  *ptr = value;
+}
+
+void thrd_sleep(const struct timespec *duration, struct timespec *remaining) {
+  (void)remaining; // Sleep() cannot be interrupted; never written.
+  const DWORD NS_PER_MS = 1000000;
+  DWORD ms = (DWORD)duration->tv_sec * 1000;
+  ms += (DWORD)duration->tv_nsec / NS_PER_MS;
+  Sleep(ms);
+}
+
+#endif // _WIN32
+
+double boc_now_s(void) {
+  const double S_PER_NS = 1.0e-9;
+  struct timespec ts;
+  // Prefer clock_gettime on POSIX: timespec_get requires macOS 10.15+ while
+  // Python's default macOS deployment target is older, producing an
+  // -Wunguarded-availability-new warning. clock_gettime has been available on
+  // macOS since 10.12. Windows UCRT provides timespec_get but not
+  // clock_gettime, so fall back there.
+#ifdef _WIN32
+  timespec_get(&ts, TIME_UTC);
+#else
+  clock_gettime(CLOCK_REALTIME, &ts);
+#endif
+  double time = (double)ts.tv_sec;
+  time += ts.tv_nsec * S_PER_NS;
+  return time;
+}
+
+uint64_t boc_now_ns(void) {
+#ifdef _WIN32
+  // QueryPerformanceCounter is monotonic and high-resolution on every
+  // Windows version we target; the frequency is queried once and
+  // cached because it is constant for the lifetime of the system.
+  static LARGE_INTEGER freq = {0};
+  if (freq.QuadPart == 0) {
+    QueryPerformanceFrequency(&freq);
+  }
+  LARGE_INTEGER counter;
+  QueryPerformanceCounter(&counter);
+  // Convert ticks -> ns without overflow on a 64-bit counter for any
+  // realistic frequency (<= 10 GHz): split into seconds + remainder.
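+  // Editor's note, illustrative magnitudes (the 10 MHz figure is an
+  // assumption about the typical QPC frequency, not taken from this
+  // patch): a naive `counter * 1000000000 / freq` overflows uint64_t
+  // once `counter` exceeds ~1.8e10 ticks — roughly half an hour of
+  // uptime at 10 MHz — whereas the sec/rem split below keeps every
+  // intermediate product under 2^64 for any uptime at any frequency
+  // up to the 10 GHz bound noted above.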
+  uint64_t sec = (uint64_t)counter.QuadPart / (uint64_t)freq.QuadPart;
+  uint64_t rem = (uint64_t)counter.QuadPart % (uint64_t)freq.QuadPart;
+  return sec * 1000000000ULL + (rem * 1000000000ULL) / (uint64_t)freq.QuadPart;
+#else
+  struct timespec ts;
+  clock_gettime(CLOCK_MONOTONIC, &ts);
+  return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
+#endif
+}
+
+void boc_sleep_ns(uint64_t ns) {
+  if (ns == 0) {
+    return;
+  }
+  struct timespec duration;
+  duration.tv_sec = (time_t)(ns / 1000000000ULL);
+  duration.tv_nsec = (long)(ns % 1000000000ULL);
+  thrd_sleep(&duration, NULL);
+}
diff --git a/src/bocpy/compat.h b/src/bocpy/compat.h
new file mode 100644
index 0000000..e0c3da6
--- /dev/null
+++ b/src/bocpy/compat.h
@@ -0,0 +1,935 @@
+/// @file compat.h
+/// @brief Cross-platform portability shims for bocpy C extensions.
+///
+/// Centralises the platform-specific atomic, mutex, condition-variable,
+/// thread-local, sleep, and monotonic-time primitives used by `_core.c`,
+/// `_math.c`, and `sched.c`.
+///
+/// **Linkage:** all heavy-weight platform primitives are exposed as
+/// `static inline` wrappers around the platform's native API, except for
+/// the MSVC `atomic_*` functions on `int_least64_t` (kept as out-of-line
+/// definitions in `compat.c` to preserve their original symbol shape).
+///
+/// Also exposes the `boc_atomic_*_explicit` typed atomics API that the
+/// work-stealing scheduler depends on for ARM64-correct memory ordering
+/// on Windows.
+///
+/// **File layout.** All platform-specific machinery is grouped behind a
+/// single top-level `#ifdef _WIN32 / #elif __APPLE__ / #else` ladder:
+///
+/// 1. Cross-platform headers and the C11 alignas/alignof shim.
+/// 2. Memory-order tags (`BOC_MO_*`) used by both arms of the typed
+///    atomics API below.
+/// 3. **Windows arm** — Win32 headers, `atomic_*` polyfill on
+///    `int_least64_t` / `intptr_t`, BOC mutex/cond on `SRWLOCK` and
+///    `CONDITION_VARIABLE`, the typed `boc_atomic_*_explicit` API
+///    with x86/x64/ARM64 dispatch, `boc_yield`, and `thread_local`.
+/// 4. **Apple arm** — `pthread`-based BOC mutex/cond with
+///    `<stdatomic.h>` typed atomics; `nanosleep` aliased to
+///    `thrd_sleep`.
+/// 5. **Other POSIX (Linux) arm** — C11 `<threads.h>`-based BOC
+///    mutex/cond with `<stdatomic.h>` typed atomics.
+/// 6. Cross-platform monotonic time / sleep helpers
+///    (`boc_now_s`, `boc_now_ns`, `boc_sleep_ns`).
+/// 7. Cross-platform timeout-validation helper.
+
+#ifndef BOCPY_COMPAT_H
+#define BOCPY_COMPAT_H
+
+#define PY_SSIZE_T_CLEAN
+
+#include <Python.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <time.h>
+
+// ---------------------------------------------------------------------------
+// Cross-platform alignas / alignof shim
+// ---------------------------------------------------------------------------
+//
+// Portable C11-style alignment macros. MSVC's `<stdalign.h>` only
+// defines `alignas` / `alignof` when the compiler is invoked in C11
+// mode (`/std:c11` or later); the Python build does not pass that
+// flag, so we map directly to the underlying `__declspec(align(...))`
+// / `__alignof` intrinsics on MSVC and fall back to `<stdalign.h>`
+// elsewhere. C++ TUs always get the standard header.
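+//
+// Illustrative usage (editor's sketch; the struct and field names are
+// assumed, not taken from this patch) — the same member declaration
+// compiles under all three arms of the shim:
+//
+//     typedef struct {
+//       alignas(64) uint64_t head; // cache-line-aligned everywhere
+//     } aligned_example_t;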
+#if defined(__cplusplus)
+#include <stdalign.h>
+#elif defined(_MSC_VER)
+#if _MSC_VER >= 1900
+#ifndef alignas
+#define alignas(x) __declspec(align(x))
+#endif
+#ifndef alignof
+#define alignof(x) __alignof(x)
+#endif
+#else
+#error "MSVC >= 1900 required for alignas/alignof support"
+#endif
+#else
+#include <stdalign.h>
+#endif
+
+// ---------------------------------------------------------------------------
+// Memory-order tags
+// ---------------------------------------------------------------------------
+//
+// Used by the typed `boc_atomic_*_explicit` API below. Defined here
+// (above the platform fork) because both arms reference these tags.
+// Distinct integer constants so the MSVC dispatch can `switch` on
+// them; on POSIX they are mapped to `memory_order_*` by the
+// `boc_mo_to_std` helper inside the POSIX arm. Skip 1 to leave room
+// for `consume`.
+
+typedef enum {
+  BOC_MO_RELAXED = 0,
+  BOC_MO_ACQUIRE = 2,
+  BOC_MO_RELEASE = 3,
+  BOC_MO_ACQ_REL = 4,
+  BOC_MO_SEQ_CST = 5,
+} boc_memory_order_t;
+
+// ===========================================================================
+// Platform fork: Windows / Apple / other POSIX (Linux).
+// ===========================================================================
+
+#ifdef _WIN32
+
+// ---------------------------------------------------------------------------
+// Windows: headers, thread_local, yield
+// ---------------------------------------------------------------------------
+
+#define WIN32_LEAN_AND_MEAN
+#include <windows.h>
+#include <intrin.h>
+
+#define thread_local __declspec(thread)
+#define boc_yield() SwitchToThread()
+
+// ---------------------------------------------------------------------------
+// Windows: legacy `atomic_*` polyfill on int_least64_t / intptr_t
+// ---------------------------------------------------------------------------
+
+typedef volatile int_least64_t atomic_int_least64_t;
+typedef volatile intptr_t atomic_intptr_t;
+
+int_least64_t atomic_fetch_add(atomic_int_least64_t *ptr, int_least64_t value);
+int_least64_t atomic_fetch_sub(atomic_int_least64_t *ptr, int_least64_t value);
+bool atomic_compare_exchange_strong(atomic_int_least64_t *ptr,
+                                    atomic_int_least64_t *expected,
+                                    int_least64_t desired);
+int_least64_t atomic_load(atomic_int_least64_t *ptr);
+int_least64_t atomic_exchange(atomic_int_least64_t *ptr, int_least64_t value);
+void atomic_store(atomic_int_least64_t *ptr, int_least64_t value);
+
+// ----- atomic_intptr_t siblings ---------------------------------------------
+// The MSVC polyfill defines `atomic_intptr_t` and `atomic_int_least64_t` as
+// distinct typedefs; the plain `atomic_load` / `atomic_store` / etc. above
+// only accept `atomic_int_least64_t *`. Without these siblings, code that
+// touches an `atomic_intptr_t` field (e.g. BOCRequest::next, BOCCown::last,
+// BOCRecycleQueue::head, BOCQueue::tag, NB_NOTICEBOARD_TID) would silently
+// pass a mistyped pointer to the int64 polyfill on Windows. On POSIX C11 the
+// same names are aliased to the generic atomic_* macros (which already
+// dispatch on type via _Generic), so user code below is platform-uniform.
+//
+// All Interlocked*Pointer intrinsics on x86/x64 are full barriers; the
+// pointer-width matches `intptr_t` on both Win32 and Win64 (CPython itself
+// requires a sane intptr_t == void* relationship).
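+//
+// Illustrative retry loop (editor's sketch; `slot`, `desired`, and
+// `compute_next` are assumed names) showing the contract every CAS in
+// this header shares: on failure `*expected` is rewritten to the
+// observed value, so the loop re-reads for free:
+//
+//     intptr_t expected = atomic_load_intptr(&slot);
+//     intptr_t desired;
+//     do {
+//       desired = compute_next(expected); // hypothetical helper
+//     } while (!atomic_compare_exchange_strong_intptr(&slot, &expected,
+//                                                     desired));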
+static inline intptr_t atomic_load_intptr(atomic_intptr_t *ptr) { return *ptr; } + +static inline void atomic_store_intptr(atomic_intptr_t *ptr, intptr_t value) { + *ptr = value; +} + +static inline intptr_t atomic_exchange_intptr(atomic_intptr_t *ptr, + intptr_t value) { + return (intptr_t)InterlockedExchangePointer((PVOID volatile *)ptr, + (PVOID)value); +} + +static inline bool atomic_compare_exchange_strong_intptr(atomic_intptr_t *ptr, + intptr_t *expected, + intptr_t desired) { + intptr_t prev = (intptr_t)InterlockedCompareExchangePointer( + (PVOID volatile *)ptr, (PVOID)desired, (PVOID)*expected); + if (prev == *expected) { + return true; + } + *expected = prev; + return false; +} + +// All Interlocked* intrinsics on x86/x64 are full barriers, so the +// memory_order argument is accepted but ignored. +// Note: atomic_load_explicit is a plain volatile read. On x86/x64 this +// provides acquire semantics due to TSO. Correctness of the parking +// protocol relies on the mutex-protected re-check, not on seq_cst ordering. +#define atomic_load_explicit(ptr, order) (*(ptr)) +#define atomic_fetch_add_explicit(ptr, val, order) \ + InterlockedExchangeAdd64((ptr), (val)) +#define atomic_fetch_sub_explicit(ptr, val, order) \ + InterlockedExchangeAdd64((ptr), -(val)) +#define memory_order_seq_cst 0 + +// --------------------------------------------------------------------------- +// Windows: BOCMutex / BOCCond on SRWLOCK + CONDITION_VARIABLE +// --------------------------------------------------------------------------- + +typedef SRWLOCK BOCMutex; +typedef CONDITION_VARIABLE BOCCond; + +static inline void boc_mtx_init(BOCMutex *m) { InitializeSRWLock(m); } + +static inline void mtx_destroy(BOCMutex *m) { (void)m; } + +static inline void mtx_lock(BOCMutex *m) { AcquireSRWLockExclusive(m); } + +static inline void mtx_unlock(BOCMutex *m) { ReleaseSRWLockExclusive(m); } + +static inline void cnd_init(BOCCond *c) { InitializeConditionVariable(c); } + +static inline void cnd_destroy(BOCCond *c) { (void)c; } + +static inline void cnd_signal(BOCCond *c) { WakeConditionVariable(c); } + +static inline void cnd_broadcast(BOCCond *c) { WakeAllConditionVariable(c); } + +static inline void cnd_wait(BOCCond *c, BOCMutex *m) { + SleepConditionVariableSRW(c, m, INFINITE, 0); +} + +/// @brief Wait on a condition variable for at most @p seconds. +/// @param c The condition variable +/// @param m The mutex (must be held by caller) +/// @return true if signalled (or spurious wake), false if the timeout expired +static inline bool cnd_timedwait_s(BOCCond *c, BOCMutex *m, double seconds) { + // Negated form catches NaN (every comparison with NaN is false), + // which a bare `seconds < 0` test does not. Defence in depth + // for the public boundary helper `boc_validate_finite_timeout`. + if (!(seconds >= 0.0)) + seconds = 0.0; + DWORD ms = (DWORD)(seconds * 1000.0); + BOOL ok = SleepConditionVariableSRW(c, m, ms, 0); + if (!ok && GetLastError() == ERROR_TIMEOUT) { + return false; + } + return true; +} + +void thrd_sleep(const struct timespec *duration, struct timespec *remaining); + +// --------------------------------------------------------------------------- +// Windows: typed `boc_atomic_*_explicit` storage typedefs +// --------------------------------------------------------------------------- +// +// `volatile T` storage with distinct typedefs per width so the +// dispatch picks the right Interlocked* family. 
Note these are ordinary
+// `volatile`, NOT C11 `_Atomic` — MSVC's `_Atomic` is gated behind
+// `<stdatomic.h>` (VS 2022 17.5+) which is above bocpy's VS 2019 floor.
+
+typedef volatile uint64_t boc_atomic_u64_t;
+typedef volatile uint32_t boc_atomic_u32_t;
+typedef volatile uint8_t boc_atomic_bool_t; // sizeof(bool) == 1
+typedef void *volatile boc_atomic_ptr_t;
+
+// ---------------------------------------------------------------------------
+// Windows: typed `boc_atomic_*_explicit` implementations
+// ---------------------------------------------------------------------------
+//
+// Switch on order, dispatch to Interlocked*. On x86/x64 every
+// Interlocked* intrinsic is a full barrier, so all orderings collapse
+// to the unsuffixed form (which is correct for any requested
+// ordering). On ARM64 we pick the matching `_acq`/`_rel`/`_nf`
+// variant. `BOC_MO_ACQ_REL` and `BOC_MO_SEQ_CST` use the unsuffixed
+// (full barrier) form on every target.
+
+#if defined(_M_ARM64)
+#define BOC_IL_LOAD64_ACQ(p)                                                   \
+  ((uint64_t)__ldar64((unsigned __int64 const volatile *)(p)))
+#define BOC_IL_LOAD32_ACQ(p)                                                   \
+  ((uint32_t)__ldar32((unsigned __int32 const volatile *)(p)))
+#define BOC_IL_LOAD8_ACQ(p)                                                    \
+  ((uint8_t)__ldar8((unsigned __int8 const volatile *)(p)))
+#define BOC_IL_STORE64_REL(p, v)                                               \
+  __stlr64((unsigned __int64 volatile *)(p), (unsigned __int64)(v))
+#define BOC_IL_STORE32_REL(p, v)                                               \
+  __stlr32((unsigned __int32 volatile *)(p), (unsigned __int32)(v))
+#define BOC_IL_STORE8_REL(p, v)                                                \
+  __stlr8((unsigned __int8 volatile *)(p), (unsigned __int8)(v))
+#endif
+
+// ---- u64 -------------------------------------------------------------------
+
+static inline uint64_t boc_atomic_load_u64_explicit(boc_atomic_u64_t *p,
+                                                    boc_memory_order_t order) {
+#if defined(_M_ARM64)
+  switch (order) {
+  case BOC_MO_RELAXED:
+    return *p;
+  case BOC_MO_ACQUIRE:
+    return BOC_IL_LOAD64_ACQ(p);
+  default:
+    return BOC_IL_LOAD64_ACQ(p);
+  }
+#else
+  (void)order;
+  return *p;
+#endif
+}
+
+static inline void boc_atomic_store_u64_explicit(boc_atomic_u64_t *p,
+                                                 uint64_t v,
+                                                 boc_memory_order_t order) {
+#if defined(_M_ARM64)
+  switch (order) {
+  case BOC_MO_RELAXED:
+    *p = v;
+    return;
+  case BOC_MO_RELEASE:
+    BOC_IL_STORE64_REL(p, v);
+    return;
+  default:
+    (void)_InterlockedExchange64((volatile __int64 *)p, (__int64)v);
+    return;
+  }
+#else
+  (void)order;
+  *p = v;
+#endif
+}
+
+static inline uint64_t
+boc_atomic_exchange_u64_explicit(boc_atomic_u64_t *p, uint64_t v,
+                                 boc_memory_order_t order) {
+#if defined(_M_ARM64)
+  switch (order) {
+  case BOC_MO_RELAXED:
+    return (uint64_t)_InterlockedExchange64_nf((volatile __int64 *)p,
+                                               (__int64)v);
+  case BOC_MO_ACQUIRE:
+    return (uint64_t)_InterlockedExchange64_acq((volatile __int64 *)p,
+                                                (__int64)v);
+  case BOC_MO_RELEASE:
+    return (uint64_t)_InterlockedExchange64_rel((volatile __int64 *)p,
+                                                (__int64)v);
+  default:
+    return (uint64_t)_InterlockedExchange64((volatile __int64 *)p, (__int64)v);
+  }
+#else
+  (void)order;
+  return (uint64_t)_InterlockedExchange64((volatile __int64 *)p, (__int64)v);
+#endif
+}
+
+static inline bool boc_atomic_compare_exchange_strong_u64_explicit(
+    boc_atomic_u64_t *p, uint64_t *expected, uint64_t desired,
+    boc_memory_order_t succ, boc_memory_order_t fail) {
+  (void)fail;
+  uint64_t exp = *expected;
+  uint64_t prev;
+#if defined(_M_ARM64)
+  switch (succ) {
+  case BOC_MO_RELAXED:
+    prev = (uint64_t)_InterlockedCompareExchange64_nf(
+        (volatile __int64 *)p, (__int64)desired, (__int64)exp);
+    break;
+  case BOC_MO_ACQUIRE:
+    prev =
(uint64_t)_InterlockedCompareExchange64_acq( + (volatile __int64 *)p, (__int64)desired, (__int64)exp); + break; + case BOC_MO_RELEASE: + prev = (uint64_t)_InterlockedCompareExchange64_rel( + (volatile __int64 *)p, (__int64)desired, (__int64)exp); + break; + default: + prev = (uint64_t)_InterlockedCompareExchange64( + (volatile __int64 *)p, (__int64)desired, (__int64)exp); + break; + } +#else + (void)succ; + prev = (uint64_t)_InterlockedCompareExchange64( + (volatile __int64 *)p, (__int64)desired, (__int64)exp); +#endif + if (prev == exp) + return true; + *expected = prev; + return false; +} + +static inline uint64_t +boc_atomic_fetch_add_u64_explicit(boc_atomic_u64_t *p, uint64_t v, + boc_memory_order_t order) { +#if defined(_M_ARM64) + switch (order) { + case BOC_MO_RELAXED: + return (uint64_t)_InterlockedExchangeAdd64_nf((volatile __int64 *)p, + (__int64)v); + case BOC_MO_ACQUIRE: + return (uint64_t)_InterlockedExchangeAdd64_acq((volatile __int64 *)p, + (__int64)v); + case BOC_MO_RELEASE: + return (uint64_t)_InterlockedExchangeAdd64_rel((volatile __int64 *)p, + (__int64)v); + default: + return (uint64_t)_InterlockedExchangeAdd64((volatile __int64 *)p, + (__int64)v); + } +#else + (void)order; + return (uint64_t)_InterlockedExchangeAdd64((volatile __int64 *)p, (__int64)v); +#endif +} + +static inline uint64_t +boc_atomic_fetch_sub_u64_explicit(boc_atomic_u64_t *p, uint64_t v, + boc_memory_order_t order) { + return boc_atomic_fetch_add_u64_explicit(p, (uint64_t)(-(int64_t)v), order); +} + +// ---- u32 ------------------------------------------------------------------- + +static inline uint32_t boc_atomic_load_u32_explicit(boc_atomic_u32_t *p, + boc_memory_order_t order) { +#if defined(_M_ARM64) + switch (order) { + case BOC_MO_RELAXED: + return *p; + case BOC_MO_ACQUIRE: + return BOC_IL_LOAD32_ACQ(p); + default: + return BOC_IL_LOAD32_ACQ(p); + } +#else + (void)order; + return *p; +#endif +} + +static inline void boc_atomic_store_u32_explicit(boc_atomic_u32_t *p, + uint32_t v, + boc_memory_order_t order) { +#if defined(_M_ARM64) + switch (order) { + case BOC_MO_RELAXED: + *p = v; + return; + case BOC_MO_RELEASE: + BOC_IL_STORE32_REL(p, v); + return; + default: + (void)_InterlockedExchange((volatile long *)p, (long)v); + return; + } +#else + (void)order; + *p = v; +#endif +} + +static inline uint32_t +boc_atomic_exchange_u32_explicit(boc_atomic_u32_t *p, uint32_t v, + boc_memory_order_t order) { +#if defined(_M_ARM64) + switch (order) { + case BOC_MO_RELAXED: + return (uint32_t)_InterlockedExchange_nf((volatile long *)p, (long)v); + case BOC_MO_ACQUIRE: + return (uint32_t)_InterlockedExchange_acq((volatile long *)p, (long)v); + case BOC_MO_RELEASE: + return (uint32_t)_InterlockedExchange_rel((volatile long *)p, (long)v); + default: + return (uint32_t)_InterlockedExchange((volatile long *)p, (long)v); + } +#else + (void)order; + return (uint32_t)_InterlockedExchange((volatile long *)p, (long)v); +#endif +} + +static inline bool boc_atomic_compare_exchange_strong_u32_explicit( + boc_atomic_u32_t *p, uint32_t *expected, uint32_t desired, + boc_memory_order_t succ, boc_memory_order_t fail) { + (void)fail; + uint32_t exp = *expected; + uint32_t prev; +#if defined(_M_ARM64) + switch (succ) { + case BOC_MO_RELAXED: + prev = (uint32_t)_InterlockedCompareExchange_nf((volatile long *)p, + (long)desired, (long)exp); + break; + case BOC_MO_ACQUIRE: + prev = (uint32_t)_InterlockedCompareExchange_acq((volatile long *)p, + (long)desired, (long)exp); + break; + case BOC_MO_RELEASE: + prev = 
(uint32_t)_InterlockedCompareExchange_rel((volatile long *)p, + (long)desired, (long)exp); + break; + default: + prev = (uint32_t)_InterlockedCompareExchange((volatile long *)p, + (long)desired, (long)exp); + break; + } +#else + (void)succ; + prev = (uint32_t)_InterlockedCompareExchange((volatile long *)p, + (long)desired, (long)exp); +#endif + if (prev == exp) + return true; + *expected = prev; + return false; +} + +static inline uint32_t +boc_atomic_fetch_add_u32_explicit(boc_atomic_u32_t *p, uint32_t v, + boc_memory_order_t order) { +#if defined(_M_ARM64) + switch (order) { + case BOC_MO_RELAXED: + return (uint32_t)_InterlockedExchangeAdd_nf((volatile long *)p, (long)v); + case BOC_MO_ACQUIRE: + return (uint32_t)_InterlockedExchangeAdd_acq((volatile long *)p, (long)v); + case BOC_MO_RELEASE: + return (uint32_t)_InterlockedExchangeAdd_rel((volatile long *)p, (long)v); + default: + return (uint32_t)_InterlockedExchangeAdd((volatile long *)p, (long)v); + } +#else + (void)order; + return (uint32_t)_InterlockedExchangeAdd((volatile long *)p, (long)v); +#endif +} + +static inline uint32_t +boc_atomic_fetch_sub_u32_explicit(boc_atomic_u32_t *p, uint32_t v, + boc_memory_order_t order) { + return boc_atomic_fetch_add_u32_explicit(p, (uint32_t)(-(int32_t)v), order); +} + +// ---- bool (uint8_t storage) ------------------------------------------------ +// MSVC has no Interlocked*8 with order suffixes pre-VS-2022; we use the +// unsuffixed Interlocked*8 (full barrier) for exchange/cas, which satisfies +// any requested ordering. Plain volatile load/store on a 1-byte slot is +// atomic on every supported MSVC target (ARM64 included; the architecture +// guarantees aligned single-byte access atomicity). + +static inline bool boc_atomic_load_bool_explicit(boc_atomic_bool_t *p, + boc_memory_order_t order) { +#if defined(_M_ARM64) + switch (order) { + case BOC_MO_RELAXED: + return (bool)*p; + case BOC_MO_ACQUIRE: + return (bool)BOC_IL_LOAD8_ACQ(p); + default: + return (bool)BOC_IL_LOAD8_ACQ(p); + } +#else + (void)order; + return (bool)*p; +#endif +} + +static inline void boc_atomic_store_bool_explicit(boc_atomic_bool_t *p, bool v, + boc_memory_order_t order) { +#if defined(_M_ARM64) + switch (order) { + case BOC_MO_RELAXED: + *p = (uint8_t)v; + return; + case BOC_MO_RELEASE: + BOC_IL_STORE8_REL(p, (uint8_t)v); + return; + default: + (void)_InterlockedExchange8((volatile char *)p, (char)v); + return; + } +#else + (void)order; + *p = (uint8_t)v; +#endif +} + +static inline bool boc_atomic_exchange_bool_explicit(boc_atomic_bool_t *p, + bool v, + boc_memory_order_t order) { + (void)order; + return (bool)_InterlockedExchange8((volatile char *)p, (char)v); +} + +static inline bool boc_atomic_compare_exchange_strong_bool_explicit( + boc_atomic_bool_t *p, bool *expected, bool desired, boc_memory_order_t succ, + boc_memory_order_t fail) { + (void)succ; + (void)fail; + char exp = (char)*expected; + char prev = + _InterlockedCompareExchange8((volatile char *)p, (char)desired, exp); + if (prev == exp) + return true; + *expected = (bool)prev; + return false; +} + +// ---- ptr ------------------------------------------------------------------- + +static inline void *boc_atomic_load_ptr_explicit(boc_atomic_ptr_t *p, + boc_memory_order_t order) { + // InterlockedCompareExchangePointerNoFence is the cleanest way to express + // a relaxed atomic pointer load, but a plain volatile read suffices on + // every supported target (pointer width matches the natural word size). 
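+  // Editor's note: as with atomic_load_explicit above, this is atomic
+  // but carries no acquire barrier on ARM64; callers needing ordering
+  // are expected to rely on the mutex-protected re-check (or on a
+  // boc_atomic_thread_fence_explicit) rather than on this load.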
+  (void)order;
+  return (void *)*p;
+}
+
+static inline void boc_atomic_store_ptr_explicit(boc_atomic_ptr_t *p, void *v,
+                                                 boc_memory_order_t order) {
+#if defined(_M_ARM64)
+  if (order == BOC_MO_RELAXED) {
+    *p = v;
+    return;
+  }
+  (void)InterlockedExchangePointer((PVOID volatile *)p, (PVOID)v);
+#else
+  (void)order;
+  *p = v;
+#endif
+}
+
+static inline void *boc_atomic_exchange_ptr_explicit(boc_atomic_ptr_t *p,
+                                                     void *v,
+                                                     boc_memory_order_t order) {
+  (void)order;
+  return (void *)InterlockedExchangePointer((PVOID volatile *)p, (PVOID)v);
+}
+
+static inline bool boc_atomic_compare_exchange_strong_ptr_explicit(
+    boc_atomic_ptr_t *p, void **expected, void *desired,
+    boc_memory_order_t succ, boc_memory_order_t fail) {
+  (void)succ;
+  (void)fail;
+  void *exp = *expected;
+  void *prev = InterlockedCompareExchangePointer((PVOID volatile *)p,
+                                                 (PVOID)desired, (PVOID)exp);
+  if (prev == exp)
+    return true;
+  *expected = prev;
+  return false;
+}
+
+// Standalone memory fence. `MemoryBarrier()` is a full hardware
+// barrier on every supported MSVC target (x86, x64, ARM64) and
+// matches the strongest standalone fence we ever need from this
+// helper. Mapping every `BOC_MO_*` to a full barrier is correct
+// (over-strong is safe; under-strong is not) and keeps the
+// implementation a one-liner.
+static inline void boc_atomic_thread_fence_explicit(boc_memory_order_t o) {
+  (void)o;
+  MemoryBarrier();
+}
+
+#else // _WIN32
+
+// ---------------------------------------------------------------------------
+// POSIX (Apple + Linux): shared headers, thread_local, yield, intptr aliases
+// ---------------------------------------------------------------------------
+
+#include <errno.h>
+#include <sched.h>
+#include <stdatomic.h>
+#include <time.h>
+
+#define thread_local _Thread_local
+#define boc_yield() sched_yield()
+
+// On POSIX the C11 atomic_* macros dispatch on type via _Generic, so the
+// `atomic_load(&intptr_var)` form Just Works. The `_intptr` siblings are
+// aliased to the generic forms purely so the source reads the same on
+// every platform; on Windows they expand to dedicated InterlockedXxxPointer
+// shims (see polyfill block above).
+#define atomic_load_intptr(ptr) atomic_load(ptr)
+#define atomic_store_intptr(ptr, val) atomic_store((ptr), (val))
+#define atomic_exchange_intptr(ptr, val) atomic_exchange((ptr), (val))
+#define atomic_compare_exchange_strong_intptr(ptr, expected, desired)         \
+  atomic_compare_exchange_strong((ptr), (expected), (desired))
+
+#ifdef __APPLE__
+
+// ---------------------------------------------------------------------------
+// Apple: pthread-based BOCMutex / BOCCond
+// ---------------------------------------------------------------------------
+
+#include <pthread.h>
+#define thrd_sleep nanosleep
+
+typedef pthread_mutex_t BOCMutex;
+typedef pthread_cond_t BOCCond;
+
+static inline void boc_mtx_init(BOCMutex *m) { pthread_mutex_init(m, NULL); }
+
+static inline void mtx_destroy(BOCMutex *m) { pthread_mutex_destroy(m); }
+
+static inline void mtx_lock(BOCMutex *m) { pthread_mutex_lock(m); }
+
+static inline void mtx_unlock(BOCMutex *m) { pthread_mutex_unlock(m); }
+
+static inline void cnd_init(BOCCond *c) { pthread_cond_init(c, NULL); }
+
+static inline void cnd_destroy(BOCCond *c) { pthread_cond_destroy(c); }
+
+static inline void cnd_signal(BOCCond *c) { pthread_cond_signal(c); }
+
+static inline void cnd_broadcast(BOCCond *c) { pthread_cond_broadcast(c); }
+
+static inline void cnd_wait(BOCCond *c, BOCMutex *m) {
+  pthread_cond_wait(c, m);
+}
+
+/// @brief Wait on a condition variable for at most @p seconds.
+/// @param c The condition variable
+/// @param m The mutex (must be held by caller)
+/// @return true if signalled (or spurious wake), false if the timeout expired
+static inline bool cnd_timedwait_s(BOCCond *c, BOCMutex *m, double seconds) {
+  // Negated form catches NaN (every comparison with NaN is false),
+  // which a bare `seconds < 0` test does not. Defence in depth
+  // for the public boundary helper `boc_validate_finite_timeout`.
+  if (!(seconds >= 0.0))
+    seconds = 0.0;
+  struct timespec ts;
+  clock_gettime(CLOCK_REALTIME, &ts);
+  double total = (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9 + seconds;
+  ts.tv_sec = (time_t)total;
+  ts.tv_nsec = (long)((total - (double)ts.tv_sec) * 1e9);
+  if (ts.tv_nsec >= 1000000000L) {
+    ts.tv_sec += 1;
+    ts.tv_nsec -= 1000000000L;
+  }
+  int rc = pthread_cond_timedwait(c, m, &ts);
+  return rc != ETIMEDOUT;
+}
+
+#else // __APPLE__
+
+// ---------------------------------------------------------------------------
+// Linux (and other non-Apple POSIX): C11 <threads.h>-based BOCMutex / BOCCond
+// ---------------------------------------------------------------------------
+
+#include <threads.h>
+
+typedef mtx_t BOCMutex;
+typedef cnd_t BOCCond;
+
+static inline void boc_mtx_init(BOCMutex *m) { mtx_init(m, mtx_plain); }
+
+/// @brief Wait on a condition variable for at most @p seconds.
+/// @param c The condition variable
+/// @param m The mutex (must be held by caller)
+/// @return true if signalled (or spurious wake), false if the timeout expired
+static inline bool cnd_timedwait_s(BOCCond *c, BOCMutex *m, double seconds) {
+  // Negated form catches NaN (every comparison with NaN is false),
+  // which a bare `seconds < 0` test does not. Defence in depth
+  // for the public boundary helper `boc_validate_finite_timeout`.
+  if (!(seconds >= 0.0))
+    seconds = 0.0;
+  struct timespec ts;
+  clock_gettime(CLOCK_REALTIME, &ts);
+  double total = (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9 + seconds;
+  ts.tv_sec = (time_t)total;
+  ts.tv_nsec = (long)((total - (double)ts.tv_sec) * 1e9);
+  if (ts.tv_nsec >= 1000000000L) {
+    ts.tv_sec += 1;
+    ts.tv_nsec -= 1000000000L;
+  }
+  int rc = cnd_timedwait(c, m, &ts);
+  return rc != thrd_timedout;
+}
+
+#endif // __APPLE__
+
+// ---------------------------------------------------------------------------
+// POSIX: typed `boc_atomic_*_explicit` API on top of <stdatomic.h>
+// ---------------------------------------------------------------------------
+//
+// The compiler folds these wrappers away. Legacy `atomic_*` callers
+// are unaffected; the new API is purely additive.
+
+typedef _Atomic uint64_t boc_atomic_u64_t;
+typedef _Atomic uint32_t boc_atomic_u32_t;
+typedef _Atomic bool boc_atomic_bool_t;
+typedef _Atomic(void *) boc_atomic_ptr_t;
+
+static inline memory_order boc_mo_to_std(boc_memory_order_t order) {
+  switch (order) {
+  case BOC_MO_RELAXED:
+    return memory_order_relaxed;
+  case BOC_MO_ACQUIRE:
+    return memory_order_acquire;
+  case BOC_MO_RELEASE:
+    return memory_order_release;
+  case BOC_MO_ACQ_REL:
+    return memory_order_acq_rel;
+  case BOC_MO_SEQ_CST:
+  default:
+    return memory_order_seq_cst;
+  }
+}
+
+#define BOC_ATOMIC_OPS_(SUF, T, AT)                                            \
+  static inline T boc_atomic_load_##SUF##_explicit(AT *p,                      \
+                                                   boc_memory_order_t o) {     \
+    return atomic_load_explicit(p, boc_mo_to_std(o));                          \
+  }                                                                            \
+  static inline void boc_atomic_store_##SUF##_explicit(AT *p, T v,             \
+                                                       boc_memory_order_t o) { \
+    atomic_store_explicit(p, v, boc_mo_to_std(o));                             \
+  }                                                                            \
+  static inline T boc_atomic_exchange_##SUF##_explicit(AT *p, T v,             \
+                                                       boc_memory_order_t o) { \
+    return atomic_exchange_explicit(p, v, boc_mo_to_std(o));                   \
+  }                                                                            \
+  static inline bool boc_atomic_compare_exchange_strong_##SUF##_explicit(      \
+      AT *p, T *expected, T desired, boc_memory_order_t succ,                  \
+      boc_memory_order_t fail) {                                               \
+    return atomic_compare_exchange_strong_explicit(                            \
+        p, expected, desired, boc_mo_to_std(succ), boc_mo_to_std(fail));       \
+  }
+
+BOC_ATOMIC_OPS_(u64, uint64_t, boc_atomic_u64_t)
+BOC_ATOMIC_OPS_(u32, uint32_t, boc_atomic_u32_t)
+BOC_ATOMIC_OPS_(bool, bool, boc_atomic_bool_t)
+
+// `ptr` carries a `void *` payload in `_Atomic(void *)` storage;
+// dedicated wrappers (rather than the macro) keep its signatures
+// identical to the Windows arm.
+static inline void *boc_atomic_load_ptr_explicit(boc_atomic_ptr_t *p,
+                                                 boc_memory_order_t o) {
+  return atomic_load_explicit(p, boc_mo_to_std(o));
+}
+static inline void boc_atomic_store_ptr_explicit(boc_atomic_ptr_t *p, void *v,
+                                                 boc_memory_order_t o) {
+  atomic_store_explicit(p, v, boc_mo_to_std(o));
+}
+static inline void *boc_atomic_exchange_ptr_explicit(boc_atomic_ptr_t *p,
+                                                     void *v,
+                                                     boc_memory_order_t o) {
+  return atomic_exchange_explicit(p, v, boc_mo_to_std(o));
+}
+static inline bool boc_atomic_compare_exchange_strong_ptr_explicit(
+    boc_atomic_ptr_t *p, void **expected, void *desired,
+    boc_memory_order_t succ, boc_memory_order_t fail) {
+  return atomic_compare_exchange_strong_explicit(
+      p, expected, desired, boc_mo_to_std(succ), boc_mo_to_std(fail));
+}
+
+#define BOC_ATOMIC_FETCH_OPS_(SUF, T, AT)                                      \
+  static inline T boc_atomic_fetch_add_##SUF##_explicit(                       \
+      AT *p, T v, boc_memory_order_t o) {                                      \
+    return atomic_fetch_add_explicit(p, v, boc_mo_to_std(o));                  \
+  }                                                                            \
+  static inline T boc_atomic_fetch_sub_##SUF##_explicit(                       \
+      AT *p, T v, boc_memory_order_t o) {                                      \
+    return atomic_fetch_sub_explicit(p, v, boc_mo_to_std(o));                  \
+  }
+
+BOC_ATOMIC_FETCH_OPS_(u64, uint64_t, boc_atomic_u64_t)
+BOC_ATOMIC_FETCH_OPS_(u32, uint32_t, boc_atomic_u32_t)
+
+#undef BOC_ATOMIC_OPS_
+#undef BOC_ATOMIC_FETCH_OPS_
+
+// Standalone memory fence. POSIX delegates to `atomic_thread_fence`
+// from `<stdatomic.h>`; the helper exists so MSVC can express the
+// same operation via `MemoryBarrier()` without C11 atomics.
+static inline void boc_atomic_thread_fence_explicit(boc_memory_order_t o) {
+  atomic_thread_fence(boc_mo_to_std(o));
+}
+
+#endif // _WIN32
+
+// ===========================================================================
+// Cross-platform monotonic time / sleep helpers
+// ===========================================================================
+
+/// @brief Returns the current wall-clock time as double-precision seconds.
+/// @return the current time
+double boc_now_s(void);
+
+/// @brief Returns a monotonic timestamp in nanoseconds.
+/// @details Uses @c CLOCK_MONOTONIC on POSIX and
+/// @c QueryPerformanceCounter on Windows. Unlike @ref boc_now_s the
+/// returned value is guaranteed monotonic non-decreasing within a
+/// single process: it is suitable for measuring elapsed durations
+/// (e.g. the work-stealing quiescence timeout) but not for wall-clock
+/// reporting. Wraps after ~584 years on a 64-bit unsigned counter; we
+/// only ever subtract two readings taken seconds apart, so wraparound
+/// is a non-issue.
+/// @return Monotonic time in nanoseconds since an unspecified epoch.
+uint64_t boc_now_ns(void);
+
+/// @brief Sleep the calling thread for at least @p ns nanoseconds.
+/// @details Thin wrapper around @ref thrd_sleep that hides the
+/// @c struct timespec construction so callers never need to include
+/// @c <time.h> just to back off. Splits @p ns into seconds plus a
+/// sub-second remainder so values larger than one second are
+/// representable.
+/// @param ns Nanoseconds to sleep. Zero is a no-op.
+void boc_sleep_ns(uint64_t ns);
+
+// ===========================================================================
+// Cross-platform timeout-validation helper
+// ===========================================================================
+//
+// Public boundary helper for the @c terminator_wait / @c notice_sync_wait
+// entry points.
Centralising the NaN/Inf/negative classification here
+// keeps the policy in one place: NaN is a programmer error and surfaces
+// as @c ValueError; +Inf is "wait forever"; negative is clamped to 0
+// (no-wait, returns immediately). Without this, NaN passed straight
+// to @c cnd_timedwait_s would compute @c DWORD ms via @c (DWORD)(NaN *
+// 1000.0) — undefined behaviour on Windows and a wedged-forever wait
+// on POSIX.
+//
+// Returns 0 on success (with @p *wait_forever set); -1 on failure with
+// a Python exception set.
+
+static inline int boc_validate_finite_timeout(double seconds,
+                                              double *out_seconds,
+                                              bool *out_wait_forever) {
+  // NaN: a comparison with NaN is always false, so `seconds == seconds`
+  // is the canonical portable NaN check (no math.h dependency).
+  if (seconds != seconds) {
+    PyErr_SetString(PyExc_ValueError, "timeout must not be NaN");
+    return -1;
+  }
+  // +Inf, or any value above the clamp, maps to wait_forever=true. A
+  // plain `>` comparison against a finite threshold keeps the helper
+  // free of math.h (no @c HUGE_VAL / @c isinf needed); the operational
+  // meaning is identical.
+  //
+  // We clamp at 1e9 seconds (~31.7 years) rather than at DBL_MAX so
+  // any caller-supplied value that would overflow `time_t` (signed
+  // 32-bit on some platforms: ~68 years) or the `DWORD` millisecond
+  // arg to Win32 `SleepConditionVariableSRW` (max ~49 days) also
+  // routes through the wait-forever path. Operationally a 31-year
+  // wait is indistinguishable from "wait forever" for any realistic
+  // bocpy caller, and the clamp is the only safe way to avoid
+  // platform-dependent overflow into a sub-second wait or UB.
+  if (seconds > 1e9) {
+    *out_seconds = 0.0;
+    *out_wait_forever = true;
+    return 0;
+  }
+  // Negative: caller asked for "no wait". Clamp to 0 and return; the
+  // wait helpers will short-circuit with a timeout immediately.
+  if (seconds < 0.0) {
+    *out_seconds = 0.0;
+    *out_wait_forever = false;
+    return 0;
+  }
+  *out_seconds = seconds;
+  *out_wait_forever = false;
+  return 0;
+}
+
+#endif // BOCPY_COMPAT_H
diff --git a/src/bocpy/cown.h b/src/bocpy/cown.h
new file mode 100644
index 0000000..5ff10ec
--- /dev/null
+++ b/src/bocpy/cown.h
@@ -0,0 +1,40 @@
+/// @file cown.h
+/// @brief Minimal cross-TU surface for the cown refcount API.
+///
+/// This header exists so that translation units other than `_core.c`
+/// (for now: `noticeboard.c`) can hold strong references to a
+/// `BOCCown` without needing to know its layout. The full struct
+/// definition and the implementation of @ref cown_incref / @ref
+/// cown_decref live in `_core.c`. The per-call cost of the indirect
+/// call at noticeboard call sites is negligible: every noticeboard
+/// mutation already takes a mutex and performs XIData serialization,
+/// both orders of magnitude more expensive than the indirect call.
+
+#ifndef BOCPY_COWN_H
+#define BOCPY_COWN_H
+
+#define PY_SSIZE_T_CLEAN
+
+#include <Python.h>
+#include <stdint.h>
+
+/// @brief Opaque forward declaration. The struct body lives in `_core.c`.
+typedef struct boc_cown BOCCown;
+
+/// @brief Python wrapper exposing a single @ref BOCCown to user code.
+typedef struct cown_capsule_object {
+  PyObject_HEAD BOCCown *cown;
+} CownCapsuleObject;
+
+/// @brief Acquire one strong reference on @p cown.
+/// @return The post-increment refcount.
+int_least64_t cown_incref(BOCCown *cown);
+
+/// @brief Release one strong reference on @p cown.
+/// @return The post-decrement refcount.
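+/// @note Dropping a reference can run Python `__del__` hooks (see the
+/// call-site comments in `noticeboard.c`), so callers defer decrefs
+/// until after any subsystem mutex is released.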
+int_least64_t cown_decref(BOCCown *cown);
+
+#define COWN_INCREF(c) cown_incref((c))
+#define COWN_DECREF(c) cown_decref((c))
+
+#endif // BOCPY_COWN_H
diff --git a/src/bocpy/noticeboard.c b/src/bocpy/noticeboard.c
new file mode 100644
index 0000000..77fe629
--- /dev/null
+++ b/src/bocpy/noticeboard.c
@@ -0,0 +1,704 @@
+/// @file noticeboard.c
+/// @brief Implementation of the global noticeboard subsystem.
+///
+/// See @ref noticeboard.h for the public API and the thread/PyErr
+/// discipline. This TU owns:
+///
+/// - The fixed-capacity entry table @c NB plus its mutex.
+/// - The monotonic version counter @c NB_VERSION.
+/// - The per-thread snapshot cache (dict, proxy, version, checked
+///   flag).
+/// - The single-writer thread-identity check (@c NB_NOTICEBOARD_TID).
+/// - The notice_sync barrier primitives (@c NB_SYNC_REQUESTED,
+///   @c NB_SYNC_PROCESSED, @c NB_SYNC_MUTEX, @c NB_SYNC_COND).
+
+#include "noticeboard.h"
+
+#include <string.h>
+
+// ---------------------------------------------------------------------------
+// File-scope state.
+// ---------------------------------------------------------------------------
+
+/// @brief A single noticeboard entry.
+typedef struct nb_entry {
+  /// @brief The key for this entry (null-terminated UTF-8).
+  char key[NB_KEY_SIZE];
+  /// @brief The serialized cross-interpreter data.
+  XIDATA_T *value;
+  /// @brief Whether the value was pickled during serialization.
+  bool pickled;
+  /// @brief BOCCowns referenced by @ref value, pinned by this entry.
+  BOCCown **pinned_cowns;
+  /// @brief Number of entries in @ref pinned_cowns.
+  int pinned_count;
+} NoticeboardEntry;
+
+/// @brief Global noticeboard for cross-behavior key-value storage.
+typedef struct noticeboard {
+  NoticeboardEntry entries[NB_MAX_ENTRIES];
+  int count;
+  BOCMutex mutex;
+} Noticeboard;
+
+static Noticeboard NB;
+
+/// @brief Monotonic version counter for the noticeboard.
+static atomic_int_least64_t NB_VERSION = 0;
+
+/// @brief Thread-local snapshot cache for the current behavior.
+static thread_local PyObject *NB_SNAPSHOT_CACHE = NULL;
+
+/// @brief Version of the noticeboard at the time the cached snapshot
+/// was built.
+static thread_local int_least64_t NB_SNAPSHOT_VERSION = -1;
+
+/// @brief Whether the cached snapshot has been version-checked this
+/// behavior.
+static thread_local bool NB_VERSION_CHECKED = false;
+
+/// @brief Read-only proxy wrapping the cached snapshot dict.
+static thread_local PyObject *NB_SNAPSHOT_PROXY = NULL;
+
+/// @brief Thread identity of the noticeboard mutator thread, or 0 if
+/// unset.
+static atomic_intptr_t NB_NOTICEBOARD_TID = 0;
+
+/// @brief Monotonic counter incremented by every notice_sync caller.
+static atomic_int_least64_t NB_SYNC_REQUESTED = 0;
+
+/// @brief Highest sequence number processed by the noticeboard thread.
+static atomic_int_least64_t NB_SYNC_PROCESSED = 0;
+
+/// @brief Mutex protecting NB_SYNC_COND.
+static BOCMutex NB_SYNC_MUTEX;
+
+/// @brief Condition variable signalled when NB_SYNC_PROCESSED advances.
+static BOCCond NB_SYNC_COND;
+
+// ---------------------------------------------------------------------------
+// Module init / teardown.
+// ---------------------------------------------------------------------------
+
+void noticeboard_init(void) {
+  memset(&NB, 0, sizeof(NB));
+  boc_mtx_init(&NB.mutex);
+  boc_mtx_init(&NB_SYNC_MUTEX);
+  cnd_init(&NB_SYNC_COND);
+}
+
+void noticeboard_destroy(void) {
+  // Drop the calling thread's snapshot cache before freeing entries.
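+  // (The cache is thread_local, so only the calling thread's copy can
+  // be dropped here; other threads revalidate against NB_VERSION on
+  // their next snapshot, as noted in noticeboard_clear below.)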
+ Py_CLEAR(NB_SNAPSHOT_PROXY); + Py_CLEAR(NB_SNAPSHOT_CACHE); + NB_SNAPSHOT_VERSION = -1; + NB_VERSION_CHECKED = false; + + // Collect entries to free after releasing the mutex — XIDATA_FREE + // and COWN_DECREF can run Python __del__ which may re-enter. + XIDATA_T *to_free[NB_MAX_ENTRIES]; + int to_free_count = 0; + BOCCown **to_unpin[NB_MAX_ENTRIES]; + int to_unpin_count[NB_MAX_ENTRIES]; + int to_unpin_entries = 0; + + mtx_lock(&NB.mutex); + for (int i = 0; i < NB.count; i++) { + if (NB.entries[i].value != NULL) { + to_free[to_free_count++] = NB.entries[i].value; + NB.entries[i].value = NULL; + } + if (NB.entries[i].pinned_cowns != NULL) { + to_unpin[to_unpin_entries] = NB.entries[i].pinned_cowns; + to_unpin_count[to_unpin_entries] = NB.entries[i].pinned_count; + to_unpin_entries++; + NB.entries[i].pinned_cowns = NULL; + NB.entries[i].pinned_count = 0; + } + } + NB.count = 0; + memset(NB.entries, 0, sizeof(NB.entries)); + mtx_unlock(&NB.mutex); + + for (int i = 0; i < to_free_count; i++) { + XIDATA_FREE(to_free[i]); + } + for (int i = 0; i < to_unpin_entries; i++) { + for (int j = 0; j < to_unpin_count[i]; j++) { + COWN_DECREF(to_unpin[i][j]); + } + PyMem_RawFree(to_unpin[i]); + } + + mtx_destroy(&NB.mutex); + // NB_SYNC_MUTEX / NB_SYNC_COND are SRWLOCK / CONDITION_VARIABLE on + // Windows (no destroy needed) and pthread / mtx_t on POSIX (handled + // by mtx_destroy / cnd_destroy in compat.h shims). The original + // _core.c module-free path never destroyed these; preserve that + // behaviour to keep the symbol-additions-only invariant. +} + +// --------------------------------------------------------------------------- +// Single-writer thread-identity check. +// --------------------------------------------------------------------------- + +int noticeboard_check_thread(const char *op_name) { + uintptr_t owner = (uintptr_t)atomic_load_intptr(&NB_NOTICEBOARD_TID); + if (owner == 0) { + return 0; + } + uintptr_t self_id = (uintptr_t)PyThread_get_thread_ident(); + if (owner != self_id) { + PyErr_Format(PyExc_RuntimeError, + "%s must be called from the noticeboard thread", op_name); + return -1; + } + return 0; +} + +int noticeboard_set_thread(void) { + intptr_t expected = 0; + intptr_t self_id = (intptr_t)(uintptr_t)PyThread_get_thread_ident(); + // One-shot per runtime: refuse if the slot is already owned. + // noticeboard_clear_thread() resets NB_NOTICEBOARD_TID to 0 at + // stop(), so a fresh start() cycle is fine. This closes the + // hijack-the-mutator-slot hole identified by the security lens. + if (!atomic_compare_exchange_strong_intptr(&NB_NOTICEBOARD_TID, &expected, + self_id)) { + PyErr_SetString(PyExc_RuntimeError, + "set_noticeboard_thread: noticeboard mutator thread " + "is already registered"); + return -1; + } + return 0; +} + +void noticeboard_clear_thread(void) { + (void)atomic_exchange_intptr(&NB_NOTICEBOARD_TID, (intptr_t)0); +} + +// --------------------------------------------------------------------------- +// Snapshot cache primitives. +// --------------------------------------------------------------------------- + +void noticeboard_drop_local_cache(void) { + Py_CLEAR(NB_SNAPSHOT_PROXY); + Py_CLEAR(NB_SNAPSHOT_CACHE); + NB_SNAPSHOT_VERSION = -1; + NB_VERSION_CHECKED = false; +} + +void noticeboard_cache_clear_for_behavior(void) { NB_VERSION_CHECKED = false; } + +int_least64_t noticeboard_version(void) { return atomic_load(&NB_VERSION); } + +// --------------------------------------------------------------------------- +// Pin helper. 
+// ---------------------------------------------------------------------------
+
+int nb_pin_cowns(PyObject *cowns, BOCCown ***out_array, int *out_count) {
+  *out_array = NULL;
+  *out_count = 0;
+
+  if (cowns == NULL || cowns == Py_None) {
+    return 0;
+  }
+
+  PyObject *seq =
+      PySequence_Fast(cowns, "noticeboard pin list must be a sequence");
+  if (seq == NULL) {
+    return -1;
+  }
+
+  Py_ssize_t n = PySequence_Fast_GET_SIZE(seq);
+  if (n == 0) {
+    Py_DECREF(seq);
+    return 0;
+  }
+
+  BOCCown **pins = (BOCCown **)PyMem_RawMalloc(sizeof(BOCCown *) * n);
+  if (pins == NULL) {
+    Py_DECREF(seq);
+    PyErr_NoMemory();
+    return -1;
+  }
+
+  int taken = 0;
+  for (Py_ssize_t i = 0; i < n; i++) {
+    PyObject *item = PySequence_Fast_GET_ITEM(seq, i);
+    BOCCown *cown = (BOCCown *)PyLong_AsVoidPtr(item);
+    if (cown == NULL) {
+      // PyLong_AsVoidPtr returns NULL both on error and for integer 0.
+      // Reject both paths explicitly: a NULL pin would be dereferenced
+      // downstream (COWN_DECREF on NULL is UB), and an integer 0 is
+      // indistinguishable from a crafted attacker pin pointing at the
+      // zero page.
+      if (!PyErr_Occurred()) {
+        PyErr_SetString(PyExc_ValueError,
+                        "noticeboard pin list must not contain NULL / "
+                        "integer 0 entries");
+      } else {
+        PyErr_SetString(PyExc_TypeError,
+                        "noticeboard pin list must contain only integer "
+                        "BOCCown pointers (use _core.cown_pin_pointers())");
+      }
+      goto fail;
+    }
+    pins[taken++] = cown;
+  }
+
+  Py_DECREF(seq);
+  *out_array = pins;
+  *out_count = taken;
+  return 0;
+
+fail:
+  // Release every transferred ref the writer pre-INCREFed for us. The
+  // item that failed (index `taken`) transferred nothing decref-able,
+  // so it is skipped. Preserve the pending exception across the loop:
+  // PyLong_AsVoidPtr on a malformed trailing item would overwrite it,
+  // and the PyErr_Clear below (needed for an integer-0 trailing item,
+  // which returns NULL *without* setting an error) would otherwise
+  // erase it, making the function return -1 with no exception set.
+  {
+    PyObject *exc_type, *exc_value, *exc_tb;
+    PyErr_Fetch(&exc_type, &exc_value, &exc_tb);
+    for (int i = 0; i < taken; i++) {
+      COWN_DECREF(pins[i]);
+    }
+    for (Py_ssize_t i = (Py_ssize_t)taken + 1; i < n; i++) {
+      PyObject *item = PySequence_Fast_GET_ITEM(seq, i);
+      BOCCown *c = (BOCCown *)PyLong_AsVoidPtr(item);
+      if (c != NULL) {
+        COWN_DECREF(c);
+      } else {
+        PyErr_Clear();
+      }
+    }
+    PyErr_Restore(exc_type, exc_value, exc_tb);
+  }
+  PyMem_RawFree(pins);
+  Py_DECREF(seq);
+  return -1;
+}
+
+// ---------------------------------------------------------------------------
+// Mutations.
+// ---------------------------------------------------------------------------
+
+int noticeboard_write(const char *key, Py_ssize_t key_len, XIDATA_T *xidata,
+                      bool pickled, BOCCown **pins, int pin_count) {
+  if (key_len >= NB_KEY_SIZE) {
+    PyErr_SetString(PyExc_ValueError,
+                    "noticeboard key too long (max 63 UTF-8 bytes)");
+    goto fail;
+  }
+  if (memchr(key, '\0', (size_t)key_len) != NULL) {
+    PyErr_SetString(PyExc_ValueError,
+                    "noticeboard key must not contain NUL characters");
+    goto fail;
+  }
+
+  mtx_lock(&NB.mutex);
+
+  NoticeboardEntry *target = NULL;
+  for (int i = 0; i < NB.count; i++) {
+    if (strncmp(NB.entries[i].key, key, NB_KEY_SIZE) == 0) {
+      target = &NB.entries[i];
+      break;
+    }
+  }
+
+  if (target == NULL) {
+    if (NB.count >= NB_MAX_ENTRIES) {
+      mtx_unlock(&NB.mutex);
+      PyErr_SetString(PyExc_RuntimeError, "Noticeboard is full (max 64)");
+      goto fail;
+    }
+    target = &NB.entries[NB.count++];
+    strncpy(target->key, key, NB_KEY_SIZE - 1);
+    target->key[NB_KEY_SIZE - 1] = '\0';
+    target->value = NULL;
+    target->pinned_cowns = NULL;
+    target->pinned_count = 0;
+  }
+
+  // Stash old value and old pins to free after releasing the mutex —
+  // XIDATA_FREE / COWN_DECREF may invoke Python __del__ which could
+  // re-enter the noticeboard.
+ XIDATA_T *old_value = target->value; + BOCCown **old_pins = target->pinned_cowns; + int old_pin_count = target->pinned_count; + + target->value = xidata; + target->pickled = pickled; + target->pinned_cowns = pins; + target->pinned_count = pin_count; + + atomic_fetch_add(&NB_VERSION, 1); + + mtx_unlock(&NB.mutex); + + if (old_value != NULL) { + XIDATA_FREE(old_value); + } + if (old_pins != NULL) { + for (int i = 0; i < old_pin_count; i++) { + COWN_DECREF(old_pins[i]); + } + PyMem_RawFree(old_pins); + } + return 0; + +fail: + // Roll back: free the new XIData and decref the new pins. + if (xidata != NULL) { + XIDATA_FREE(xidata); + } + if (pins != NULL) { + for (int i = 0; i < pin_count; i++) { + COWN_DECREF(pins[i]); + } + PyMem_RawFree(pins); + } + return -1; +} + +int noticeboard_delete(const char *key, Py_ssize_t key_len) { + if (key_len >= NB_KEY_SIZE) { + PyErr_SetString(PyExc_ValueError, + "noticeboard key too long (max 63 UTF-8 bytes)"); + return -1; + } + if (memchr(key, '\0', (size_t)key_len) != NULL) { + PyErr_SetString(PyExc_ValueError, + "noticeboard key must not contain NUL characters"); + return -1; + } + + XIDATA_T *deleted_value = NULL; + BOCCown **deleted_pins = NULL; + int deleted_pin_count = 0; + + mtx_lock(&NB.mutex); + int found = -1; + for (int i = 0; i < NB.count; i++) { + if (strncmp(NB.entries[i].key, key, NB_KEY_SIZE) == 0) { + found = i; + break; + } + } + + if (found >= 0) { + deleted_value = NB.entries[found].value; + deleted_pins = NB.entries[found].pinned_cowns; + deleted_pin_count = NB.entries[found].pinned_count; + + for (int i = found; i < NB.count - 1; i++) { + NB.entries[i] = NB.entries[i + 1]; + } + memset(&NB.entries[NB.count - 1], 0, sizeof(NoticeboardEntry)); + NB.count--; + + atomic_fetch_add(&NB_VERSION, 1); + } + mtx_unlock(&NB.mutex); + + if (deleted_value != NULL) { + XIDATA_FREE(deleted_value); + } + if (deleted_pins != NULL) { + for (int i = 0; i < deleted_pin_count; i++) { + COWN_DECREF(deleted_pins[i]); + } + PyMem_RawFree(deleted_pins); + } + return 0; +} + +void noticeboard_clear(void) { + XIDATA_T *to_free[NB_MAX_ENTRIES]; + int to_free_count = 0; + BOCCown **to_unpin[NB_MAX_ENTRIES]; + int to_unpin_count[NB_MAX_ENTRIES]; + int to_unpin_entries = 0; + + mtx_lock(&NB.mutex); + for (int i = 0; i < NB.count; i++) { + if (NB.entries[i].value != NULL) { + to_free[to_free_count++] = NB.entries[i].value; + NB.entries[i].value = NULL; + } + if (NB.entries[i].pinned_cowns != NULL) { + to_unpin[to_unpin_entries] = NB.entries[i].pinned_cowns; + to_unpin_count[to_unpin_entries] = NB.entries[i].pinned_count; + to_unpin_entries++; + NB.entries[i].pinned_cowns = NULL; + NB.entries[i].pinned_count = 0; + } + } + NB.count = 0; + memset(NB.entries, 0, sizeof(NB.entries)); + atomic_fetch_add(&NB_VERSION, 1); + mtx_unlock(&NB.mutex); + + for (int i = 0; i < to_free_count; i++) { + XIDATA_FREE(to_free[i]); + } + for (int i = 0; i < to_unpin_entries; i++) { + for (int j = 0; j < to_unpin_count[i]; j++) { + COWN_DECREF(to_unpin[i][j]); + } + PyMem_RawFree(to_unpin[i]); + } + + // Drop this thread's cache so a subsequent same-thread snapshot + // does not reuse a stale proxy. Other threads will revalidate via + // NB_VERSION. + noticeboard_drop_local_cache(); +} + +// --------------------------------------------------------------------------- +// Snapshot. 
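+//
+// Reader-side cache protocol (a sketch of the states the code below
+// walks through, not extra API):
+//
+//   behavior boundary: worker loop calls noticeboard_cache_clear_for_behavior()
+//                      -> NB_VERSION_CHECKED = false
+//   first call:        one atomic load of NB_VERSION;
+//                      match    -> reuse NB_SNAPSHOT_PROXY
+//                      mismatch -> drop cache, rebuild dict and proxy
+//   repeat calls:      NB_VERSION_CHECKED == true -> reuse, no atomics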
+// ---------------------------------------------------------------------------
+
+PyObject *noticeboard_snapshot(PyObject *loads) {
+  if (NB_SNAPSHOT_PROXY != NULL) {
+    if (NB_VERSION_CHECKED) {
+      // Within-behavior repeat call: same proxy, no atomic load.
+      Py_INCREF(NB_SNAPSHOT_PROXY);
+      return NB_SNAPSHOT_PROXY;
+    }
+    // First snapshot call in this behavior: do exactly one version check.
+    int_least64_t current = atomic_load(&NB_VERSION);
+    if (current == NB_SNAPSHOT_VERSION) {
+      NB_VERSION_CHECKED = true;
+      Py_INCREF(NB_SNAPSHOT_PROXY);
+      return NB_SNAPSHOT_PROXY;
+    }
+    noticeboard_drop_local_cache();
+  }
+
+  PyObject *dict = PyDict_New();
+  if (dict == NULL) {
+    return NULL;
+  }
+
+  // Deferred entries: pickled values whose bytes were extracted under
+  // mutex but need unpickling outside the lock.
+  PyObject *deferred_keys[NB_MAX_ENTRIES];
+  PyObject *deferred_bytes[NB_MAX_ENTRIES];
+  int deferred_count = 0;
+
+  // Keepalive pins: while we hold the mutex we take an extra
+  // COWN_INCREF on every pin reachable from a deferred (pickled)
+  // entry. The bytes we are about to unpickle outside the mutex
+  // contain raw BOCCown pointers whose validity depends on the
+  // entry's pin list. Without this extra ref, a concurrent writer
+  // could overwrite the entry the instant we drop the mutex, release
+  // the old pins, and free the BOCCowns before we touch them — UAF
+  // in _cown_capsule_from_pointer. Released after the deferred
+  // unpickling completes. Each deferred entry contributes a heap-
+  // allocated pin pointer array sized to its pin count.
+  BOCCown **keepalive_pins[NB_MAX_ENTRIES];
+  int keepalive_counts[NB_MAX_ENTRIES];
+  for (int i = 0; i < NB_MAX_ENTRIES; i++) {
+    keepalive_pins[i] = NULL;
+    keepalive_counts[i] = 0;
+  }
+
+  mtx_lock(&NB.mutex);
+
+  // Capture the noticeboard version while still holding the mutex.
+  // The entry walk below and this capture are thereby atomic with
+  // respect to writers; a capture taken after unlock could pair a
+  // stale entry walk with a newer version and poison the cache.
+  int_least64_t built_version = atomic_load(&NB_VERSION);
+
+  for (int i = 0; i < NB.count; i++) {
+    NoticeboardEntry *entry = &NB.entries[i];
+    if (entry->value == NULL) {
+      continue;
+    }
+
+    // XIDATA_NEWOBJECT is lightweight (no Python code execution).
+    PyObject *raw = XIDATA_NEWOBJECT(entry->value);
+    if (raw == NULL) {
+      mtx_unlock(&NB.mutex);
+      goto fail_deferred;
+    }
+
+    PyObject *key = PyUnicode_FromString(entry->key);
+    if (key == NULL) {
+      Py_DECREF(raw);
+      mtx_unlock(&NB.mutex);
+      goto fail_deferred;
+    }
+
+    if (!entry->pickled) {
+      // Non-pickled: add directly to dict.
+      if (PyDict_SetItem(dict, key, raw) < 0) {
+        Py_DECREF(key);
+        Py_DECREF(raw);
+        mtx_unlock(&NB.mutex);
+        goto fail_deferred;
+      }
+      Py_DECREF(key);
+      Py_DECREF(raw);
+    } else {
+      // Pickled: defer unpickling to outside the mutex. Take a fresh
+      // COWN_INCREF on every pin so the BOCCowns referenced by the
+      // bytes survive past mtx_unlock — see keepalive_pins comment.
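+      // Interleaving guarded against (illustrative; C is some pinned cown):
+      //   reader: mtx_unlock(); about to unpickle bytes referencing C
+      //   writer: noticeboard_write(same key) drops the old pin list;
+      //           COWN_DECREF(C) frees C
+      //   reader: unpickle resolves C's raw pointer -> use-after-free
+      // The COWN_INCREFs below keep C alive across that gap.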
+ if (entry->pinned_count > 0) { + BOCCown **pins = (BOCCown **)PyMem_RawMalloc(sizeof(BOCCown *) * + entry->pinned_count); + if (pins == NULL) { + Py_DECREF(key); + Py_DECREF(raw); + mtx_unlock(&NB.mutex); + PyErr_NoMemory(); + goto fail_deferred; + } + for (int j = 0; j < entry->pinned_count; j++) { + pins[j] = entry->pinned_cowns[j]; + COWN_INCREF(pins[j]); + } + keepalive_pins[deferred_count] = pins; + keepalive_counts[deferred_count] = entry->pinned_count; + } + deferred_keys[deferred_count] = key; + deferred_bytes[deferred_count] = raw; + deferred_count++; + } + } + + mtx_unlock(&NB.mutex); + + // Unpickle deferred entries outside the mutex. + for (int i = 0; i < deferred_count; i++) { + PyObject *value = PyObject_CallOneArg(loads, deferred_bytes[i]); + Py_DECREF(deferred_bytes[i]); + deferred_bytes[i] = NULL; + + if (value == NULL) { + Py_DECREF(deferred_keys[i]); + deferred_keys[i] = NULL; + // Clean up remaining deferred entries. + for (int j = i + 1; j < deferred_count; j++) { + Py_DECREF(deferred_keys[j]); + Py_DECREF(deferred_bytes[j]); + } + // Release every keepalive pin (including the one for this + // entry). + for (int j = 0; j < deferred_count; j++) { + if (keepalive_pins[j] != NULL) { + for (int k = 0; k < keepalive_counts[j]; k++) { + COWN_DECREF(keepalive_pins[j][k]); + } + PyMem_RawFree(keepalive_pins[j]); + keepalive_pins[j] = NULL; + } + } + Py_DECREF(dict); + return NULL; + } + + if (PyDict_SetItem(dict, deferred_keys[i], value) < 0) { + Py_DECREF(deferred_keys[i]); + Py_DECREF(value); + for (int j = i + 1; j < deferred_count; j++) { + Py_DECREF(deferred_keys[j]); + Py_DECREF(deferred_bytes[j]); + } + for (int j = 0; j < deferred_count; j++) { + if (keepalive_pins[j] != NULL) { + for (int k = 0; k < keepalive_counts[j]; k++) { + COWN_DECREF(keepalive_pins[j][k]); + } + PyMem_RawFree(keepalive_pins[j]); + keepalive_pins[j] = NULL; + } + } + Py_DECREF(dict); + return NULL; + } + + Py_DECREF(deferred_keys[i]); + Py_DECREF(value); + + // Successful unpickle: the snapshot dict (and its CownCapsules) + // now hold their own refs on every BOCCown referenced by the + // bytes. Drop our keepalive pin for this entry. + if (keepalive_pins[i] != NULL) { + for (int k = 0; k < keepalive_counts[i]; k++) { + COWN_DECREF(keepalive_pins[i][k]); + } + PyMem_RawFree(keepalive_pins[i]); + keepalive_pins[i] = NULL; + } + } + + PyObject *proxy = PyDictProxy_New(dict); + if (proxy == NULL) { + Py_DECREF(dict); + return NULL; + } + + // The proxy holds a strong reference to dict; we keep our own as + // well so that the dict is reachable for direct mutation in the + // rebuild path and the proxy survives at least as long as the dict. + NB_SNAPSHOT_CACHE = dict; + NB_SNAPSHOT_PROXY = proxy; + NB_SNAPSHOT_VERSION = built_version; + NB_VERSION_CHECKED = true; + Py_INCREF(proxy); + return proxy; + +fail_deferred: + for (int i = 0; i < deferred_count; i++) { + Py_DECREF(deferred_keys[i]); + Py_DECREF(deferred_bytes[i]); + if (keepalive_pins[i] != NULL) { + for (int k = 0; k < keepalive_counts[i]; k++) { + COWN_DECREF(keepalive_pins[i][k]); + } + PyMem_RawFree(keepalive_pins[i]); + keepalive_pins[i] = NULL; + } + } + Py_DECREF(dict); + return NULL; +} + +// --------------------------------------------------------------------------- +// notice_sync barrier. 
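+//
+// Protocol sketch (illustrative; the real call sites are the Python-level
+// notice_sync helpers and the noticeboard thread's drain loop):
+//
+//   caller:             int_least64_t seq = notice_sync_request();
+//                       ...enqueue a marker on the boc_noticeboard tag...
+//                       notice_sync_wait(seq, timeout, /*wait_forever=*/false);
+//   noticeboard thread: ...drains the marker...
+//                       notice_sync_complete(seq);  // wakes the waiter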
+// ---------------------------------------------------------------------------
+
+int_least64_t notice_sync_request(void) {
+  return atomic_fetch_add(&NB_SYNC_REQUESTED, 1) + 1;
+}
+
+void notice_sync_complete(int_least64_t seq) {
+  mtx_lock(&NB_SYNC_MUTEX);
+  // Defense in depth: with a single noticeboard thread draining the
+  // FIFO boc_noticeboard tag, `seq` arrives strictly monotonically
+  // and a plain `atomic_store(seq)` would be correct. We keep the
+  // max-of pattern so that if a future change introduces a second
+  // mutator thread or any out-of-order delivery, NB_SYNC_PROCESSED
+  // can never regress and unblock waiters early.
+  int_least64_t cur = atomic_load(&NB_SYNC_PROCESSED);
+  if (seq > cur) {
+    atomic_store(&NB_SYNC_PROCESSED, seq);
+  }
+  cnd_broadcast(&NB_SYNC_COND);
+  mtx_unlock(&NB_SYNC_MUTEX);
+}
+
+bool notice_sync_wait(int_least64_t seq, double timeout, bool wait_forever) {
+  bool ok = true;
+  double end_time = wait_forever ? 0.0 : boc_now_s() + timeout;
+
+  mtx_lock(&NB_SYNC_MUTEX);
+  while (atomic_load(&NB_SYNC_PROCESSED) < seq) {
+    if (!wait_forever) {
+      double now = boc_now_s();
+      if (now >= end_time) {
+        ok = false;
+        break;
+      }
+      cnd_timedwait_s(&NB_SYNC_COND, &NB_SYNC_MUTEX, end_time - now);
+    } else {
+      cnd_wait(&NB_SYNC_COND, &NB_SYNC_MUTEX);
+    }
+  }
+  mtx_unlock(&NB_SYNC_MUTEX);
+  return ok;
+}
diff --git a/src/bocpy/noticeboard.h b/src/bocpy/noticeboard.h
new file mode 100644
index 0000000..0d097c6
--- /dev/null
+++ b/src/bocpy/noticeboard.h
@@ -0,0 +1,156 @@
+/// @file noticeboard.h
+/// @brief Public API for the global cross-behavior key-value noticeboard.
+///
+/// The noticeboard is a fixed-capacity table (max @ref NB_MAX_ENTRIES
+/// entries, each keyed by a UTF-8 string of up to @ref NB_KEY_SIZE-1
+/// bytes) holding cross-interpreter data plus a list of pinned
+/// @ref BOCCown references that the entry's value depends on.
+///
+/// **Thread model.** All mutations (@ref noticeboard_write,
+/// @ref noticeboard_delete, @ref noticeboard_clear) must be called
+/// from the **noticeboard thread** registered via
+/// @ref noticeboard_set_thread; the runtime guarantees this single-
+/// writer invariant, which removes the TOCTOU window from
+/// Python-level read-modify-write helpers (e.g. @c notice_update).
+/// Snapshot reads (@ref noticeboard_snapshot) are unrestricted —
+/// readers cache the result thread-locally and revalidate against
+/// @ref noticeboard_version once per behavior boundary.
+///
+/// **PyErr discipline.** Functions that interact with the Python C
+/// API (@ref noticeboard_snapshot, @ref nb_pin_cowns,
+/// @ref noticeboard_write, @ref noticeboard_delete) set a Python
+/// exception and return -1 / NULL on failure. Functions that are
+/// pure C (@ref noticeboard_clear, @ref noticeboard_version,
+/// @ref notice_sync_*) cannot fail.

+#ifndef BOCPY_NOTICEBOARD_H
+#define BOCPY_NOTICEBOARD_H
+
+#define PY_SSIZE_T_CLEAN
+
+#include <Python.h>
+#include <stdbool.h>
+#include <stdint.h>
+
+#include "compat.h"
+#include "cown.h"
+#include "xidata.h"
+
+/// @brief Maximum number of entries the noticeboard can hold.
+#define NB_MAX_ENTRIES 64
+
+/// @brief Maximum size of a key, including the trailing NUL byte.
+#define NB_KEY_SIZE 64
+
+/// @brief Initialize the noticeboard's mutex and notice_sync primitives.
+/// @details Called once at module init.
+void noticeboard_init(void);
+
+/// @brief Drain remaining entries (XIData + pins) and tear down primitives.
+/// @details Called once at module free.
Drops the calling thread's
+/// snapshot cache and frees every entry's @c XIDATA_T plus every
+/// pinned cown ref.
+void noticeboard_destroy(void);
+
+/// @brief Register the calling thread as the sole noticeboard mutator.
+/// @details One-shot per runtime cycle: returns 0 on success, -1 with a
+/// Python @c RuntimeError set if a mutator thread is already registered
+/// (including a repeat call from the same thread). The slot is reset by
+/// @ref noticeboard_clear_thread at stop().
+int noticeboard_set_thread(void);
+
+/// @brief Forget the registered noticeboard mutator thread.
+/// @details Used during Python @c Behaviors.stop after the noticeboard
+/// thread has joined. Always succeeds.
+void noticeboard_clear_thread(void);
+
+/// @brief Reject a noticeboard mutation called from the wrong thread.
+/// @details Returns 0 if the calling thread is the registered mutator
+/// (or if no mutator has been registered yet — covers test/main-thread
+/// startup). Returns -1 with a Python @c RuntimeError set otherwise.
+/// @param op_name The operation name to embed in the error message.
+int noticeboard_check_thread(const char *op_name);
+
+/// @brief Drop the calling thread's cached snapshot dict and proxy.
+void noticeboard_drop_local_cache(void);
+
+/// @brief Mark the calling thread's cache as needing one version check.
+/// @details Called by the worker loop at every behavior boundary so
+/// the next @ref noticeboard_snapshot in this thread does exactly one
+/// atomic load against @ref noticeboard_version before reusing the
+/// cached proxy. Cheaper than dropping the cache outright.
+void noticeboard_cache_clear_for_behavior(void);
+
+/// @brief Read the noticeboard's monotonic version counter.
+int_least64_t noticeboard_version(void);
+
+/// @brief Walk a Python sequence of integer cown pointers, returning the
+/// underlying @ref BOCCown array.
+/// @details Each pointer in @p cowns is interpreted as a raw
+/// @ref BOCCown pointer (via @c PyLong_AsVoidPtr). The caller is
+/// expected to have pre-INCREFed each cown before passing the
+/// sequence in (the noticeboard adopts those refs on success). On
+/// failure, every transferred ref is rolled back and the output
+/// pointer is left NULL.
+/// @param cowns Sequence of integer pointer values, or @c Py_None.
+/// @param[out] out_array Heap-allocated array (PyMem_RawMalloc) of
+/// cown pointers. The caller is responsible for freeing it
+/// with @c PyMem_RawFree.
+/// @param[out] out_count Number of valid entries in @p out_array.
+/// @return 0 on success, -1 on failure (PyErr set).
+int nb_pin_cowns(PyObject *cowns, BOCCown ***out_array, int *out_count);
+
+/// @brief Write or overwrite a noticeboard entry.
+/// @details On success, the noticeboard takes ownership of @p xidata
+/// and the @p pins array (and the strong refs the caller pre-INCREFed
+/// onto each cown). On failure, @p xidata is freed via @c XIDATA_FREE
+/// and every pin is COWN_DECREFed before @c PyMem_RawFree(@p pins).
+/// @param key UTF-8 key (must be NUL-free, up to @ref NB_KEY_SIZE-1
+/// bytes long).
+/// @param key_len Length of @p key in bytes (does NOT include any
+/// trailing NUL).
+/// @param xidata Serialized value; ownership transferred on success.
+/// @param pickled Whether @p xidata holds pickled bytes.
+/// @param pins Heap-allocated cown pin array; ownership transferred
+/// on success. May be NULL when @p pin_count is 0.
+/// @param pin_count Number of entries in @p pins.
+/// @return 0 on success, -1 on failure (PyErr set; @p xidata and
+/// @p pins are freed).
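+///
+/// Ownership sketch (illustrative; @c make_xidata and @c ptr_list are
+/// hypothetical stand-ins for the caller's serialisation step and pin
+/// sequence, not part of this API):
+/// @code
+///   XIDATA_T *xd = make_xidata(obj);
+///   BOCCown **pins; int n;
+///   if (nb_pin_cowns(ptr_list, &pins, &n) < 0) return -1;
+///   // Succeed or fail, xd and pins are consumed from here on.
+///   return noticeboard_write(key, key_len, xd, /*pickled=*/true, pins, n);
+/// @endcode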
+int noticeboard_write(const char *key, Py_ssize_t key_len, XIDATA_T *xidata,
+                      bool pickled, BOCCown **pins, int pin_count);
+
+/// @brief Delete a single noticeboard entry by key.
+/// @details The entry's @c XIDATA_T is freed and all pinned cowns are
+/// COWN_DECREFed (after the noticeboard mutex is released). It is not
+/// an error for the key to be absent.
+/// @return 0 on success, -1 on failure (PyErr set; e.g. key validation).
+int noticeboard_delete(const char *key, Py_ssize_t key_len);
+
+/// @brief Drop every entry, freeing XIData and pins.
+/// @details Bumps @ref noticeboard_version. Cannot fail.
+void noticeboard_clear(void);
+
+/// @brief Build (or reuse) the calling thread's read-only snapshot proxy.
+/// @details See the snapshot-cache notes in noticeboard.c for cache
+/// semantics. The returned proxy holds a strong reference to a dict
+/// that maps every noticeboard key to the deserialized value. Pickled
+/// values are unpickled outside the noticeboard mutex using @p loads
+/// as the callable.
+/// @param loads The @c pickle.loads callable (caller-owned reference).
+/// @return New strong reference to the proxy, or NULL on failure
+/// (PyErr set).
+PyObject *noticeboard_snapshot(PyObject *loads);
+
+/// @brief Reserve a fresh notice_sync sequence number.
+int_least64_t notice_sync_request(void);
+
+/// @brief Mark @p seq as processed and wake any @ref notice_sync_wait
+/// callers.
+void notice_sync_complete(int_least64_t seq);
+
+/// @brief Block the calling thread until @p seq has been processed.
+/// @param seq The sequence number returned by @ref notice_sync_request.
+/// @param timeout Maximum wait in seconds. Ignored if @p wait_forever.
+/// @param wait_forever If true, ignore @p timeout and wait until signalled.
+/// @return true if @p seq has been processed, false on timeout.
bool notice_sync_wait(int_least64_t seq, double timeout, bool wait_forever);
+
+#endif  // BOCPY_NOTICEBOARD_H
diff --git a/src/bocpy/sched.c b/src/bocpy/sched.c
new file mode 100644
index 0000000..c555a2e
--- /dev/null
+++ b/src/bocpy/sched.c
@@ -0,0 +1,1383 @@
+// sched.c — Work-stealing scheduler.
+//
+// Owns the per-worker MPMC queues, parking protocol, work-stealing,
+// and per-worker fairness tokens.
+//
+// Verona reference: `verona-rt/src/rt/sched/schedulerstats.h` (counter
+// POD subset), `mpmcq.h` (MPMC queue), `schedulerthread.h`
+// (`get_work` / `try_steal` / `steal`), `threadpool.h` (per-start
+// `incarnation` counter; pause/unpause epoch protocol), and
+// `core.h` (fairness token).
+
+#include "sched.h"
+
+#include <assert.h>
+#include <stdbool.h>
+#include <stddef.h>
+#include <stdint.h>
+#include <string.h>
+
+#include <Python.h>
+
+// ===========================================================================
+// Verona MPMC behaviour queue (`boc_bq_*`) — port of
+// `verona-rt/src/rt/sched/mpmcq.h`. Memory orderings match Verona
+// line-for-line. Cited line numbers refer to that file.
+// ===========================================================================
+
+void boc_bq_init(boc_bq_t *q) {
+  // Empty representation: back == &front, front == NULL (mpmcq.h:33-37).
+  // Use relaxed stores during init: callers must publish the queue
+  // through their own release edge before any thread observes it.
+  boc_atomic_store_ptr_explicit(&q->front, NULL, BOC_MO_RELAXED);
+  boc_atomic_store_ptr_explicit(&q->back, &q->front, BOC_MO_RELAXED);
+}
+
+void boc_bq_destroy_assert_empty(boc_bq_t *q) {
+  // Mirrors ~MPMCQ (mpmcq.h:213-217).
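+  // Empty representation at teardown: back == &front and front == NULL;
+  // destroying a queue with any node still linked trips the assert below.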
+ assert(boc_bq_is_empty(q)); + (void)q; +} + +boc_bq_node_t *boc_bq_acquire_front(boc_bq_t *q) { + // Mirrors MPMCQ::acquire_front (mpmcq.h:41-56). + BOC_SCHED_YIELD(); + + // Nothing in the queue (mpmcq.h:46). + if (boc_atomic_load_ptr_explicit(&q->front, BOC_MO_RELAXED) == NULL) { + return NULL; + } + + BOC_SCHED_YIELD(); + + // Remove head element. This is like locking the queue for other + // removals (mpmcq.h:55). + return (boc_bq_node_t *)boc_atomic_exchange_ptr_explicit(&q->front, NULL, + BOC_MO_ACQUIRE); +} + +void boc_bq_enqueue_segment(boc_bq_t *q, boc_bq_segment_t s) { + // Mirrors MPMCQ::enqueue_segment (mpmcq.h:97-115). + BOC_SCHED_YIELD(); + + // The element we are writing into must have its next pointer NULL + // before the back-exchange (mpmcq.h:103); writes to the segment's + // tail link use relaxed because the publish below carries the + // happens-before edge. + boc_atomic_store_ptr_explicit(s.end, NULL, BOC_MO_RELAXED); + + BOC_SCHED_YIELD(); + + boc_atomic_ptr_t *b = (boc_atomic_ptr_t *)boc_atomic_exchange_ptr_explicit( + &q->back, s.end, BOC_MO_ACQ_REL); + + BOC_SCHED_YIELD(); + + // The previous back's slot must currently be NULL (its enqueuer set + // it that way); we now publish our segment's start there with a + // release store so consumers reading through next_in_queue with + // acquire see all the segment's writes (mpmcq.h:113). + assert(boc_atomic_load_ptr_explicit(b, BOC_MO_RELAXED) == NULL); + boc_atomic_store_ptr_explicit(b, s.start, BOC_MO_RELEASE); +} + +void boc_bq_enqueue(boc_bq_t *q, boc_bq_node_t *n) { + // Mirrors MPMCQ::enqueue (mpmcq.h:118-121). + boc_bq_segment_t s = {n, &n->next_in_queue}; + boc_bq_enqueue_segment(q, s); +} + +void boc_bq_enqueue_front(boc_bq_t *q, boc_bq_node_t *n) { + // Mirrors MPMCQ::enqueue_front (mpmcq.h:123-135). + boc_bq_node_t *old_front = boc_bq_acquire_front(q); + if (old_front == NULL) { + // Post to back (mpmcq.h:128). + boc_bq_enqueue(q, n); + return; + } + + // Link into the front (mpmcq.h:132-134). + boc_atomic_store_ptr_explicit(&n->next_in_queue, old_front, BOC_MO_RELAXED); + boc_atomic_store_ptr_explicit(&q->front, n, BOC_MO_RELEASE); +} + +boc_bq_node_t *boc_bq_dequeue(boc_bq_t *q) { + // Mirrors MPMCQ::dequeue (mpmcq.h:140-184). + boc_bq_node_t *old_front = boc_bq_acquire_front(q); + + BOC_SCHED_YIELD(); + + // Queue is empty or someone else is stealing (mpmcq.h:147-150). + if (old_front == NULL) { + return NULL; + } + + boc_bq_node_t *new_front = (boc_bq_node_t *)boc_atomic_load_ptr_explicit( + &old_front->next_in_queue, BOC_MO_ACQUIRE); + + BOC_SCHED_YIELD(); + + if (new_front != NULL) { + // Remove one element from the queue (mpmcq.h:158-160). + boc_atomic_store_ptr_explicit(&q->front, new_front, BOC_MO_RELEASE); + return old_front; + } + + BOC_SCHED_YIELD(); + + // Queue contains a single element, attempt to close the queue + // (mpmcq.h:165-176). The expected `back` value is the address of the + // singleton node's `next_in_queue` slot; the desired value is the + // address of `q->front`, restoring the empty representation. + void *expected = &old_front->next_in_queue; + if (boc_atomic_compare_exchange_strong_ptr_explicit( + &q->back, &expected, &q->front, BOC_MO_ACQ_REL, BOC_MO_RELAXED)) { + return old_front; + } + + BOC_SCHED_YIELD(); + + // Failed to close the queue, something is being added; restore the + // front and let the caller retry (mpmcq.h:181-183). 
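+  // Callers treat this NULL like "transiently empty": the worker pop
+  // path falls through to steal/park and re-runs the dequeue on its
+  // next loop iteration, by which point the in-flight enqueue has
+  // linked its node and the retry can succeed.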
+ boc_atomic_store_ptr_explicit(&q->front, old_front, BOC_MO_RELEASE); + return NULL; +} + +boc_bq_segment_t boc_bq_dequeue_all(boc_bq_t *q) { + // Mirrors MPMCQ::dequeue_all (mpmcq.h:189-203). + boc_bq_node_t *old_front = boc_bq_acquire_front(q); + + // Queue is empty or someone else is popping (mpmcq.h:194-197). + if (old_front == NULL) { + boc_bq_segment_t empty = {NULL, NULL}; + return empty; + } + + BOC_SCHED_YIELD(); + + boc_atomic_ptr_t *old_back = + (boc_atomic_ptr_t *)boc_atomic_exchange_ptr_explicit(&q->back, &q->front, + BOC_MO_ACQ_REL); + + BOC_SCHED_YIELD(); + + boc_bq_segment_t out = {old_front, old_back}; + return out; +} + +boc_bq_node_t *boc_bq_segment_take_one(boc_bq_segment_t *s) { + // Mirrors MPMCQ::Segment::take_one (mpmcq.h:67-89). + boc_bq_node_t *n = s->start; + if (n == NULL) { + return NULL; + } + + BOC_SCHED_YIELD(); + + boc_bq_node_t *next = (boc_bq_node_t *)boc_atomic_load_ptr_explicit( + &n->next_in_queue, BOC_MO_ACQUIRE); + if (next == NULL) { + return NULL; + } + + s->start = next; + return n; +} + +bool boc_bq_is_empty(boc_bq_t *q) { + // Mirrors MPMCQ::is_empty (mpmcq.h:206-210). + BOC_SCHED_YIELD(); + return boc_atomic_load_ptr_explicit(&q->back, BOC_MO_RELAXED) == &q->front; +} + +// =========================================================================== +// Per-worker scheduler state +// =========================================================================== + +// The per-worker struct (`boc_sched_worker_t`) is defined in `sched.h` +// so dispatch and pop call sites can refer to its fields without an +// extra indirection. Cacheline padding and `static_assert`s live with +// the type definition. + +// --------------------------------------------------------------------------- +// File-scope state +// --------------------------------------------------------------------------- + +/// @brief Per-worker array, length @ref WORKER_COUNT. NULL when the +/// scheduler module is in the down state. +static boc_sched_worker_t *WORKERS = NULL; + +/// @brief Length of @ref WORKERS. Zero when in the down state. +/// +/// Atomic so off-worker producers in @c boc_sched_dispatch can +/// acquire-load it and observe the runtime-down sentinel (0) +/// before they could observe the freed @ref WORKERS array. The +/// shutdown side release-stores 0 here to publish that ordering; +/// the dispatch side acquire-loads to consume it. Worker-internal +/// reads (loop bounds, registration overflow) use relaxed loads +/// because the worker-shutdown handshake serialises them against +/// the @ref boc_sched_shutdown store. The underlying value is +/// non-negative; we use @c u64 for type-uniformity with the rest +/// of the atomic block. +static boc_atomic_u64_t WORKER_COUNT = 0; + +/// @brief Per-start incarnation counter. Atomic for the same reason +/// as @ref WORKER_COUNT: off-worker producers acquire-load it to +/// detect a start/stop/start cycle and self-invalidate their +/// @c rr_nonlocal TLS. The shutdown side release-stores the bumped +/// value (paired with @ref WORKER_COUNT = 0) so a producer that +/// reads the new incarnation cannot observe a freed @ref WORKERS +/// slot. Initialisation reads/writes are relaxed because they +/// happen with no concurrent producers. 
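+///
+/// Timeline this closes (illustrative; P is an off-worker producer
+/// thread that never registers as a worker):
+///   P:    dispatch -> caches rr_nonlocal into WORKERS (incarnation k)
+///   main: stop()   -> WORKER_COUNT = 0, INCARNATION = k + 1, WORKERS freed
+///   main: start()  -> fresh WORKERS, INCARNATION = k + 2
+///   P:    dispatch -> acquire-loads k + 2 != its cached k, so it
+///         re-seeds rr_nonlocal from the new array instead of touching
+///         the freed one.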
+static boc_atomic_u64_t INCARNATION = 0; + +// --------------------------------------------------------------------------- +// Per-thread state (TLS) +// --------------------------------------------------------------------------- +// +// Each scheduler-aware thread (worker sub-interpreter, or any other +// thread that calls boc_sched_dispatch from a worker context) keeps +// its dispatch state in TLS slots rather than in `boc_sched_worker_t` +// fields. The bocpy precedent: this matches `noticeboard.c`'s +// `nb_cache_*` thread-locals. Verona equivalent: the same fields +// are members of `SchedulerThread`, which is itself one-per-OS-thread +// — TLS is the same effect with one fewer indirection. +// +// All slots use the `compat.h` `thread_local` macro (`_Thread_local` +// on POSIX, `__declspec(thread)` on MSVC) with the **default** TLS +// model. + +/// @brief This thread's worker handle, or NULL if the thread has not +/// called @ref boc_sched_worker_register. +/// @details Read by the producer-locality fast path of +/// @c boc_sched_dispatch and by `boc_sched_worker_pop_*`. NULL on +/// threads that schedule from outside the worker pool (the main +/// thread); those callers take the round-robin arm. +static thread_local boc_sched_worker_t *current_worker = NULL; + +/// @brief Pending fast-slot for the producer-locality dispatch path. +/// @details Stores a @c boc_bq_node_t pointer rather than a +/// @c BOCBehavior pointer to keep this TU decoupled from the +/// @c BOCBehavior struct layout. The consumer in @c _core.c converts +/// the node back to its owning behaviour via the +/// @c BEHAVIOR_FROM_BQ_NODE container_of macro. +static thread_local boc_bq_node_t *pending = NULL; + +/// @brief Consumer-side batch countdown. +/// @details Verona `schedulerthread.h:122-138`. The @c pending fast +/// path (Verona `next_work`) is taken at most @ref BOC_BQ_BATCH_SIZE +/// times in a row; once @c batch reaches 0 the next pop forces a +/// @ref boc_bq_dequeue so a long producer-local chain cannot starve +/// queued cross-worker (or cross-arm) work indefinitely. Reset to +/// @ref BOC_BQ_BATCH_SIZE every time the queue path returns work. +/// Seeded to @ref BOC_BQ_BATCH_SIZE inside +/// @ref boc_sched_worker_register so the first pop on a freshly +/// registered thread treats @c pending as fully eligible. +static thread_local size_t batch = 0; + +/// @brief Round-robin cursor for off-worker producers; the re-seed is +/// gated on the incarnation snapshot below. +static thread_local boc_sched_worker_t *rr_nonlocal = NULL; + +/// @brief Snapshot of @ref INCARNATION at the time @c rr_nonlocal was +/// last seeded. A mismatch on the next dispatch forces a +/// re-seed (survives `start()`/`wait()`/`start()` cycles). +static thread_local size_t rr_incarnation = 0; + +/// @brief Per-worker work-stealing victim cursor. +/// @details Verona equivalent: `SchedulerThread::victim` +/// (`schedulerthread.h:60`). Walks the worker ring independently of +/// @c rr_nonlocal so a worker's victim choice does not depend on +/// off-worker dispatch ordering. Seeded to @c self->next_in_ring on +/// the first @ref boc_sched_try_steal call (lazy init keeps the +/// register path zero-cost). NULL on threads that have not +/// registered as workers — they never call @c try_steal. 
+static thread_local boc_sched_worker_t *steal_victim = NULL; + +// --------------------------------------------------------------------------- +// Worker registration counter +// --------------------------------------------------------------------------- +// +// Atomic so multiple worker threads racing to claim slots in +// `boc_sched_worker_register` do not collide. Reset to zero in +// `boc_sched_init` so re-entry (`start()`/`wait()`/`start()`) starts +// fresh at slot 0. Read with relaxed ordering — the consumers that +// care about happens-before edges (the `current_worker` TLS write +// and any subsequent dispatch) sequence themselves through +// `WORKERS[slot]` which is itself zero-initialised by `boc_sched_init` +// before this counter is reset. + +static boc_atomic_u32_t REGISTERED_COUNT = 0; + +// --------------------------------------------------------------------------- +// Park/unpark protocol epochs +// --------------------------------------------------------------------------- +// +// Port of Verona's two-epoch `pause`/`unpause` protocol +// (`verona-rt/src/rt/sched/threadpool.h:282-379`). +// +// `PAUSE_EPOCH` is bumped (seq_cst) by a parker before its +// `check_for_work` walk and `cv_mu` re-check; this is the +// "speak now" point that forces any concurrent producer into the +// CAS arm. `UNPAUSE_EPOCH` is CAS'd forward by a producer that +// observes `PAUSE_EPOCH > UNPAUSE_EPOCH`; the CAS winner takes +// responsibility for issuing one wake. `PARKED_COUNT` is a +// fast-path skip — if zero, the producer's targeted-signal arm +// does not need to consult the epochs at all. +// +// Reset to zero in `boc_sched_init`/`boc_sched_shutdown` so a fresh +// runtime cycle starts with the invariant `PAUSE_EPOCH == UNPAUSE_EPOCH` +// (no parker has spoken; producers take the fast arm). + +static boc_atomic_u64_t PAUSE_EPOCH = 0; +static boc_atomic_u64_t UNPAUSE_EPOCH = 0; +static boc_atomic_u32_t PARKED_COUNT = 0; + +// --------------------------------------------------------------------------- +// Public API +// --------------------------------------------------------------------------- + +int boc_sched_init(Py_ssize_t worker_count) { + // Defensive: refuse a leak if init is called twice without an + // intervening shutdown. + if (WORKERS != NULL) { + PyErr_SetString(PyExc_RuntimeError, + "boc_sched_init called without prior shutdown"); + return -1; + } + + if (worker_count < 0) { + PyErr_SetString(PyExc_ValueError, + "boc_sched_init: worker_count must be non-negative"); + return -1; + } + + if (worker_count > 0) { + // PyMem_RawCalloc (not PyMem_Calloc): the WORKERS array is + // process-global and is touched by every sub-interpreter worker + // thread. Since CPython 3.12 the object/Mem allocators are + // per-interpreter, so an allocation made in interpreter A would + // be invalid (and unfreeable) from interpreter B. The raw + // allocator is process-wide and GIL-independent. Zero-init gives + // every counter, every typed atomic slot (compat.h + // `boc_atomic_*_t` are layout-compatible with the underlying + // scalar; zero is the well-defined "false" / NULL / 0 state on + // every supported platform), and every reserved slot the correct + // starting value. + WORKERS = (boc_sched_worker_t *)PyMem_RawCalloc((size_t)worker_count, + sizeof(boc_sched_worker_t)); + if (WORKERS == NULL) { + PyErr_NoMemory(); + return -1; + } + + // Per-worker non-trivial initialisation: bq queue, mutex, + // condvar, owner-interp placeholder, and the ring-link. 
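+    // Ring shape for worker_count == 3 (illustrative):
+    //   W0.next_in_ring -> W1 -> W2 -> back to W0
+    // Both the steal-victim walk and unpause_all traverse this ring.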
+ // Mutex and condvar wrappers come from `compat.h` (pthread on + // POSIX, SRWLock / CONDITION_VARIABLE on MSVC). + for (Py_ssize_t i = 0; i < worker_count; ++i) { + boc_sched_worker_t *w = &WORKERS[i]; + // Initialise all N sub-queues of the WSQ. Cursors are + // zero-initialised by the parent `PyMem_RawCalloc` of the + // WORKERS array; we re-set them here to make the invariant + // explicit and survive any future move to non-zeroing + // allocators. + for (size_t j = 0; j < (size_t)BOC_WSQ_N; ++j) { + boc_bq_init(&w->q[j]); + } + w->enqueue_index.idx = 0; + w->dequeue_index.idx = 0; + w->steal_index.idx = 0; + boc_mtx_init(&w->cv_mu); + cnd_init(&w->cv); + // owner_interp_id is set when the worker calls + // `boc_sched_worker_register`. -1 means "not yet registered". + w->owner_interp_id = -1; + // Ring-link: i -> i+1, last wraps to 0. Immutable after this + // point. + w->next_in_ring = &WORKERS[(i + 1) % worker_count]; + // Verona `core.h:23`: `should_steal_for_fairness{true}` — every + // freshly-constructed Core starts with the flag set, so the + // first `get_work` call on each worker takes the fairness arm + // (which is what enqueues the token into the queue for the + // first time; nothing else seeds it). Release-store so a + // worker thread that subsequently reads it under acquire sees + // the initialised value. + boc_atomic_store_bool_explicit(&w->should_steal_for_fairness, true, + BOC_MO_RELEASE); + } + } + + // Initial publish of WORKER_COUNT and INCARNATION. On the GIL + // build no concurrent producers can exist at this point (workers + // have not been spawned yet, and `start()` is single-threaded + // under the GIL), so plain stores would suffice. On the + // free-threaded build (PEP 703) an off-worker producer surviving + // a prior stop()/start() cycle can ACQUIRE-load WORKER_COUNT in + // `boc_sched_dispatch` and see the new non-zero value here. RELAXED + // stores would only synchronise that ACQUIRE with the previous + // shutdown's WORKER_COUNT = 0 RELEASE, leaving no happens-before + // edge with the per-slot `boc_bq_init` / `boc_mtx_init` writes + // above -- the producer could legally read `wc > 0` and then + // dereference a `WORKERS[i]` whose mutex is still in pre-init + // bytewise state. The same hazard applies to the INCARNATION + // re-seed: a producer ACQUIRE-loading the new incarnation must + // observe the new WORKERS pointer, not whatever was cached. Use + // RELEASE so init and shutdown publish-pair symmetrically with + // the dispatch-side ACQUIRE on every cycle. + boc_atomic_store_u64_explicit(&WORKER_COUNT, (uint64_t)worker_count, + BOC_MO_RELEASE); + boc_atomic_store_u64_explicit( + &INCARNATION, + boc_atomic_load_u64_explicit(&INCARNATION, BOC_MO_RELAXED) + 1, + BOC_MO_RELEASE); + // Re-entry safety: every start cycle starts slot allocation at 0. + // Done after WORKER_COUNT/WORKERS are valid so a racing register() + // (none expected at this point because workers have not been + // spawned yet, but defensively correct) sees a consistent state. + boc_atomic_store_u32_explicit(®ISTERED_COUNT, 0, BOC_MO_RELAXED); + // Park/unpark protocol epochs: a fresh runtime cycle starts with + // the invariant PAUSE_EPOCH == UNPAUSE_EPOCH (no parker has spoken). 
+ boc_atomic_store_u64_explicit(&PAUSE_EPOCH, 0, BOC_MO_RELAXED); + boc_atomic_store_u64_explicit(&UNPAUSE_EPOCH, 0, BOC_MO_RELAXED); + boc_atomic_store_u32_explicit(&PARKED_COUNT, 0, BOC_MO_RELAXED); + return 0; +} + +void boc_sched_shutdown(void) { + // Order matters for the off-worker dispatch race. + // Off-worker producers in `boc_sched_dispatch` acquire-load + // WORKER_COUNT and treat 0 as the runtime-down sentinel. We must + // therefore publish WORKER_COUNT = 0 (and bump INCARNATION to + // self-invalidate any cached `rr_nonlocal` TLS in off-worker + // threads) BEFORE freeing the WORKERS array, otherwise a racing + // dispatch could dereference a freed slot. + Py_ssize_t old_count = + (Py_ssize_t)boc_atomic_load_u64_explicit(&WORKER_COUNT, BOC_MO_RELAXED); + // Release-store: pairs with the acquire-load in the off-worker + // arm of `boc_sched_dispatch`. A producer that observes + // WORKER_COUNT == 0 must NOT then observe a freed WORKERS slot; + // RELEASE here + ACQUIRE there gives that happens-before edge + // without an explicit `atomic_thread_fence`. + boc_atomic_store_u64_explicit(&WORKER_COUNT, 0, BOC_MO_RELEASE); + // Bump the incarnation so any thread-local `rr_nonlocal` cached + // by off-worker producers becomes self-invalidating; pairs with + // the acquire-load in `boc_sched_dispatch`. Doing this here (in + // addition to `boc_sched_init`) closes the start/stop/start + // window where a producer's TLS still holds the prior + // incarnation's worker pointer. RELEASE-store mirrors the + // WORKER_COUNT = 0 store above. + boc_atomic_store_u64_explicit( + &INCARNATION, + boc_atomic_load_u64_explicit(&INCARNATION, BOC_MO_RELAXED) + 1, + BOC_MO_RELEASE); + // No standalone fence needed: the RELEASE stores above already + // establish the happens-before edge with the dispatch-side + // ACQUIRE loads. Pairs with the acquire-load in the dispatch + // path. + if (WORKERS != NULL) { + // Per-worker teardown in reverse order. The bq must be empty at + // this point — `boc_bq_destroy_assert_empty` aborts if not. + for (Py_ssize_t i = old_count - 1; i >= 0; --i) { + boc_sched_worker_t *w = &WORKERS[i]; + // Tear down all N sub-queues; each must be empty. + for (size_t j = 0; j < (size_t)BOC_WSQ_N; ++j) { + boc_bq_destroy_assert_empty(&w->q[j]); + } + cnd_destroy(&w->cv); + mtx_destroy(&w->cv_mu); + } + PyMem_RawFree(WORKERS); + WORKERS = NULL; + } + // Reset the registration counter so external observers see a + // clean post-stop state. Symmetric with the reset in + // `boc_sched_init`. + boc_atomic_store_u32_explicit(®ISTERED_COUNT, 0, BOC_MO_RELAXED); +} + +Py_ssize_t boc_sched_worker_count(void) { + return (Py_ssize_t)boc_atomic_load_u64_explicit(&WORKER_COUNT, + BOC_MO_RELAXED); +} + +boc_sched_worker_t *boc_sched_worker_at(Py_ssize_t worker_index) { + Py_ssize_t wc = + (Py_ssize_t)boc_atomic_load_u64_explicit(&WORKER_COUNT, BOC_MO_RELAXED); + if (worker_index < 0 || worker_index >= wc) { + return NULL; + } + return &WORKERS[worker_index]; +} + +int boc_sched_stats_snapshot(Py_ssize_t worker_index, boc_sched_stats_t *out) { + if (out == NULL) { + return -1; + } + Py_ssize_t wc = + (Py_ssize_t)boc_atomic_load_u64_explicit(&WORKER_COUNT, BOC_MO_RELAXED); + if (worker_index < 0 || worker_index >= wc) { + return -1; + } + // Best-effort relaxed snapshot. Each field is read independently; + // the snapshot may observe individual counter values from + // different points in time. 
Counters are monotonic, so a torn + // read between fields can only under-report -- never produce a + // value greater than the true count. + const boc_sched_stats_atomic_t *src = &WORKERS[worker_index].stats; + out->pushed_local = boc_atomic_load_u64_explicit( + (boc_atomic_u64_t *)&src->pushed_local, BOC_MO_RELAXED); + out->dispatched_to_pending = boc_atomic_load_u64_explicit( + (boc_atomic_u64_t *)&src->dispatched_to_pending, BOC_MO_RELAXED); + out->pushed_remote = boc_atomic_load_u64_explicit( + (boc_atomic_u64_t *)&src->pushed_remote, BOC_MO_RELAXED); + out->popped_local = boc_atomic_load_u64_explicit( + (boc_atomic_u64_t *)&src->popped_local, BOC_MO_RELAXED); + out->popped_via_steal = boc_atomic_load_u64_explicit( + (boc_atomic_u64_t *)&src->popped_via_steal, BOC_MO_RELAXED); + out->enqueue_cas_retries = boc_atomic_load_u64_explicit( + (boc_atomic_u64_t *)&src->enqueue_cas_retries, BOC_MO_RELAXED); + out->dequeue_cas_retries = boc_atomic_load_u64_explicit( + (boc_atomic_u64_t *)&src->dequeue_cas_retries, BOC_MO_RELAXED); + out->batch_resets = boc_atomic_load_u64_explicit( + (boc_atomic_u64_t *)&src->batch_resets, BOC_MO_RELAXED); + out->steal_attempts = boc_atomic_load_u64_explicit( + (boc_atomic_u64_t *)&src->steal_attempts, BOC_MO_RELAXED); + out->steal_failures = boc_atomic_load_u64_explicit( + (boc_atomic_u64_t *)&src->steal_failures, BOC_MO_RELAXED); + out->parked = boc_atomic_load_u64_explicit((boc_atomic_u64_t *)&src->parked, + BOC_MO_RELAXED); + out->last_steal_attempt_ns = boc_atomic_load_u64_explicit( + (boc_atomic_u64_t *)&src->last_steal_attempt_ns, BOC_MO_RELAXED); + out->fairness_arm_fires = boc_atomic_load_u64_explicit( + (boc_atomic_u64_t *)&src->fairness_arm_fires, BOC_MO_RELAXED); + return 0; +} + +size_t boc_sched_incarnation_get(void) { + return (size_t)boc_atomic_load_u64_explicit(&INCARNATION, BOC_MO_RELAXED); +} + +// --------------------------------------------------------------------------- +// Per-worker registration +// --------------------------------------------------------------------------- + +Py_ssize_t boc_sched_worker_register(void) { + // Allocate the next slot. Returns the *previous* value, so the + // first caller gets 0. Relaxed is fine: the only writer this races + // with is itself; downstream consumers reach the slot through a + // subsequent TLS write or through `boc_sched_stats_snapshot`, both + // of which are sequenced after this call returns. + uint32_t slot = + boc_atomic_fetch_add_u32_explicit(®ISTERED_COUNT, 1, BOC_MO_RELAXED); + Py_ssize_t wc = + (Py_ssize_t)boc_atomic_load_u64_explicit(&WORKER_COUNT, BOC_MO_RELAXED); + if ((Py_ssize_t)slot >= wc) { + // Over-registration: roll back the counter so a subsequent + // (legitimate) registration would still succeed if a slot frees. + // Keeps the `registered_count == worker_count` invariant clean + // after a successful run. + boc_atomic_fetch_sub_u32_explicit(®ISTERED_COUNT, 1, BOC_MO_RELAXED); + return -1; + } + + // Stamp the slot's owner-witness with the calling sub-interpreter + // id. This is a debug aid and the wrong-thread assert hook; + // nothing reads it on a hot path. + PyInterpreterState *interp = PyInterpreterState_Get(); + WORKERS[slot].owner_interp_id = (Py_ssize_t)PyInterpreterState_GetID(interp); + + // Install the TLS handle. From here on, any dispatch / pop call on + // this thread finds its worker in O(1) without consulting the + // WORKERS array. 
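+  // Intended consumption shape after registration (an illustrative
+  // sketch, not the verbatim worker loop in _core.c):
+  //   boc_sched_worker_register();
+  //   boc_sched_worker_t *w = boc_sched_current_worker();
+  //   for (;;) {
+  //     boc_bq_node_t *n = boc_sched_worker_pop_fast(w);
+  //     if (n == NULL) n = boc_sched_worker_pop_slow(w);
+  //     if (n == NULL) break;  // stop_requested observed in pop_slow
+  //     /* run BEHAVIOR_FROM_BQ_NODE(n) */
+  //   }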
+ current_worker = &WORKERS[slot]; + + // Seed the consumer-side batch budget so the first pop on this + // thread can take pending without first draining the queue. The + // zero default would otherwise mis-classify the first pop as + // batch-exhausted and break Verona's `next_work` priority. + batch = BOC_BQ_BATCH_SIZE; + // Clear the steal victim cursor: it is lazy-initialised on the + // first try_steal call. A stale TLS pointer from a previous + // start cycle would point into a freed worker array. + steal_victim = NULL; + return (Py_ssize_t)slot; +} + +boc_sched_worker_t *boc_sched_current_worker(void) { return current_worker; } + +// --------------------------------------------------------------------------- +// Park/unpark protocol implementation +// --------------------------------------------------------------------------- + +// Forward declaration: the slow steal helper is defined further down +// (with `try_steal` and the quiescence-window machinery). `pop_slow` +// calls it between the local-queue dequeue and the park, matching +// Verona's `get_work` ordering (`schedulerthread.h:122-167`). +static boc_bq_node_t *boc_sched_steal(boc_sched_worker_t *self); + +void boc_sched_signal_one(boc_sched_worker_t *target) { + if (target == NULL) { + return; + } + // Lock-then-signal: under cv_mu we serialise against the parker's + // epoch re-check. If the parker is between its re-check and the + // `parked = true` store, our signal would otherwise be lost; the + // mutex acquisition forces us to wait until either the parker has + // committed to sleep (and `cnd_signal` will wake it) or has bailed + // out (and our signal is harmless). + mtx_lock(&target->cv_mu); + cnd_signal(&target->cv); + mtx_unlock(&target->cv_mu); +} + +void boc_sched_unpause_all(boc_sched_worker_t *self) { + Py_ssize_t wc = + (Py_ssize_t)boc_atomic_load_u64_explicit(&WORKER_COUNT, BOC_MO_RELAXED); + if (self == NULL || wc == 0) { + return; + } + // Cheap early-out: if no worker is parked, the walk would do + // WORKER_COUNT acquire-loads for nothing. The relaxed load is + // sufficient because a producer that observed PARKED_COUNT == 0 + // and a parker that subsequently parks would, on the next + // producer's CAS-arm entry, re-publish (pe != ue forces another + // CAS and wake attempt). The protocol explicitly tolerates a + // stale zero here. + if (boc_atomic_load_u32_explicit(&PARKED_COUNT, BOC_MO_RELAXED) == 0) { + return; + } + // Broadcast wake: walk the entire ring starting from + // self->next_in_ring and signal every parked worker. Mirrors + // Verona's ThreadSync::unpause_all (threadsync.h:108-128, + // threadpool.h:367-373). Without the broadcast, a burst of + // producer publishes that all CAS-lose against a single winner + // would leave N-1 parkers asleep until they each happen to be + // signal-targeted by some later off-worker dispatch. + boc_sched_worker_t *w = self->next_in_ring; + for (Py_ssize_t i = 0; i < wc; ++i) { + if (boc_atomic_load_bool_explicit(&w->parked, BOC_MO_ACQUIRE)) { + boc_sched_signal_one(w); + } + w = w->next_in_ring; + } +} + +void boc_sched_worker_request_stop_all(void) { + if (WORKERS == NULL) { + return; + } + Py_ssize_t wc = + (Py_ssize_t)boc_atomic_load_u64_explicit(&WORKER_COUNT, BOC_MO_RELAXED); + // Phase 1: set stop_requested on every worker (release store so a + // worker waking from cnd_wait observes the flag with acquire). 
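+  // Worker side: pop_slow re-checks stop_requested (acquire) at the
+  // top of every iteration, including the one entered after waking
+  // from cnd_wait, so a worker signalled in phase 2 below observes
+  // this phase 1 store and returns NULL instead of re-parking.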
+ for (Py_ssize_t i = 0; i < wc; ++i) { + boc_atomic_store_bool_explicit(&WORKERS[i].stop_requested, true, + BOC_MO_RELEASE); + } + // Phase 2: signal every worker's condvar under its mutex. We use + // signal-per-worker rather than broadcast on a global condvar + // because the bocpy precedent is per-queue waiters; each worker + // has its own cv. The mutex acquisition serialises against any + // parker that is between its epoch re-check and the cnd_wait call. + for (Py_ssize_t i = 0; i < wc; ++i) { + boc_sched_signal_one(&WORKERS[i]); + } +} + +boc_bq_node_t *boc_sched_worker_pop_slow(boc_sched_worker_t *self) { + // stop_requested is checked at the top of every loop iteration, + // BEFORE any pause_epoch bump, so a worker exiting on shutdown + // does not advance pause_epoch past unpause_epoch. + for (;;) { + if (boc_atomic_load_bool_explicit(&self->stop_requested, BOC_MO_ACQUIRE)) { + return NULL; + } + + // ----- Steal-for-fairness arm ----- + // + // Verona `schedulerthread.h::get_work:143-162`. When the + // fairness flag is set AND the local queue has at least one + // visible item, attempt one steal pass *before* draining the + // local queue. If the steal succeeds we still re-enqueue the + // token and return the stolen item; if it fails we fall through + // to the local dequeue. The flag is cleared *before* the token + // re-enqueue (Verona note: "Set the flag before rescheduling + // the token so that we don't have a race"). The token itself is + // installed by `_core_scheduler_runtime_start` and is never + // freed by this path; re-enqueue is a node operation only. + // + // Runs BEFORE the defensive `pending` check so the + // batch==0-forced-queue-drain fall-through from `pop_fast` + // (which leaves `pending` set when the gate trips) still pays + // the fairness tax. + // + // **WSQ cadence sensitivity.** The token is re-enqueued via + // `boc_wsq_enqueue` below, which pushes round-robin via + // `enqueue_index` and so rotates the token across the worker's + // `BOC_WSQ_N` sub-queues over time. Owner-side + // `boc_wsq_dequeue` scans sub-queues in `dequeue_index` order, + // so the token's consumption rate (and therefore the + // fairness-arm cadence) is proportional to the cursor + // desynchronisation between `enqueue_index` and `dequeue_index` + // rather than to absolute local work. This matches verona's + // design (verona's `Core` carries the same `WrapIndex` + // cursors and re-enqueues its fairness token via `enqueue`); a + // regression that pinned the token to one sub-queue would shift + // `fairness_arm_fires` by a factor of `BOC_WSQ_N` without any + // test failure today. + if (boc_atomic_load_bool_explicit(&self->should_steal_for_fairness, + BOC_MO_ACQUIRE) && + !boc_wsq_is_empty(self)) { + boc_atomic_fetch_add_u64_explicit(&self->stats.fairness_arm_fires, 1, + BOC_MO_RELAXED); + boc_bq_node_t *stolen = boc_sched_steal(self); + boc_atomic_store_bool_explicit(&self->should_steal_for_fairness, false, + BOC_MO_RELEASE); + boc_bq_node_t *tok = (boc_bq_node_t *)boc_atomic_load_ptr_explicit( + &self->token_work, BOC_MO_ACQUIRE); + if (tok != NULL) { + boc_wsq_enqueue(self, tok); + } + if (stolen != NULL) { + return stolen; + } + } + + // Defensive: under normal flow `pop_fast` exhausts pending + // before falling through to `pop_slow`, but a future caller may + // enter slow-path directly (e.g. test harness). Honour pending + // first so we never park while an unconsumed thread-local item + // is sitting on this thread. 
+ if (pending != NULL) { + boc_bq_node_t *n = pending; + pending = NULL; + return n; + } + + // ----- Local-queue dequeue ----- + // + // Verona `get_work:165`. With the fairness arm cleared (or + // skipped) this is the primary work source. + boc_bq_node_t *n = boc_wsq_dequeue(self); + if (n != NULL) { + return n; + } + + // ----- Empty-queue steal arm ----- + // + // Verona `get_work:171-178`: an empty local queue is treated + // "like receiving a token" — try a steal directly. bocpy bundles + // the multi-victim ring + quiescence-window backoff into + // `boc_sched_steal`; if it returns non-NULL we have a stolen + // node (and the splice contract has already moved any remainder + // onto self->q). Returning NULL is the signal to commit to the + // park below. + n = boc_sched_steal(self); + if (n != NULL) { + return n; + } + + // ----- Park-attempt ----- + // + // Snapshot UNPAUSE_EPOCH BEFORE bumping PAUSE_EPOCH (mirrors + // Verona `threadpool.h::pause:283-285`). The pre-bump snapshot + // closes a lost-wakeup race: a producer that publishes between + // our bump and the snapshot would otherwise advance UNPAUSE_EPOCH + // to the new pause_epoch, but our (post-bump) snapshot would + // already see the advanced value, causing the cv_mu re-check + // below to compare equal and park anyway, consuming the wake. + // With the pre-bump snapshot, the producer's CAS must advance + // past `ue_snap`, and the re-check observes the inequality and + // bails out of the park. Relaxed is sufficient because the + // seq_cst fetch_add on PAUSE_EPOCH that follows provides the + // total order with the producer's load of PAUSE_EPOCH. + uint64_t ue_snap = + boc_atomic_load_u64_explicit(&UNPAUSE_EPOCH, BOC_MO_RELAXED); + + // Bump PAUSE_EPOCH so any concurrent producer sees pe != ue and + // is forced into the CAS arm. seq_cst is required: the increment + // must totally-order with the producer's load-acquire of + // PAUSE_EPOCH. + boc_atomic_fetch_add_u64_explicit(&PAUSE_EPOCH, 1, BOC_MO_SEQ_CST); + + // check_for_work: walks ALL workers via + // `boc_sched_any_work_visible()`. Cheap: one acquire-load per + // queue, no global lock. A parker that observes work anywhere + // in the ring re-loops and either dequeues locally or steals. +#if BOC_HAVE_TRY_STEAL + if (boc_sched_any_work_visible()) { + continue; + } +#else + if (!boc_wsq_is_empty(self)) { + continue; + } +#endif + + // Final epoch re-check under cv_mu. Drops the GIL across the + // wait so other Python work can proceed. terminator_count is + // NOT consulted here — quiescence is transient; only + // stop_requested causes exit. + Py_BEGIN_ALLOW_THREADS mtx_lock(&self->cv_mu); + if (boc_atomic_load_bool_explicit(&self->stop_requested, BOC_MO_ACQUIRE)) { + mtx_unlock(&self->cv_mu); + } else if (boc_atomic_load_u64_explicit(&UNPAUSE_EPOCH, BOC_MO_ACQUIRE) != + ue_snap) { + // A producer caught up between our epoch bump and the lock; + // skip the wait and re-loop. + mtx_unlock(&self->cv_mu); + } else { + // Bump the cumulative `parked` counter before the actual + // wait so a snapshot from another thread sees the entry + // even if the wait blocks indefinitely. Live PARKED_COUNT + // tracks current depth; stats.parked tracks total entries. 
+ boc_atomic_fetch_add_u64_explicit(&self->stats.parked, 1, BOC_MO_RELAXED); + boc_atomic_store_bool_explicit(&self->parked, true, BOC_MO_RELEASE); + boc_atomic_fetch_add_u32_explicit(&PARKED_COUNT, 1, BOC_MO_ACQ_REL); + cnd_wait(&self->cv, &self->cv_mu); + boc_atomic_fetch_sub_u32_explicit(&PARKED_COUNT, 1, BOC_MO_ACQ_REL); + boc_atomic_store_bool_explicit(&self->parked, false, BOC_MO_RELEASE); + mtx_unlock(&self->cv_mu); + } + Py_END_ALLOW_THREADS + } +} + +// --------------------------------------------------------------------------- +// Dispatch + fast-path pop +// --------------------------------------------------------------------------- + +boc_bq_node_t *boc_sched_worker_pop_fast(boc_sched_worker_t *self) { + if (self == NULL) { + return NULL; + } + + // BATCH_SIZE fairness: take pending only while batch > 0. When + // batch hits 0, fall through to the queue so a producer-local + // chain (which evicts every prior pending into the queue) cannot + // run newest-first forever and starve the older queued items. + // Verona `schedulerthread.h:122-138`. + if (pending != NULL && batch > 0) { + boc_bq_node_t *n = pending; + pending = NULL; + batch--; + boc_atomic_fetch_add_u64_explicit(&self->stats.popped_local, 1, + BOC_MO_RELAXED); + return n; + } + + // ----- Steal-for-fairness gate (Verona schedulerthread.h:143) ----- + // + // Verona's `get_work` runs the fairness arm AFTER consuming + // `next_work` (≈ `pending`) but BEFORE draining the local queue. + // We mirror that order here: a busy worker steadily draining its + // own queue still pays the per-token-period fairness tax, by + // routing through `pop_slow` (which owns the arm body — + // re-enqueue token, attempt steal, clear flag). + // + // Returning NULL here costs the caller one extra function-call + // (`pop_slow`) per fairness period; the arm itself has the same + // cost it has always had. + if (boc_atomic_load_bool_explicit(&self->should_steal_for_fairness, + BOC_MO_ACQUIRE) && + !boc_wsq_is_empty(self)) { + return NULL; + } + + boc_bq_node_t *n = boc_wsq_dequeue(self); + if (n != NULL) { + // Any successful queue dequeue resets the budget; if pending was + // bypassed because batch had hit 0, count this as a batch_reset + // for the fairness exit-criterion test. (A first-time pop with + // an empty pending also resets the budget but does not bump the + // counter — there was no fast path to bypass.) + if (pending != NULL) { + boc_atomic_fetch_add_u64_explicit(&self->stats.batch_resets, 1, + BOC_MO_RELAXED); + } + batch = BOC_BQ_BATCH_SIZE; + boc_atomic_fetch_add_u64_explicit(&self->stats.popped_local, 1, + BOC_MO_RELAXED); + return n; + } + + // Queue is empty. If pending is set we exhausted the batch budget + // but have nothing else to fall back on — take pending and reset. + // Without this branch a single-worker chain would loop into + // pop_slow and park the worker against its own pending item. + if (pending != NULL) { + boc_bq_node_t *p = pending; + pending = NULL; + batch = BOC_BQ_BATCH_SIZE; + boc_atomic_fetch_add_u64_explicit(&self->stats.popped_local, 1, + BOC_MO_RELAXED); + return p; + } + + return NULL; +} + +int boc_sched_dispatch(boc_bq_node_t *n) { + boc_sched_worker_t *self = current_worker; + boc_sched_worker_t *target; + + if (self != NULL) { + // Producer-local arm (Verona schedule_fifo). + // Always evict the prior `pending` to the local queue and + // install `n` as the new pending. 
The eviction (not the install) + // is what bumps `pushed_local`: the queue push is the externally + // visible event for stats purposes; replacing pending with no + // prior occupant is a free local handoff that costs nothing + // measurable. + if (pending != NULL) { + boc_wsq_enqueue(self, pending); + boc_atomic_fetch_add_u64_explicit(&self->stats.pushed_local, 1, + BOC_MO_RELAXED); + } else { + // Producer-locality bypass: dispatch into an empty `pending` + // slot. No queue push, no atomic queue-side state mutation, + // but bump `dispatched_to_pending` so the dispatched-work + // total remains globally reconcilable as + // `Σ pushed_local + Σ dispatched_to_pending + Σ pushed_remote + // == Σ popped_local + Σ popped_via_steal`. Without this + // bump the queue's `pushed_local` underreports total + // dispatched work whenever steady-state pop-then-dispatch + // keeps `pending` empty most cycles. + boc_atomic_fetch_add_u64_explicit(&self->stats.dispatched_to_pending, 1, + BOC_MO_RELAXED); + } + pending = n; + target = self; + } else { + // Off-worker arm: round-robin over the worker ring. + // + // Acquire-load WORKER_COUNT and INCARNATION so we observe the + // RELEASE-stores from `boc_sched_shutdown` BEFORE we could + // observe a freed WORKERS[] slot. Without this acquire, an + // off-worker producer running concurrently with shutdown + // could read a stale WORKER_COUNT > 0 and dereference + // WORKERS[0] after it had been freed. + Py_ssize_t wc = + (Py_ssize_t)boc_atomic_load_u64_explicit(&WORKER_COUNT, BOC_MO_ACQUIRE); + // Re-seed `rr_nonlocal` whenever the scheduler incarnation + // changes so a `start()`/`wait()`/`start()` cycle with a + // different worker count cannot land on a stale pointer. + size_t inc_now = + (size_t)boc_atomic_load_u64_explicit(&INCARNATION, BOC_MO_ACQUIRE); + // Check WORKER_COUNT FIRST so the runtime-down sentinel is + // honoured even when the cached `rr_nonlocal` is non-NULL but + // points into the prior incarnation's freed array (the + // shutdown-then-restart-with-different-count race). + if (wc == 0) { + // No runtime up — surface as a Python exception. Prior + // behaviour was a silent drop, which left whencall's + // `terminator_inc` un-rolled-back: the next `wait()` would + // hang because the caller's hold was never released. The + // caller (`whencall` in `behaviors.py`) catches this and + // calls `terminator_dec` to roll back its hold. + PyErr_SetString( + PyExc_RuntimeError, + "cannot schedule behavior: bocpy runtime is not running. " + "Call bocpy.start() before scheduling, or avoid scheduling " + "after wait() / stop() has shut the runtime down."); + return -1; + } + if (rr_nonlocal == NULL || rr_incarnation != inc_now) { + rr_nonlocal = &WORKERS[0]; + rr_incarnation = inc_now; + } + target = rr_nonlocal; + boc_wsq_enqueue(target, n); + boc_atomic_fetch_add_u64_explicit(&target->stats.pushed_remote, 1, + BOC_MO_RELAXED); + rr_nonlocal = rr_nonlocal->next_in_ring; + } + + // ---- Slow arm: pause/unpause-aware wake ----------------------------- + // + // Producer half of the parking protocol. Loaded with acquire so + // the parker's seq_cst PAUSE_EPOCH bump is observed in order. If + // pe == ue the fast path is taken (no parker is racing); otherwise + // CAS UNPAUSE_EPOCH forward and, on CAS-win, broadcast-wake every + // parked peer. 
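+  //
+  // Worked example (illustrative values): pe == ue == 4 means no
+  // parker has bumped since the last unpause, so the arm is skipped.
+  // pe == 5, ue == 4 means at least one parker is mid-park; CAS
+  // ue 4 -> 5 and broadcast. A lost CAS means a racing producer
+  // already advanced UNPAUSE_EPOCH past our snapshot; its broadcast
+  // covers our publish as well.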
+  uint64_t pe = boc_atomic_load_u64_explicit(&PAUSE_EPOCH, BOC_MO_ACQUIRE);
+  uint64_t ue = boc_atomic_load_u64_explicit(&UNPAUSE_EPOCH, BOC_MO_ACQUIRE);
+  if (pe != ue) {
+    if (boc_atomic_compare_exchange_strong_u64_explicit(
+            &UNPAUSE_EPOCH, &ue, pe, BOC_MO_ACQ_REL, BOC_MO_ACQUIRE)) {
+      // Walk from `target` so the wake prefers a peer rather than
+      // the worker we just published to (which is either us or the
+      // round-robin target; in both cases that worker is awake or
+      // about to be signalled). For off-worker dispatch `self` is
+      // NULL so we pass `target` directly; for producer-local we
+      // pass `self`.
+      boc_sched_unpause_all(self != NULL ? self : target);
+    }
+  }
+
+  // Targeted wake when crossing to a different worker. Producer-
+  // local dispatch (target == self) skips this: the producer thread
+  // is the worker that will run the work, so it cannot be parked.
+  if (self == NULL || target != self) {
+    boc_sched_signal_one(target);
+  }
+
+  return 0;
+}
+
+// ---------------------------------------------------------------------------
+// Work stealing (`try_steal`)
+// ---------------------------------------------------------------------------
+//
+// Port of the work-stealing primitive from
+// `verona-rt/src/rt/sched/schedulerthread.h::try_steal` plus the
+// underlying queue-level steal at
+// `verona-rt/src/rt/sched/workstealingqueue.h::steal`. Each worker
+// owns a `boc_bq_t q[BOC_WSQ_N]` sub-queue array; this thief reads
+// the victim's sub-queue indexed by `self->steal_index` (verona's
+// `this->steal_index`) and `enqueue_spread`s the remainder across
+// its own N sub-queues to dilute thief-vs-thief contention on
+// subsequent steals.
+//
+// `boc_sched_try_steal` is the **single-victim** fast attempt: at
+// most one `dequeue_all` call against `victim->q[steal_index]`,
+// then the per-thread victim cursor advances unconditionally so the
+// next attempt visits a different victim regardless of outcome. The
+// slow multi-victim loop with quiescence timeout (Verona's
+// `steal()`) follows.
+
+/// @brief Single-victim work-stealing attempt for @p self.
+/// @details Reads the per-thread @c steal_victim cursor (lazy-
+/// initialised to @c self->next_in_ring), tries to steal one node
+/// from the victim's WSQ sub-queue selected by @c self->steal_index,
+/// advances the victim cursor, and returns the stolen node (or NULL
+/// on miss). Verona equivalent: `SchedulerThread::try_steal`
+/// (`schedulerthread.h:237-254`) calling
+/// `WorkStealingQueue::steal` (`workstealingqueue.h:103-114`).
+///
+/// **Steal-cursor advance.** Verona only advances `steal_index` on
+/// the self-victim case (`if (&victim == this) { ++steal_index;
+/// return nullptr; }`); successful steals from non-self victims
+/// keep the cursor — the next attempt naturally picks a different
+/// victim's *same* sub-queue index, which is the spread the design
+/// relies on (in concert with @ref boc_wsq_enqueue_spread on the
+/// thief side).
+///
+/// **Splice contract.** `boc_bq_dequeue_all` returns a segment of
+/// every node visible at the call (modulo concurrent enqueuers
+/// mid-link). After taking the head we splice the remainder via
+/// @ref boc_wsq_enqueue_spread so the work is reachable from all
+/// of @p self's sub-queues — diluting collisions when more thieves
+/// subsequently attempt to steal from @p self.
+///
+/// **No-op for self-victim.** A single-worker runtime has
+/// `self->next_in_ring == self`. Per verona we advance
+/// @c steal_index and return NULL.
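+///
+/// Caller sketch (illustrative; the in-tree caller is the slow
+/// steal loop below, which bounds the ring walk at
+/// `WORKER_COUNT - 1` visits):
+/// @code
+/// boc_bq_node_t *n = NULL;
+/// for (Py_ssize_t i = 0; i < wc - 1 && n == NULL; ++i) {
+///   n = boc_sched_try_steal(self);
+/// }
+/// @endcode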
+///
+/// @param self Calling worker (must be non-NULL; caller guarantees).
+/// @return Stolen node, or NULL if (a) the victim was self,
+///         (b) the victim's sub-queue was empty, or (c) the steal
+///         spuriously failed (link not yet visible). The caller
+///         decides whether to retry against the next victim or
+///         park.
+
+static boc_bq_node_t *boc_sched_try_steal(boc_sched_worker_t *self) {
+  // Lazy-init the cursor on first use. WORKER_COUNT == 0 cannot
+  // happen here because every caller has a registered self handle.
+  if (steal_victim == NULL) {
+    steal_victim = self->next_in_ring;
+  }
+
+  boc_sched_worker_t *victim = steal_victim;
+  // Advance the victim cursor unconditionally. Verona does this
+  // after the steal call (whether the call returned work or not);
+  // placing the store before the work-doing code keeps the function
+  // tail-clean (no bookkeeping on the success path).
+  steal_victim = steal_victim->next_in_ring;
+
+  // Stamp the monotonic timestamp before any other bookkeeping so
+  // a snapshot taken concurrently observes the entry even if the
+  // call returns NULL early (self-victim, empty victim, etc.).
+  // Relaxed is fine: the field is diagnostic; readers tolerate a
+  // stale value between this store and the snapshot's load.
+  boc_atomic_store_u64_explicit(&self->stats.last_steal_attempt_ns,
+                                boc_now_ns(), BOC_MO_RELAXED);
+
+  boc_atomic_fetch_add_u64_explicit(&self->stats.steal_attempts, 1,
+                                    BOC_MO_RELAXED);
+
+  // Don't steal from yourself (Verona `WorkStealingQueue::steal`
+  // self-check: `if (&victim == this) { ++steal_index; return
+  // nullptr; }`). Counts as a failure for diagnostic purposes — a
+  // single-worker runtime will see steal_failures == steal_attempts
+  // which is the expected steady state.
+  if (victim == self) {
+    boc_wsq_pre_inc(&self->steal_index);
+    boc_atomic_fetch_add_u64_explicit(&self->stats.steal_failures, 1,
+                                      BOC_MO_RELAXED);
+    return NULL;
+  }
+
+  // Pick the victim's sub-queue indexed by *this thief's*
+  // steal_index (verona: `victim.queues[steal_index]`, where the
+  // index belongs to the calling WSQ — the thief). The cursor is
+  // touched only by `self`, so no atomic is needed.
+  size_t vidx = self->steal_index.idx;
+  boc_bq_segment_t seg = boc_bq_dequeue_all(&victim->q[vidx]);
+
+  // Try to take the head off the segment.
+  boc_bq_node_t *r = boc_bq_segment_take_one(&seg);
+  if (r == NULL) {
+    // take_one returns NULL for three reasons (mpmcq.h:67-89):
+    //   1. fully empty segment (start == NULL, end == NULL),
+    //   2. single-element segment (end == &start->next_in_queue),
+    //   3. first link in segment not yet visible (start != NULL,
+    //      next_in_queue still NULL).
+    //
+    // Case 1: nothing to steal — return NULL. Verona's
+    // `WorkStealingQueue::steal` `if (ls.end == nullptr) return
+    // nullptr;`.
+    if (seg.end == NULL) {
+      boc_atomic_fetch_add_u64_explicit(&self->stats.steal_failures, 1,
+                                        BOC_MO_RELAXED);
+      return NULL;
+    }
+    // Case 2: the segment IS our stolen node — verona returns
+    // `ls.start` directly without spreading anything (there is no
+    // remainder). `workstealingqueue.h:107-108`.
+    if (seg.start != NULL && seg.end == &seg.start->next_in_queue) {
+      r = seg.start;
+      boc_atomic_fetch_add_u64_explicit(&self->stats.popped_via_steal, 1,
+                                        BOC_MO_RELAXED);
+      return r;
+    }
+    // Case 3: take_one observed start != NULL but start->next not
+    // yet visible (the producer has done `back.exchange` but not
+    // yet published the next pointer).
The segment is "owned" by + // us (acquire_front succeeded inside dequeue_all) and we + // cannot safely splice it back into the victim mid-link. + // + // Verona faithful: `WorkStealingQueue::steal` falls through to + // `enqueue_spread(ls); return r;` here, with `r == nullptr`. + // We do the same — spread the partial segment onto our own + // sub-queues and return NULL so the caller re-loops to its own + // dequeue. + boc_wsq_enqueue_spread(self, seg); + boc_atomic_fetch_add_u64_explicit(&self->stats.steal_failures, 1, + BOC_MO_RELAXED); + return NULL; + } + + // Common case: head taken; spread the rest across self's N + // sub-queues so subsequent thieves stealing from self see N + // independent targets instead of one. Verona: + // `enqueue_spread(ls); return r;`. + boc_wsq_enqueue_spread(self, seg); + boc_atomic_fetch_add_u64_explicit(&self->stats.popped_via_steal, 1, + BOC_MO_RELAXED); + return r; +} + +// --------------------------------------------------------------------------- +// Slow steal loop +// --------------------------------------------------------------------------- +// +// Port of `verona-rt/src/rt/sched/schedulerthread.h::steal` adapted +// for bocpy's parking protocol. The main differences: +// +// * Verona has no separate park primitive: its `steal()` busy-spins +// with a TSC-quiescence backoff and only commits to the global +// `pause` state after the timeout. bocpy already has a condvar +// park, so the slow loop's job is *not* to outwait contention — +// it just gives a producer a small pre-park grace window in case +// work is about to be published, then returns NULL so the caller +// (`pop_slow`) parks under cv_mu. +// +// * Verona walks `running` (a flag flipped by the global pause() +// side); bocpy walks `self->stop_requested` (per-worker, set by +// `boc_sched_worker_request_stop_all`). +// +// * Verona uses TSC ticks (`DefaultPal::tick`) for the quiescence +// gate; bocpy uses @ref boc_now_ns (CLOCK_MONOTONIC on POSIX, +// QueryPerformanceCounter on Windows). +// +// Loop shape (per round): +// 1. stop_requested check. +// 2. yield (BOC_SCHED_YIELD). +// 3. own queue dequeue (catch work that another thread published +// onto our q since the last pop attempt). +// 4. one full ring of `try_steal` calls (bounded at +// `WORKER_COUNT - 1` distinct victims; self-victim is skipped +// and counted as a failure). +// 5. on miss, sample the monotonic clock; if the elapsed time +// since loop entry exceeds @ref BOC_STEAL_QUIESCENCE_NS, +// return NULL → caller parks. Otherwise sleep briefly and +// retry. +// +// The constant @ref BOC_STEAL_QUIESCENCE_NS is a tunable; 100µs +// matches Verona's `TSC_QUIESCENCE_TIMEOUT` order of magnitude on +// contemporary CPUs. The pre-park backoff is a `nanosleep`-style +// short sleep rather than a busy spin so two parked workers do not +// race their own backoff loops to 100% CPU. + +#ifndef BOC_STEAL_QUIESCENCE_NS +#define BOC_STEAL_QUIESCENCE_NS 100000ULL // 100µs +#endif + +#ifndef BOC_STEAL_BACKOFF_NS +#define BOC_STEAL_BACKOFF_NS 5000ULL // 5µs sleep between rounds +#endif + +/// @brief Multi-victim steal with a brief quiescence window. +/// @details Single full ring of @ref boc_sched_try_steal calls; +/// repeats while @ref BOC_STEAL_QUIESCENCE_NS has not elapsed. +/// Returns the first successfully stolen node, or NULL if the +/// quiescence window expires with every ring round empty (in which +/// case the caller should commit to parking). 
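+///
+/// Both constants are `#ifndef`-guarded tunables, so a build can
+/// override them at compile time, e.g. (hypothetical flags):
+/// @code
+/// // CFLAGS: -DBOC_STEAL_QUIESCENCE_NS=250000ULL
+/// //         -DBOC_STEAL_BACKOFF_NS=10000ULL
+/// @endcode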
+/// +/// **stop_requested honour.** Checked at the top of every round so +/// shutdown is observed even mid-spin. +/// +/// **Own-queue catch.** Before each ring we re-check `self->q`: a +/// concurrent producer (cross-worker dispatch, or another thief +/// splicing remainder onto us) may have published since the last +/// `pop_fast` attempt. +/// +/// @param self Calling worker (must be non-NULL). +/// @return Stolen node, or NULL if the quiescence window expired +/// or shutdown was requested. +static boc_bq_node_t *boc_sched_steal(boc_sched_worker_t *self) { + Py_ssize_t wc = + (Py_ssize_t)boc_atomic_load_u64_explicit(&WORKER_COUNT, BOC_MO_RELAXED); + if (wc <= 1) { + // No peers to steal from. Skip the whole loop — the caller + // will park immediately, which is the only sensible behaviour + // on a single-worker runtime. We do not bump steal_attempts + // here: the call did not actually visit a victim. + return NULL; + } + + const uint64_t deadline = boc_now_ns() + BOC_STEAL_QUIESCENCE_NS; + + for (;;) { + if (boc_atomic_load_bool_explicit(&self->stop_requested, BOC_MO_ACQUIRE)) { + return NULL; + } + + BOC_SCHED_YIELD(); + + // Own-queue catch (Verona schedulerthread.h:269-272). + boc_bq_node_t *n = boc_wsq_dequeue(self); + if (n != NULL) { + return n; + } + + // One full ring of try_steal. WORKER_COUNT - 1 visits is + // enough to attempt every distinct peer once; the cursor + // advances inside try_steal so successive calls see different + // victims. self-victim is automatically skipped (and counted + // as a steal_failure) so a single loop iteration may visit + // self once when WORKER_COUNT == 2 (cursor 0→1→0) — that is + // benign, the worst case is one wasted check. + for (Py_ssize_t i = 0; i < wc - 1; ++i) { + n = boc_sched_try_steal(self); + if (n != NULL) { + return n; + } + } + + // Quiescence gate: if the window has expired, give up and let + // the caller park. Without this gate we would either busy-spin + // forever (waste CPU) or have no preemption between unrelated + // workers (subtle starvation under the GIL). The window must + // be short enough that a worker waiting one quiescence-period + // does not hurt latency-sensitive workloads; 100µs is well + // below any realistic behaviour body and matches Verona's + // TSC_QUIESCENCE_TIMEOUT in order of magnitude. + if (boc_now_ns() >= deadline) { + return NULL; + } + + // Brief sleep so two concurrently-failing thieves do not pin + // their cores. Using `boc_sleep_ns` (compat.h) rather than + // `sched_yield` because we want a hard backoff: a yield is + // ineffective when there is no other runnable thread (the + // case during quiescence). + boc_sleep_ns(BOC_STEAL_BACKOFF_NS); + } +} + +// --------------------------------------------------------------------------- +// Per-worker fairness token (`token_work`) +// --------------------------------------------------------------------------- +// +// `token_work` is a `boc_atomic_ptr_t` slot inside `boc_sched_worker_t`. +// The token itself is a `BOCBehavior` allocated by +// `_core_scheduler_runtime_start` (which is the only TU that knows +// the `BOCBehavior` layout); this TU treats it as an opaque +// `boc_bq_node_t *`. Lifecycle: +// +// * `_core_scheduler_runtime_start` calls `boc_sched_init` then, for +// every worker, allocates a token `BOCBehavior` (zero-initialised, +// `is_token = 1`) and installs `&token->bq_node` here. 
+// * `_core_scheduler_runtime_stop` calls `boc_sched_get_token_node` +// for each worker to recover the pointer, frees the `BOCBehavior`, +// then calls `boc_sched_shutdown`. +// +// The slot is never freed by `boc_sched_shutdown` — that would require +// this TU to dereference a `BOCBehavior`, breaking the layered +// boundary. Releasing it before shutdown is a `_core.c` responsibility. + +int boc_sched_set_token_node(Py_ssize_t worker_index, boc_bq_node_t *node) { + Py_ssize_t wc = + (Py_ssize_t)boc_atomic_load_u64_explicit(&WORKER_COUNT, BOC_MO_RELAXED); + if (worker_index < 0 || worker_index >= wc) { + return -1; + } + // Release-store: a worker thread later doing an acquire-load on + // `token_work` (e.g. token re-enqueue path) must observe the + // node and any of its initialised fields written by the producer. + boc_atomic_store_ptr_explicit(&WORKERS[worker_index].token_work, (void *)node, + BOC_MO_RELEASE); + return 0; +} + +boc_bq_node_t *boc_sched_get_token_node(Py_ssize_t worker_index) { + Py_ssize_t wc = + (Py_ssize_t)boc_atomic_load_u64_explicit(&WORKER_COUNT, BOC_MO_RELAXED); + if (worker_index < 0 || worker_index >= wc) { + return NULL; + } + return (boc_bq_node_t *)boc_atomic_load_ptr_explicit( + &WORKERS[worker_index].token_work, BOC_MO_ACQUIRE); +} + +void boc_sched_set_steal_flag(boc_sched_worker_t *self, bool value) { + if (self == NULL) { + return; + } + // Release-store: pairs with the acquire-load at the top of the + // fairness arm in `boc_sched_worker_pop_slow`. Verona equivalent + // is the closure body in `core.h:28-32` + // (`this->should_steal_for_fairness = true`). + boc_atomic_store_bool_explicit(&self->should_steal_for_fairness, value, + BOC_MO_RELEASE); +} + +bool boc_sched_any_work_visible(void) { + Py_ssize_t wc = + (Py_ssize_t)boc_atomic_load_u64_explicit(&WORKER_COUNT, BOC_MO_RELAXED); + // Walk the full worker array. `boc_bq_is_empty` is an acquire- + // load on the queue's `front` pointer — cheap, no global lock. + // The walk is racy by design (a producer publishing onto a + // queue we have already passed will force itself through the + // CAS arm of the parker protocol; see `unpause_all`), so a + // stale `false` is acceptable: the epoch re-check under `cv_mu` + // catches it before the parker sleeps. + for (Py_ssize_t i = 0; i < wc; ++i) { + if (!boc_wsq_is_empty(&WORKERS[i])) { + return true; + } + } + return false; +} \ No newline at end of file diff --git a/src/bocpy/sched.h b/src/bocpy/sched.h new file mode 100644 index 0000000..1b4b0a2 --- /dev/null +++ b/src/bocpy/sched.h @@ -0,0 +1,936 @@ +/// @file sched.h +/// @brief Work-stealing scheduler: per-worker MPMC queues, parking, stats. +/// +/// This translation unit owns: +/// - the Verona-style intrusive MPMC behaviour queue (`boc_bq_*`), +/// - per-worker statistics POD (@ref boc_sched_stats_t), +/// - the process-global worker array (allocated by @ref boc_sched_init), +/// - the per-start incarnation counter (@ref boc_sched_incarnation_get), +/// - the dispatch / fast-pop / park-and-wait / work-stealing primitives, +/// - per-worker fairness tokens. +/// +/// Verona reference: `verona-rt/src/rt/sched/schedulerstats.h`, +/// `mpmcq.h`, `core.h`, `schedulerthread.h`, `threadpool.h`. 
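+///
+/// Typical worker-loop shape (illustrative sketch only; the real
+/// loop lives in `_core.c` / `worker.py`):
+/// @code
+/// boc_sched_worker_t *w = boc_sched_current_worker();
+/// for (;;) {
+///   boc_bq_node_t *n = boc_sched_worker_pop_fast(w);
+///   if (n == NULL) n = boc_sched_worker_pop_slow(w); // steal / park
+///   if (n == NULL) break; // stop_requested observed
+///   // ... run the behaviour recovered from n via container_of ...
+/// }
+/// @endcode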
+
+#ifndef BOCPY_SCHED_H
+#define BOCPY_SCHED_H
+
+#include <stdalign.h>
+#include <stdbool.h>
+#include <stddef.h>
+#include <stdint.h>
+
+#include <Python.h>
+
+#include "compat.h"
+
+// ---------------------------------------------------------------------------
+// Verona MPMC behaviour queue (`boc_bq_*`)
+// ---------------------------------------------------------------------------
+//
+// Port of `verona-rt/src/rt/sched/mpmcq.h`. Memory orderings match
+// `mpmcq.h` line-for-line; deviations are called out in the
+// doc-comments.
+//
+// The queue is intrusive: each node carries an `_Atomic` link
+// (`boc_bq_node_t::next_in_queue`). Production users embed a
+// `boc_bq_node_t` field (`BOCBehavior::bq_node`) and pass its address
+// to the enqueue/dequeue API; the queue never dereferences anything
+// other than the link, so larger user-defined payloads are reached
+// via container_of-style arithmetic at the call site.
+
+/// @brief Verona-style intrusive link node.
+/// @details Embedded at a struct-end position inside @c BOCBehavior
+/// (see `_core.c`). The queue treats nodes as opaque: the only field
+/// it reads or writes is @c next_in_queue. Test code may allocate
+/// bare @c boc_bq_node_t instances.
+typedef struct boc_bq_node {
+  /// @brief Intrusive forward link, payload type
+  /// `struct boc_bq_node *` stored in a `boc_atomic_ptr_t` slot for
+  /// MSVC compatibility (see `compat.h`).
+  /// @details Reads use @c BOC_MO_ACQUIRE (mpmcq.h:78,145); writes
+  /// use @c BOC_MO_RELEASE (mpmcq.h:113,174) or @c BOC_MO_RELAXED
+  /// (mpmcq.h:103,131) per Verona.
+  boc_atomic_ptr_t next_in_queue;
+} boc_bq_node_t;
+
+/// @brief Half-open contiguous range of nodes built by
+/// `boc_bq_dequeue_all` / consumed by `boc_bq_enqueue_segment`.
+/// @details Mirrors `MPMCQ::Segment` (mpmcq.h:58-90). `start` is the
+/// first node of the segment (NULL → empty); `end` points at the
+/// `next_in_queue` slot inside the *last* node, ready to be
+/// rewritten by the next enqueue.
+typedef struct boc_bq_segment {
+  /// @brief First node in the segment (NULL → empty segment).
+  boc_bq_node_t *start;
+  /// @brief Address of the `next_in_queue` slot of the last node
+  /// (a `boc_atomic_ptr_t` whose payload type is
+  /// `struct boc_bq_node *`).
+  boc_atomic_ptr_t *end;
+} boc_bq_segment_t;
+
+/// @brief MPMC behaviour queue.
+/// @details Empty representation: @c back == @c &front (mpmcq.h:36).
+/// Cacheline-padded so adjacent queues (e.g. the sub-queues of a
+/// worker's WSQ array) do not false-share.
+typedef struct boc_bq {
+  /// @brief Multi-threaded producer end. Payload type is
+  /// `boc_atomic_ptr_t *` (the address of either `front` or some
+  /// node's `next_in_queue` slot); stored as `boc_atomic_ptr_t` for
+  /// MSVC.
+  boc_atomic_ptr_t back;
+  /// @brief Multi-threaded consumer end. Payload type is
+  /// `struct boc_bq_node *`.
+  boc_atomic_ptr_t front;
+  /// @brief Padding so the next `boc_bq_t` does not share a line.
+  char _pad[64 - 2 * sizeof(void *)];
+} boc_bq_t;
+
+/// @brief Default batch size.
+/// @details Mirrors Verona's `BATCH_SIZE` (`schedulerthread.h`).
+/// Consumed by the per-worker `pending`/batch accounting.
+static const size_t BOC_BQ_BATCH_SIZE = 100;
+
+/// @brief Optional schedule-perturbation hook.
+/// @details Expands to nothing in release builds; to `sched_yield()`
+/// when the TU is compiled with `-DBOC_SCHED_SYSTEMATIC`. Mirrors
+/// every `Systematic::yield()` site in `mpmcq.h` so the
+/// schedule-perturbation points the Verona authors validated against
+/// are preserved.
+#ifdef BOC_SCHED_SYSTEMATIC
+#include <sched.h>
+#define BOC_SCHED_YIELD() (void)sched_yield()
+#else
+#define BOC_SCHED_YIELD() ((void)0)
+#endif
+
+// --- Lifecycle -------------------------------------------------------------
+
+/// @brief Initialise an empty queue in place.
+/// @details Sets `back == &front` and `front == NULL`. Safe to call
+/// on a zeroed allocation.
+/// @param q The queue to initialise (must be non-NULL).
+void boc_bq_init(boc_bq_t *q);
+
+/// @brief Assert the queue is empty and tear it down.
+/// @details Mirrors Verona's `~MPMCQ` (mpmcq.h:213-217). Aborts via
+/// @c assert(3) in debug builds if the queue still holds nodes.
+/// @param q The queue to destroy (must be non-NULL).
+void boc_bq_destroy_assert_empty(boc_bq_t *q);
+
+// --- Producers -------------------------------------------------------------
+
+/// @brief Enqueue a single node at the back of the queue.
+/// @details Equivalent to `boc_bq_enqueue_segment({n, &n->next_in_queue})`.
+/// The node's `next_in_queue` is overwritten. Mirrors `MPMCQ::enqueue`
+/// (mpmcq.h:118-121).
+/// @param q The queue (must be non-NULL).
+/// @param n The node to enqueue (must be non-NULL).
+void boc_bq_enqueue(boc_bq_t *q, boc_bq_node_t *n);
+
+/// @brief Enqueue a pre-linked segment at the back of the queue.
+/// @details Mirrors `MPMCQ::enqueue_segment` (mpmcq.h:97-115).
+/// @param q The queue (must be non-NULL).
+/// @param s A non-empty segment.
+void boc_bq_enqueue_segment(boc_bq_t *q, boc_bq_segment_t s);
+
+/// @brief Insert a single node at the front of the queue.
+/// @details Mirrors `MPMCQ::enqueue_front` (mpmcq.h:123-135). Useful
+/// for handing a stolen node back to its owner ahead of any other
+/// pending work.
+/// @param q The queue (must be non-NULL).
+/// @param n The node to insert (must be non-NULL).
+void boc_bq_enqueue_front(boc_bq_t *q, boc_bq_node_t *n);
+
+// --- Consumers -------------------------------------------------------------
+
+/// @brief Try to dequeue a single node from the front.
+/// @details May spuriously return NULL even when the queue is non-
+/// empty (concurrent enqueuer mid-link). Callers must be prepared to
+/// retry. Mirrors `MPMCQ::dequeue` (mpmcq.h:140-184).
+/// @param q The queue (must be non-NULL).
+/// @return The dequeued node, or NULL.
+boc_bq_node_t *boc_bq_dequeue(boc_bq_t *q);
+
+/// @brief Try to detach the entire current contents of the queue.
+/// @details Returns a segment whose `start` is the old front and whose
+/// `end` is the old back; the caller iterates by chasing
+/// `next_in_queue`. May return an empty segment spuriously (same race
+/// as `boc_bq_dequeue`). Mirrors `MPMCQ::dequeue_all`
+/// (mpmcq.h:187-203).
+/// @param q The queue (must be non-NULL).
+/// @return A (possibly empty) segment.
+boc_bq_segment_t boc_bq_dequeue_all(boc_bq_t *q);
+
+/// @brief Atomically take exclusive ownership of the front pointer.
+/// @details Returns the old front and replaces it with NULL, making
+/// the queue *appear* empty to any concurrent consumer. The caller
+/// is responsible for restoring the front (or enqueuing the head
+/// elsewhere). Mirrors `MPMCQ::acquire_front` (mpmcq.h:41-56).
+/// @param q The queue (must be non-NULL).
+/// @return The previous front pointer (may be NULL).
+boc_bq_node_t *boc_bq_acquire_front(boc_bq_t *q);
+
+/// @brief Take a single node from the start of a segment in place.
+/// @details Mirrors `MPMCQ::Segment::take_one` (mpmcq.h:67-89).
May +/// return NULL if (1) the segment is empty, (2) the segment has a +/// single element, or (3) the link from the head has not yet been +/// completed by a concurrent enqueuer. +/// @param s The segment (must be non-NULL); modified in place. +/// @return The detached head, or NULL. +boc_bq_node_t *boc_bq_segment_take_one(boc_bq_segment_t *s); + +// --- Inspection ------------------------------------------------------------ + +/// @brief Best-effort emptiness test. +/// @details Mirrors `MPMCQ::is_empty` (mpmcq.h:206-210). Result may +/// be stale by the time the caller acts on it. +/// @param q The queue (must be non-NULL). +/// @return @c true if the queue currently appears empty. +bool boc_bq_is_empty(boc_bq_t *q); + +// --------------------------------------------------------------------------- +// Verona work-stealing queue cursors (`boc_wsq_*`) +// --------------------------------------------------------------------------- +// +// Port of `verona-rt/src/rt/sched/workstealingqueue.h` and +// `ds/wrapindex.h`. A WSQ is N independent `boc_bq_t` sub-queues +// indexed by three plain-`size_t` cursors: +// - `enqueue_index`: producer side; pre-increment then push. +// - `dequeue_index`: owner pop side; pre-increment then pop, try +// all N before declaring empty. +// - `steal_index`: thief side; selects which of the *victim*'s +// sub-queues to drain in a steal attempt. +// +// All three cursors are owned by the worker that owns the WSQ. +// `enqueue_index` is touched by every thread that pushes onto this +// worker (including remote producers). The race on it is benign: +// (1) `size_t` aligned loads/stores are atomic at the hardware level +// on every ISA bocpy supports; (2) `(idx + 1) % N` is always in +// `[0, N)` regardless of what value was read; (3) the underlying +// `boc_bq_t` is multi-producer-safe; (4) the only observable effect +// is distribution quality, bounded by concurrent-producer count. +// Verona-rt accepts the same race; we make no deviation. + +/// @brief Number of sub-queues per worker WSQ. +/// @details Matches verona-rt's `WorkStealingQueue<4>` template +/// instantiation in `core.h`. Tunable at compile time. +#ifndef BOC_WSQ_N +#define BOC_WSQ_N 4 +#endif + +/// @brief Plain-`size_t` cursor mirroring verona-rt's +/// `WrapIndex` (`ds/wrapindex.h`). +/// @details No atomic; the race on `enqueue_index` between +/// concurrent producers is benign (see header block above). +typedef struct boc_wsq_cursor { + /// @brief Current index in `[0, BOC_WSQ_N)`. + size_t idx; +} boc_wsq_cursor_t; + +/// @brief Pre-increment the cursor (returns the new index). +/// @details Mirrors `WrapIndex::operator++()` (`ds/wrapindex.h`): +/// `index = (index + 1) % N; return index;`. Used by +/// `enqueue` and the owner-side `dequeue` loop. +/// @param c The cursor (must be non-NULL). +/// @return The new index, in `[0, BOC_WSQ_N)`. +static inline size_t boc_wsq_pre_inc(boc_wsq_cursor_t *c) { + c->idx = (c->idx + 1u) % (size_t)BOC_WSQ_N; + return c->idx; +} + +/// @brief Post-decrement the cursor (returns the old index). +/// @details Mirrors `WrapIndex::operator--(int)` +/// (`ds/wrapindex.h`): `auto r = index; index = (r==0?N-1:r-1); +/// return r;`. Reserved for a future `boc_wsq_enqueue_front` +/// wrapper that pushes onto the head of the most-recently-popped +/// sub-queue (verona's `WorkStealingQueue::enqueue_front`); no such +/// wrapper exists in bocpy yet, so the only caller in-tree is the +/// `_internal_test_wsq` shim that exercises the cursor arithmetic +/// directly. 
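+///
+/// Cursor arithmetic (illustrative, with BOC_WSQ_N == 4):
+/// @code
+/// boc_wsq_cursor_t c = {0};
+/// boc_wsq_pre_inc(&c);  // idx 0 -> 1, returns 1 (the new index)
+/// boc_wsq_post_dec(&c); // idx 1 -> 0, returns 1 (the old index)
+/// boc_wsq_post_dec(&c); // idx 0 -> 3, returns 0 (wraps)
+/// @endcode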
+/// @param c The cursor (must be non-NULL).
+/// @return The old index, in `[0, BOC_WSQ_N)`.
+static inline size_t boc_wsq_post_dec(boc_wsq_cursor_t *c) {
+  size_t r = c->idx;
+  c->idx = (r == 0u) ? ((size_t)BOC_WSQ_N - 1u) : (r - 1u);
+  return r;
+}
+
+// ---------------------------------------------------------------------------
+// Scheduler instrumentation
+// ---------------------------------------------------------------------------
+
+/// @brief Per-worker statistics counter block (POD).
+///
+/// All fields are plain @c uint64_t so a snapshot is a memcpy. Counters
+/// are written with @c memory_order_relaxed by the thread performing
+/// the event: the owning worker for every field except
+/// @c pushed_remote, which an off-worker producer bumps on the
+/// receiving worker's slot (see Verona `schedulerstats.h`). Readers
+/// (the Python @c scheduler_stats accessor) load with the same ordering
+/// and accept values from different instants; the snapshot is
+/// best-effort, not a barrier.
+typedef struct boc_sched_stats {
+  /// @brief Behaviours this worker pushed onto its own WSQ via the
+  /// producer-local arm of @ref boc_sched_dispatch.
+  /// @details Bumped only when an existing @c pending occupant is
+  /// evicted to the queue to make room for the new dispatch.
+  /// Dispatches that install into an empty @c pending slot bump
+  /// @ref dispatched_to_pending instead.
+  ///
+  /// **Reconciliation.** This counter records this worker's *role
+  /// as producer*. Across the whole pool the global identity
+  /// @c "Σ (pushed_local + dispatched_to_pending + pushed_remote)
+  /// == Σ (popped_local + popped_via_steal)" holds at quiescence.
+  /// **Per-worker** the same identity does NOT hold: nodes
+  /// redistributed onto a thief by @ref boc_wsq_enqueue_spread are
+  /// not re-counted on the thief, so a thief's per-worker
+  /// @c (pushed_local + dispatched_to_pending + pushed_remote -
+  /// popped_local - popped_via_steal) is biased and is **not** a
+  /// queue-depth estimate.
+  uint64_t pushed_local;
+  /// @brief Behaviours dispatched into an empty @c pending slot on
+  /// the producer-local arm of @ref boc_sched_dispatch.
+  /// @details The 1-deep producer-locality bypass: if @c pending is
+  /// NULL when @c boc_sched_dispatch fires, the new node is parked
+  /// in @c pending (no queue push) and this counter is bumped. Without
+  /// this counter the queue's @c pushed_local underreports total
+  /// dispatched work whenever the producer is steady-state
+  /// (pop-then-dispatch keeps @c pending empty most cycles), which
+  /// makes contention-on-queue gates look quiet even when the
+  /// dispatch path is saturated.
+  ///
+  /// Like @ref pushed_local this counter records this worker's
+  /// *producer* role; see the reconciliation note on @ref
+  /// pushed_local for the per-worker vs. global identity.
+  uint64_t dispatched_to_pending;
+  /// @brief Behaviours pushed onto this worker's queue by off-worker
+  /// producers via the round-robin dispatch path.
+  /// @details Bumped on the receiving worker: off-worker producers
+  /// own no stats block of their own (see the off-worker arm of
+  /// @ref boc_sched_dispatch).
+  uint64_t pushed_remote;
+  /// @brief Behaviours this worker popped from its own queue.
+  uint64_t popped_local;
+  /// @brief Behaviours this worker stole from another worker's queue.
+  uint64_t popped_via_steal;
+  /// @brief CAS retries observed in the worker queue's enqueue path.
+  uint64_t enqueue_cas_retries;
+  /// @brief CAS retries observed in the worker queue's dequeue path.
+  uint64_t dequeue_cas_retries;
+  /// @brief Times the consumer-side @c BATCH_SIZE accounting forced
+  /// a queue dequeue to bypass the @c pending fast path. Verona
+  /// equivalent: the `batch == 0` branch in `get_work`
+  /// (`schedulerthread.h:122-138`).
+  uint64_t batch_resets;
+  /// @brief Times this worker entered @c boc_sched_try_steal.
+  /// @details Each call counts as one attempt regardless of whether
+  /// it returned a node (success bumps @ref popped_via_steal too) or
+  /// returned NULL (also bumps @ref steal_failures). Verona
+  /// equivalent: `core->stats.steal()` is summed implicitly per
+  /// entry in `schedulerthread.h::try_steal`.
+  uint64_t steal_attempts;
+  /// @brief Subset of @ref steal_attempts that returned NULL.
+  /// @details Diagnostic counter; useful for tuning the
+  /// quiescence-timeout in the slow-steal loop. Self-victim skips
+  /// also bump this counter.
+  uint64_t steal_failures;
+  /// @brief Times this worker entered @c cnd_wait under @c cv_mu.
+  /// @details Bumped immediately before the @c cnd_wait call in
+  /// @ref boc_sched_worker_pop_slow's park arm. Each park entry
+  /// counts once regardless of why the worker was woken (signal,
+  /// shutdown, spurious wake). Diagnostic; complements the live
+  /// @c parked bool on the worker (which is true while currently
+  /// blocked, false otherwise).
+  uint64_t parked;
+  /// @brief Monotonic timestamp (ns) of this worker's last
+  /// @ref boc_sched_try_steal entry.
+  /// @details Stamped via @ref boc_now_ns on every
+  /// @ref boc_sched_try_steal call (success or failure). Zero if
+  /// the worker has never attempted a steal. Used by tests to
+  /// detect that the steal arm has actually fired and (with two
+  /// snapshots) to bound the duration a worker spent in the slow
+  /// steal loop.
+  uint64_t last_steal_attempt_ns;
+  /// @brief Times the steal-for-fairness arm in @ref
+  /// boc_sched_worker_pop_slow actually fired (flag observed set
+  /// AND local queue non-empty). Distinguishes "flag was set but
+  /// the worker never paid attention" (arm dead) from "flag was
+  /// set and the worker honoured it" (arm live). Diagnostic only.
+  uint64_t fairness_arm_fires;
+} boc_sched_stats_t;
+
+/// @brief Per-worker statistics counter block (live atomic copy).
+///
+/// Same field set as @ref boc_sched_stats_t but every field is a
+/// @c boc_atomic_u64_t so writers can use the @c boc_atomic_*
+/// helpers without compiler warnings about plain @c uint64_t*.
+/// @ref boc_sched_stats_snapshot loads each field with
+/// @c memory_order_relaxed and copies it into a @ref boc_sched_stats_t
+/// for the Python-side accessor; the snapshot is best-effort and may
+/// observe individual counter values from different points in time.
+/// Field order MUST match @ref boc_sched_stats_t one-for-one (the
+/// snapshot routine relies on the structural correspondence rather
+/// than a memcpy).
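+///
+/// Snapshot sketch (illustrative; the real routine is
+/// @ref boc_sched_stats_snapshot in `sched.c`):
+/// @code
+/// out->pushed_local = boc_atomic_load_u64_explicit(
+///     &w->stats.pushed_local, BOC_MO_RELAXED);
+/// // ... one relaxed load per field, in declaration order ...
+/// @endcode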
+typedef struct boc_sched_stats_atomic {
+  boc_atomic_u64_t pushed_local;
+  boc_atomic_u64_t dispatched_to_pending;
+  boc_atomic_u64_t pushed_remote;
+  boc_atomic_u64_t popped_local;
+  boc_atomic_u64_t popped_via_steal;
+  boc_atomic_u64_t enqueue_cas_retries;
+  boc_atomic_u64_t dequeue_cas_retries;
+  boc_atomic_u64_t batch_resets;
+  boc_atomic_u64_t steal_attempts;
+  boc_atomic_u64_t steal_failures;
+  boc_atomic_u64_t parked;
+  boc_atomic_u64_t last_steal_attempt_ns;
+  boc_atomic_u64_t fairness_arm_fires;
+} boc_sched_stats_atomic_t;
+
+// ---------------------------------------------------------------------------
+// Per-worker scheduler state (`boc_sched_worker_t`)
+// ---------------------------------------------------------------------------
+//
+// Holds the per-worker MPMC queue, the fairness-token slot
+// (`token_work` / `should_steal_for_fairness`), the parking-protocol
+// `cv_mu` / `cv` pair (`compat.h` `BOCMutex` / `BOCCond`, pthread on
+// POSIX, SRWLock on MSVC), the ring-link `next_in_ring` pointer, the
+// per-worker counter block, and a reserved terminator-delta slot.
+// Atomics use the typed `compat.h` shim (`boc_atomic_*_t` +
+// `boc_atomic_*_explicit`) so the layout compiles identically on POSIX
+// and MSVC ARM64.
+//
+// Cacheline-aligned via `alignas(BOC_SCHED_CACHELINE)` on the first
+// member (the portable C spelling; see the struct doc below), and a
+// trailing pad rounds the size up to the next cacheline so that
+// arrays of workers do not false-share between adjacent slots. The
+// pad size is computed from a `_payload` helper struct so it tracks
+// the platform-dependent sizes of `BOCMutex` / `BOCCond` automatically.
+
+#ifndef BOC_SCHED_CACHELINE
+#define BOC_SCHED_CACHELINE 64
+#endif
+
+/// @brief Forward declaration of @ref BOCBehavior (defined in
+/// @c _core.c). The scheduler treats it as opaque; the
+/// producer-locality TLS @c pending slot stores it as
+/// `void *` to avoid any layout coupling.
+struct BOCBehavior;
+
+/// @brief Per-worker scheduler state (forward decl).
+typedef struct boc_sched_worker boc_sched_worker_t;
+
+/// @brief Helper struct used only to compute the trailing pad.
+/// @details The fields here are duplicated verbatim into
+/// @ref boc_sched_worker below; this helper is never instantiated and
+/// exists solely so that `sizeof` reports the unpadded payload size
+/// for the pad computation. Keeping the two field lists in sync is
+/// enforced by a `static_assert` after the real struct definition.
+struct boc_sched_worker_payload_ {
+  boc_bq_t q[BOC_WSQ_N];
+  boc_wsq_cursor_t enqueue_index;
+  boc_wsq_cursor_t dequeue_index;
+  boc_wsq_cursor_t steal_index;
+  boc_atomic_ptr_t token_work;
+  boc_atomic_bool_t should_steal_for_fairness;
+  boc_atomic_bool_t stop_requested;
+  boc_atomic_bool_t parked;
+  Py_ssize_t owner_interp_id;
+  BOCMutex cv_mu;
+  BOCCond cv;
+  struct boc_sched_worker *next_in_ring;
+  boc_sched_stats_atomic_t stats;
+  boc_atomic_u64_t reserved_terminator_delta;
+};
+
+/// @brief Trailing-pad byte count.
+/// @details Rounds @ref boc_sched_worker_payload_ up to the next
+/// multiple of @ref BOC_SCHED_CACHELINE. The outer `% CACHELINE`
+/// converts an exact-fit (zero pad needed) into 0 instead of one
+/// full cacheline.
+#define BOC_SCHED_WORKER_PAD_                                             \
+  ((BOC_SCHED_CACHELINE -                                                 \
+    (sizeof(struct boc_sched_worker_payload_) % BOC_SCHED_CACHELINE)) %   \
+   BOC_SCHED_CACHELINE)
+
+/// @brief Per-worker scheduler state.
+/// @details All field semantics:
+/// - @c q: this worker's WSQ — array of @ref BOC_WSQ_N independent
+///   MPMC behaviour sub-queues.
Pushes / pops / steals select a +/// sub-queue via the three cursors below; mirrors verona-rt's +/// `WorkStealingQueue::queues[N]`. +/// - @c enqueue_index / @c dequeue_index / @c steal_index: +/// plain-`size_t` cursors (`boc_wsq_cursor_t`) ported from +/// verona-rt's `WrapIndex`. See the header block above +/// @ref boc_wsq_cursor_t for the benign-race rationale. +/// - @c token_work: fairness token's queue node. +/// - @c should_steal_for_fairness: flag set when the fairness +/// token is popped; consumed by @ref boc_sched_worker_pop_slow. +/// - @c stop_requested: shutdown signal (`request_stop_all` writes +/// it under release; `pop_slow` reads under acquire). Honoured +/// by the parking loop only — never gated on the terminator. +/// - @c parked: parking-protocol witness (REL/ACQ paired with +/// @c cv_mu). +/// - @c owner_interp_id: sub-interpreter id of the worker that +/// called `boc_sched_worker_register` for this slot. Used for +/// wrong-thread asserts in `pop`. +/// - @c cv_mu / @c cv: parking-protocol mutex/condvar (compat.h +/// wrappers). +/// - @c next_in_ring: forms a circular singly-linked ring over +/// @ref boc_sched_worker_count workers; immutable after +/// @ref boc_sched_init. +/// - @c stats: per-worker counter block. +/// - @c reserved_terminator_delta: placeholder for a future +/// per-worker terminator delta. +struct boc_sched_worker { + /// @brief First member carries an explicit alignment so the struct + /// itself is cacheline-aligned (C11: `_Alignas` on a struct-type + /// definition is a C++ extension; placing the alignment on the + /// first member is the portable C equivalent and raises the + /// containing struct's alignment requirement to match). + /// + /// @details `q` is an array of `BOC_WSQ_N` independent MPMC sub- + /// queues; pushes / pops / steals route through different sub- + /// queues selected by the three cursors below. Mirrors + /// `WorkStealingQueue::queues[N]` (verona-rt). + alignas(BOC_SCHED_CACHELINE) boc_bq_t q[BOC_WSQ_N]; + /// @brief Producer cursor (`++` then push). Touched by every + /// thread that dispatches onto this worker; the race is benign + /// (see header block above @ref boc_wsq_cursor_t). + boc_wsq_cursor_t enqueue_index; + /// @brief Owner-pop cursor (`++` then pop, try all N before + /// declaring empty). Owner-only. + boc_wsq_cursor_t dequeue_index; + /// @brief Thief cursor selecting which of a *victim*'s sub- + /// queues to drain. Owner-only (this worker, when stealing). + boc_wsq_cursor_t steal_index; + boc_atomic_ptr_t token_work; + boc_atomic_bool_t should_steal_for_fairness; + boc_atomic_bool_t stop_requested; + boc_atomic_bool_t parked; + Py_ssize_t owner_interp_id; + BOCMutex cv_mu; + BOCCond cv; + struct boc_sched_worker *next_in_ring; + boc_sched_stats_atomic_t stats; + boc_atomic_u64_t reserved_terminator_delta; + /// @brief Trailing pad to the next cacheline boundary. + /// @details Sized so `sizeof(boc_sched_worker_t) % CACHELINE == 0`; + /// declared as @c [1] when no pad is needed (zero-length arrays are + /// not portable C). The post-definition `static_assert` guarantees + /// the array is never read past the live pad. + char _pad[BOC_SCHED_WORKER_PAD_ > 0 ? 
BOC_SCHED_WORKER_PAD_ : 1]; +}; + +static_assert(sizeof(boc_sched_worker_t) % BOC_SCHED_CACHELINE == 0, + "boc_sched_worker_t must be cacheline-multiple in size"); +static_assert(alignof(boc_sched_worker_t) >= BOC_SCHED_CACHELINE, + "boc_sched_worker_t must be cacheline-aligned"); + +// --------------------------------------------------------------------------- +// Verona work-stealing queue helpers (`boc_wsq_*`) +// --------------------------------------------------------------------------- +// +// Inline routing wrappers around the per-worker WSQ. They mirror +// verona-rt's `WorkStealingQueue` member functions one-for-one; +// the underlying `boc_bq_*` MPMCQ is unchanged. Each wrapper takes a +// `boc_sched_worker_t *` rather than a bare `boc_bq_t *` because the +// cursor lives on the worker. + +/// @brief Push a single node onto a worker's WSQ. +/// @details Mirrors `WorkStealingQueue::enqueue` (verona-rt +/// `workstealingqueue.h`): pre-increments @c enqueue_index then +/// pushes onto `q[idx]`. Safe to call from any thread; the cursor +/// race is benign (see header block above @ref boc_wsq_cursor_t). +/// @param w The target worker (must be non-NULL). +/// @param n The node to enqueue (must be non-NULL). +static inline void boc_wsq_enqueue(boc_sched_worker_t *w, boc_bq_node_t *n) { + size_t idx = boc_wsq_pre_inc(&w->enqueue_index); + boc_bq_enqueue(&w->q[idx], n); +} + +/// @brief Owner-side pop from a worker's WSQ. +/// @details Mirrors `WorkStealingQueue::dequeue` (verona-rt +/// `workstealingqueue.h`): for `i in [0, N)`, pre-increment +/// @c dequeue_index and try `boc_bq_dequeue(&q[idx])`; return the +/// first non-NULL. Owner-only — @c dequeue_index has no atomic. +/// @param w The owning worker (must be non-NULL). +/// @return A behaviour node, or NULL if all N sub-queues appear +/// empty (best-effort; same spurious-NULL caveat as +/// @ref boc_bq_dequeue). +static inline boc_bq_node_t *boc_wsq_dequeue(boc_sched_worker_t *w) { + for (size_t i = 0; i < (size_t)BOC_WSQ_N; ++i) { + size_t idx = boc_wsq_pre_inc(&w->dequeue_index); + boc_bq_node_t *n = boc_bq_dequeue(&w->q[idx]); + if (n != NULL) { + return n; + } + } + return NULL; +} + +/// @brief Best-effort emptiness test across all N sub-queues. +/// @details Mirrors `WorkStealingQueue::is_empty` (verona-rt +/// `workstealingqueue.h`): scans every sub-queue; first non-empty +/// short-circuits to `false`. Result may be stale by the time the +/// caller acts on it — same caveat as @ref boc_bq_is_empty. +/// @param w The worker to inspect (must be non-NULL). +/// @return @c true if all N sub-queues currently appear empty. +static inline bool boc_wsq_is_empty(boc_sched_worker_t *w) { + for (size_t i = 0; i < (size_t)BOC_WSQ_N; ++i) { + if (!boc_bq_is_empty(&w->q[i])) { + return false; + } + } + return true; +} + +/// @brief Spread a segment across @p self's WSQ sub-queues. +/// @details Mirrors `WorkStealingQueue::enqueue_spread` (verona-rt +/// `workstealingqueue.h`): +/// @code +/// while ((n = ls.take_one()) != nullptr) enqueue(n); +/// enqueue(ls); // residual tail goes onto one sub-queue +/// @endcode +/// Each `take_one` peels one node off the head of the segment; the +/// node is pushed via @ref boc_wsq_enqueue, which pre-increments +/// @c enqueue_index so successive nodes round-robin across the N +/// sub-queues. The final residual (typically a single node, or in +/// the mid-link-race case a partial segment we cannot drain) is +/// enqueued as a segment onto one freshly-chosen sub-queue. 
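+///
+/// Worked example (illustrative): a fully linked six-node segment
+/// with @c enqueue_index at 0 peels five nodes onto q[1], q[2],
+/// q[3], q[0], q[1]; @c take_one then returns NULL on the single
+/// remaining node, and the residual one-node segment lands on q[2].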
+///
+/// Caller invariant: @p ls is non-empty (the steal-loop exit
+/// conditions guarantee this — fully empty and single-element
+/// segments are handled before falling through to spread).
+/// @param self The thief worker (must be non-NULL).
+/// @param ls The segment to redistribute.
+static inline void boc_wsq_enqueue_spread(boc_sched_worker_t *self,
+                                          boc_bq_segment_t ls) {
+  for (;;) {
+    boc_bq_node_t *n = boc_bq_segment_take_one(&ls);
+    if (n == NULL) {
+      break;
+    }
+    boc_wsq_enqueue(self, n);
+  }
+  // Tail residual: verona pushes the final segment unconditionally
+  // onto a single sub-queue via `++enqueue_index`. With N=4 and
+  // typical steal segments of dozens of nodes, the spreading has
+  // already happened; the tail is at most a singleton (or a
+  // mid-link partial we could not drain).
+  size_t idx = boc_wsq_pre_inc(&self->enqueue_index);
+  boc_bq_enqueue_segment(&self->q[idx], ls);
+}
+
+/// @brief Initialise the scheduler module for a fresh runtime cycle.
+/// @details Allocates the per-worker array of length @p worker_count
+/// (zero-initialised) and increments the per-start incarnation counter
+/// (Verona `threadpool.h` precedent). Safe to call with
+/// @p worker_count == 0, in which case the scheduler is in a quiescent
+/// no-workers state. Called from @c behaviors.start() with the real
+/// worker count on every start cycle, after @ref boc_sched_shutdown
+/// has freed any prior array, and from @c _core_module_exec at module
+/// init with @p worker_count == 0.
+///
+/// Must be called with the GIL held (sets Python exceptions on
+/// failure and is sequenced from Python-visible module init /
+/// `behaviors.start()`). The underlying allocation uses
+/// @c PyMem_RawCalloc so the array is process-global and remains
+/// valid across sub-interpreter boundaries.
+/// @param worker_count Number of worker slots to allocate. Pass 0 for
+///        a quiescent no-workers state.
+/// @return 0 on success, -1 on allocation failure (Python exception
+///         set).
+int boc_sched_init(Py_ssize_t worker_count);
+
+/// @brief Tear down the scheduler module's per-worker array.
+/// @details Frees the array allocated by @ref boc_sched_init and
+/// resets the worker count to 0. Idempotent. Counters are not
+/// archived anywhere — callers that want to keep them must snapshot
+/// first via @ref boc_sched_stats_snapshot.
+///
+/// Must be called with the GIL held.
+void boc_sched_shutdown(void);
+
+/// @brief Number of worker slots currently allocated.
+/// @return 0 if @ref boc_sched_init has not been called or the most
+///         recent @ref boc_sched_shutdown has run; otherwise the
+///         @c worker_count passed to the most recent @ref
+///         boc_sched_init.
+Py_ssize_t boc_sched_worker_count(void);
+
+/// @brief Borrow a pointer to one of the worker slots.
+/// @details Returns a non-owning pointer into the @c WORKERS array
+/// for use with the @c boc_bq_* primitives (e.g. orphan-drain on
+/// shutdown calls @c boc_bq_dequeue(&boc_sched_worker_at(i)->q[j])
+/// to walk each per-task queue from outside @c sched.c). The
+/// returned pointer is invalidated by @ref boc_sched_shutdown.
+/// @param worker_index Zero-based worker slot.
+/// @return Borrowed worker pointer, or NULL if @p worker_index is
+///         out of range. No Python exception is set on NULL.
+boc_sched_worker_t *boc_sched_worker_at(Py_ssize_t worker_index);
+
+/// @brief Copy the snapshot of one worker's counters into @p out.
+/// @details Reads use @c memory_order_relaxed; the snapshot is +/// best-effort and may observe individual counter values from +/// different points in time. +/// @param worker_index Zero-based worker slot. +/// @param out Destination POD; must be non-NULL. +/// @return 0 on success, -1 if @p worker_index is out of range or +/// @p out is NULL. No Python exception is set on -1. +int boc_sched_stats_snapshot(Py_ssize_t worker_index, boc_sched_stats_t *out); + +/// @brief Read the current scheduler incarnation. +/// @details Increments by exactly one on every @ref boc_sched_init +/// call. TLS round-robin cursors compare against this value to detect +/// that the worker array has been reallocated since they last cached +/// a worker pointer. +/// @return The current incarnation. Plain @c size_t (Verona +/// `threadpool.h:40` precedent: not @c _Atomic). +size_t boc_sched_incarnation_get(void); + +// --------------------------------------------------------------------------- +// Per-worker registration +// --------------------------------------------------------------------------- + +/// @brief Atomically claim a worker slot for the calling thread. +/// @details Allocates the next free slot in @ref WORKERS using an +/// internal atomic counter that is reset on every @ref boc_sched_init. +/// Stamps the slot's @c owner_interp_id with a witness drawn from +/// the calling sub-interpreter and sets the per-thread @c +/// current_worker TLS handle so subsequent dispatch / pop operations +/// on this thread find their worker without a hashtable lookup. +/// +/// **Self-allocation rather than caller-supplied index.** Verona's +/// `SchedulerThread` is constructed by the thread pool with a known +/// index. bocpy worker sub-interpreters share a single +/// `worker_script` that has no static knowledge of which slot it +/// will inhabit, and `_core.index()` is a process-monotonic counter +/// that does not reset across `start()`/`wait()`/`start()` cycles. +/// A self-allocating register() is the cleanest way to map worker +/// threads to slots 0..worker_count-1 in re-entry-safe fashion. The +/// contract is: over-registration returns -1. +/// +/// Must be called with the GIL held (writes the TLS handle and +/// reads sub-interpreter state). +/// @return The assigned slot (0 .. worker_count-1) on success, or -1 +/// if no free slot remains. No Python exception is set on -1. +Py_ssize_t boc_sched_worker_register(void); + +// --------------------------------------------------------------------------- +// Park / unpark protocol +// --------------------------------------------------------------------------- +// +// Port of Verona's two-epoch `pause`/`unpause` protocol from +// `verona-rt/src/rt/sched/threadpool.h:282-379`. + +/// @brief Pop the next behaviour for the calling worker, blocking +/// until work arrives or shutdown is requested. +/// @details Implements the parker side of the protocol. The +/// caller must have previously called @ref boc_sched_worker_register +/// (so @c current_worker TLS is set; @p self is passed explicitly so +/// the implementation does not have to re-resolve TLS on every loop +/// iteration). Reached on every worker loop iteration in +/// @c worker.py::do_work after the local-queue / pending fast paths +/// return NULL. +/// +/// **Returns NULL only when @c self->stop_requested is observed +/// true.** Quiescence (the terminator reaching zero) does not exit +/// the parker; that is the runtime's responsibility, signalled +/// through @ref boc_sched_worker_request_stop_all. 
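+///
+/// Shutdown sketch (illustrative; the real sequence lives in
+/// `behaviors.stop_workers()`):
+/// @code
+/// boc_sched_worker_request_stop_all(); // set stop_requested, wake all
+/// // each worker's next pop_slow iteration observes the flag and
+/// // returns NULL, unwinding its loop
+/// @endcode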
+/// +/// Releases the GIL across the actual @c cnd_wait so other Python +/// work can proceed while the worker is parked. +/// +/// **Returns the dequeued queue node, not the containing +/// @c BOCBehavior.** Callers in @c _core.c convert the node to its +/// owning behaviour via the standard `container_of` arithmetic +/// (see @c BEHAVIOR_FROM_BQ_NODE in @c _core.c). Keeping the +/// scheduler decoupled from the @c BOCBehavior layout avoids a +/// circular header dependency between @c sched.h and +/// @c _core.c's behaviour struct. +boc_bq_node_t *boc_sched_worker_pop_slow(boc_sched_worker_t *self); + +/// @brief Set @c stop_requested on every worker and wake them all. +/// @details Issued by @c behaviors.stop_workers() after the runtime +/// is quiescent. Each worker exits @ref boc_sched_worker_pop_slow on +/// its next loop iteration (or immediately, if currently parked). +/// Idempotent. +void boc_sched_worker_request_stop_all(void); + +/// @brief Wake every parked worker in the ring. +/// @details Walks the worker ring once starting from +/// @p self->next_in_ring and sends a @c cnd_signal to every worker +/// whose @c parked flag is true. Called from the producer side of +/// the parking protocol when a CAS on @c unpause_epoch wins; +/// mirrors Verona's @c ThreadSync::unpause_all +/// (`verona-rt/src/rt/sched/threadsync.h:108-128`, +/// `threadpool.h:367-373`) which wakes every waiter on the global +/// waiter list. The broadcast lets every parker re-run +/// @c boc_sched_any_work_visible() and either dequeue locally or +/// initiate a steal; parkers that find no work re-loop and re-park. +/// Early-outs when @c PARKED_COUNT is observed as zero so the common +/// no-parker case stays cheap. Safe to pass @p self == NULL (skips +/// the walk). +void boc_sched_unpause_all(boc_sched_worker_t *self); + +/// @brief Lock-then-signal a specific worker. +/// @details Used by the producer fast arm to deliver a targeted wake +/// when the off-worker dispatch path lands a behaviour on @p target. +/// No-op if @p target is NULL or already non-parked. +void boc_sched_signal_one(boc_sched_worker_t *target); + +/// @brief Read the calling thread's @c current_worker TLS slot. +/// @details Returns the worker handle installed by the most recent +/// @ref boc_sched_worker_register on this thread, or NULL if the +/// thread has never registered. Lets call sites in @c _core.c reach +/// into the TLS without each TU having to declare its own +/// @c thread_local mirror. +boc_sched_worker_t *boc_sched_current_worker(void); + +// --------------------------------------------------------------------------- +// Dispatch + fast-path pop +// --------------------------------------------------------------------------- +// +// @ref boc_sched_dispatch is the producer-side entry point. Production +// callers in @c _core.c invoke it as +// @c boc_sched_dispatch(&behavior->bq_node); test code reaches it via +// @c _core.scheduler_dispatch_node / @c _core.scheduler_pop_fast. + +/// @brief Schedule a behaviour for execution. +/// @details Producer-side dispatch with two arms (chosen by whether +/// the calling thread is registered as a worker): +/// +/// **Producer-local arm** (`current_worker != NULL`). Verona +/// `schedule_fifo` semantics +/// (`schedulerthread.h:86-101`): always evict the prior @c pending +/// to the worker's local queue and install @p n as the new +/// @c pending. 
Result: the most-recent dispatch runs first when the +/// worker reaches @ref boc_sched_worker_pop_fast, which is the +/// cache-friendly behaviour Verona was tuned for. No targeted wake +/// is issued because the producer is itself the worker that will +/// run the work. +/// +/// **Off-worker arm** (`current_worker == NULL`). The main thread +/// (or any non-worker thread) picks a target from the worker ring +/// using a TLS round-robin cursor that re-seeds whenever the +/// scheduler incarnation changes. The behaviour is enqueued +/// directly onto the target's @c q, then a targeted +/// @ref boc_sched_signal_one wake is issued. +/// +/// **Slow arm (both producers).** After publish, the +/// pause/unpause-aware wake fires: load `(pe, ue)`; if `pe != ue` a +/// CAS forwards `unpause_epoch` to `pause_epoch`; the CAS winner +/// calls @ref boc_sched_unpause_all to wake every parked peer. This +/// closes the producer-on-other-worker liveness gap. +/// +/// **No-runtime case.** If no workers are registered (off-worker +/// arm with @c WORKER_COUNT == 0), the function sets a +/// @c RuntimeError ("scheduler not running") and returns -1. The +/// caller must propagate the failure so the corresponding +/// @c terminator_inc / queue-side reservation is rolled back; the +/// reference behaviour is in @c whencall in @c behaviors.py +/// (try/except around @c BehaviorCapsule.schedule that drops the +/// terminator hold). +/// +/// @param n The behaviour's @c bq_node (typically +/// @c &behavior->bq_node from @c _core.c). +/// @return 0 on success, -1 on failure (Python exception set). +int boc_sched_dispatch(boc_bq_node_t *n); + +/// @brief Fast-path consumer pop. +/// @details Returns the calling worker's pending-or-queue-head +/// behaviour without parking. NULL means the local fast paths are +/// dry; the caller falls through to @ref boc_sched_worker_pop_slow +/// for the steal/park arm. +/// @param self The calling worker (typically +/// @ref boc_sched_current_worker()). +/// @return The dequeued node, or NULL if pending and the local +/// queue are both empty. +boc_bq_node_t *boc_sched_worker_pop_fast(boc_sched_worker_t *self); + +// --------------------------------------------------------------------------- +// Build-time feature gate +// --------------------------------------------------------------------------- +// +// `BOC_HAVE_TRY_STEAL` toggles the parker's `check_for_work` walk +// between "inspect own queue only" (off) and "walk the full ring" +// (on). Defined unconditionally here; the off mode is reserved for +// debugging and is not part of any supported build. +#define BOC_HAVE_TRY_STEAL 1 + +/// @brief Test whether any worker's queue currently has visible work. +/// @details Walks the full worker array and calls +/// @ref boc_wsq_is_empty on each worker, which itself scans all +/// @c BOC_WSQ_N sub-queues of that worker. Returns @c true on the +/// first non-empty sub-queue found, @c false if every sub-queue of +/// every worker is empty. Cheap: bounded by +/// @c WORKER_COUNT * BOC_WSQ_N @c boc_bq_is_empty reads, each a +/// single acquire-load on the queue's @c front pointer. Mirrors +/// Verona's parker-side @c check_for_work walk +/// ([`threadpool.h::check_for_work`](../../verona-rt/src/rt/sched/threadpool.h)), +/// gated on @c BOC_HAVE_TRY_STEAL. +/// +/// **Memory ordering.** Each @ref boc_bq_is_empty read is acquire- +/// ordered. The full walk is *not* a snapshot — a producer racing +/// with this call may publish onto a queue we have already passed. 
+/// That is fine: the parker has already bumped @c PAUSE_EPOCH +/// (seq_cst) before calling this, so the racing producer is forced +/// into the CAS arm and will signal a parker if needed (see +/// @ref boc_sched_unpause_all). Returning a stale @c false is the +/// only race outcome, and the epoch re-check under @c cv_mu +/// catches it before the worker actually sleeps. +/// @return @c true if at least one worker has visible queue work. +bool boc_sched_any_work_visible(void); + +// --------------------------------------------------------------------------- +// Per-worker fairness token (`token_work`) +// --------------------------------------------------------------------------- +// +// Each worker owns a `BOCBehavior`-shaped sentinel whose `is_token` +// discriminator is set to 1. The token is allocated by +// `_core_scheduler_runtime_start` (because it knows the +// `BOCBehavior` layout) and installed into the worker's +// `token_work` slot via @ref boc_sched_set_token_node. On every +// successful pop, the dispatch site checks `is_token`; if set, the +// popping worker flips its `should_steal_for_fairness` flag and +// re-enqueues the token instead of running user code. Verona ports: +// `Core::token_work` (`core.h:22-37`), token-thunk dequeue +// (`schedulerthread.h::run_inner`). + +/// @brief Install the per-worker fairness token's queue node. +/// @details Stores @p node into @c WORKERS[worker_index].token_work +/// using @c BOC_MO_RELEASE so a subsequent acquire-load on a worker +/// thread observes the install. Idempotent overwrite (callers are +/// expected to call this at most once per worker per runtime +/// cycle); @p node may be NULL to clear the slot during shutdown. +/// Must be called with the GIL held. +/// @param worker_index Zero-based worker slot. +/// @param node The token's @c bq_node pointer (typically +/// @c &token_behavior->bq_node), or NULL to clear. +/// @return 0 on success, -1 if @p worker_index is out of range. No +/// Python exception is set on -1. +int boc_sched_set_token_node(Py_ssize_t worker_index, boc_bq_node_t *node); + +/// @brief Read the per-worker fairness token's queue node. +/// @details Acquire-load of the @c token_work slot. Returns NULL if +/// no token is installed (pre-install or after a +/// @c boc_sched_set_token_node(.., NULL)). Used by the runtime +/// teardown path to recover the token pointer before freeing the +/// per-worker array. +/// @param worker_index Zero-based worker slot. +/// @return The installed token node, or NULL. +boc_bq_node_t *boc_sched_get_token_node(Py_ssize_t worker_index); + +/// @brief Set the calling worker's @c should_steal_for_fairness flag. +/// @details Release-store of @p value into +/// @c self->should_steal_for_fairness. This is the C-side body of the +/// Verona token closure +/// ([`core.h:28-33`](../../verona-rt/src/rt/sched/core.h#L28)): when +/// the dispatch site at @ref _core_scheduler_worker_pop pops a node +/// whose owning behaviour has @c is_token set, it calls this with +/// @p value = true so the next @ref boc_sched_worker_pop_slow +/// iteration takes the steal-for-fairness arm. Exposed as a sched +/// helper to keep the per-worker layout opaque to the dispatch TU. +/// @param self The calling worker (must be the result of +/// @ref boc_sched_current_worker on this thread). +/// @param value New flag value (typically @c true from the token +/// thunk; @c false at the steal arm before re-enqueueing +/// the token). 
+void boc_sched_set_steal_flag(boc_sched_worker_t *self, bool value);
+
+#endif // BOCPY_SCHED_H
diff --git a/src/bocpy/tags.c b/src/bocpy/tags.c
new file mode 100644
index 0000000..e3bfdd1
--- /dev/null
+++ b/src/bocpy/tags.c
@@ -0,0 +1,108 @@
+/// @file tags.c
+/// @brief Out-of-line implementations for the message-tag API.
+///
+/// Hot-path operations (incref / decref / disable check) are
+/// `static inline` in `tags.h`; this TU houses the cold helpers
+/// (alloc / free / unicode bridges / comparisons).
+
+#define PY_SSIZE_T_CLEAN
+
+#include <Python.h>
+#include <string.h>
+
+#include "tags.h"
+
+BOCTag *tag_from_PyUnicode(PyObject *unicode, BOCQueue *queue) {
+  if (!PyUnicode_CheckExact(unicode)) {
+    PyErr_SetString(PyExc_TypeError, "Must be a str");
+    return NULL;
+  }
+
+  BOCTag *tag = (BOCTag *)PyMem_RawMalloc(sizeof(BOCTag));
+  if (tag == NULL) {
+    PyErr_NoMemory();
+    return NULL;
+  }
+
+  Py_ssize_t size = -1;
+  const char *str = PyUnicode_AsUTF8AndSize(unicode, &size);
+  if (str == NULL) {
+    // PyUnicode_AsUTF8AndSize sets the exception (UnicodeEncodeError on
+    // surrogates, etc.). Free the partial allocation before returning.
+    PyMem_RawFree(tag);
+    return NULL;
+  }
+
+  tag->size = size;
+  tag->str = (char *)PyMem_RawMalloc(tag->size + 1);
+
+  if (tag->str == NULL) {
+    PyErr_NoMemory();
+    PyMem_RawFree(tag);
+    return NULL;
+  }
+
+  memcpy(tag->str, str, tag->size + 1);
+  tag->queue = queue;
+  // Return with rc = 1: callers receive an owning reference. The prior
+  // rc = 0 idiom required every caller to TAG_INCREF immediately after
+  // the publish-store, but the publish-then-incref window left the
+  // tag visible to peers at rc = 0 and a racing TAG_DECREF could free
+  // it before the publisher's INCREF ran.
+  atomic_store(&tag->rc, 1);
+  atomic_store(&tag->disabled, 0);
+
+  return tag;
+}
+
+PyObject *tag_to_PyUnicode(BOCTag *tag) {
+  return PyUnicode_FromStringAndSize(tag->str, tag->size);
+}
+
+void BOCTag_free(BOCTag *tag) {
+  PyMem_RawFree(tag->str);
+  PyMem_RawFree(tag);
+}
+
+int tag_compare_with_utf8(BOCTag *lhs, const char *rhs_str,
+                          Py_ssize_t rhs_size) {
+  Py_ssize_t size = lhs->size < rhs_size ? lhs->size : rhs_size;
+  char *lhs_ptr = lhs->str;
+  const char *rhs_ptr = rhs_str;
+  for (Py_ssize_t i = 0; i < size; ++i, ++lhs_ptr, ++rhs_ptr) {
+    int8_t a = (int8_t)(*lhs_ptr);
+    int8_t b = (int8_t)(*rhs_ptr);
+
+    if (a < b) {
+      return -1;
+    }
+    if (a > b) {
+      return 1;
+    }
+  }
+
+  if (lhs->size < rhs_size) {
+    return -1;
+  }
+
+  if (lhs->size > rhs_size) {
+    return 1;
+  }
+
+  return 0;
+}
+
+int tag_compare_with_PyUnicode(BOCTag *lhs, PyObject *rhs_op) {
+  if (!PyUnicode_CheckExact(rhs_op)) {
+    PyErr_SetString(PyExc_TypeError, "Must be a str");
+    return -2;
+  }
+
+  Py_ssize_t rhs_size = -1;
+  const char *rhs_str = PyUnicode_AsUTF8AndSize(rhs_op, &rhs_size);
+  if (rhs_str == NULL) {
+    return -2;
+  }
+
+  return tag_compare_with_utf8(lhs, rhs_str, rhs_size);
+}
diff --git a/src/bocpy/tags.h b/src/bocpy/tags.h
new file mode 100644
index 0000000..7bd6e2c
--- /dev/null
+++ b/src/bocpy/tags.h
@@ -0,0 +1,113 @@
+/// @file tags.h
+/// @brief Message-tag table API shared between TUs.
+///
+/// A `BOCTag` names a message stream and pins one of the 16 fixed
+/// `BOCQueue` slots in the global queue table. Tags are reference
+/// counted so they can be cached on the per-interpreter
+/// @c _core_module_state.queue_tags[] array without leaking the queue
+/// slot.
+/// Hot-path operations (incref / decref / disable check) are
+/// `static inline` so every TU that includes this header gets the same
+/// inlined code as the original `_core.c`.
+///
+/// @note `BOCQueue` itself stays opaque here — `BOCTag` only stores a
+/// `BOCQueue *` and never reaches into the queue body.
+
+#ifndef BOCPY_TAGS_H
+#define BOCPY_TAGS_H
+
+#define PY_SSIZE_T_CLEAN
+
+#include <Python.h>
+
+#include "compat.h"
+
+/// @brief Forward declaration. Body defined in `_core.c` (later
+/// `message_queue.h`); tags only carry a pointer.
+typedef struct boc_queue BOCQueue;
+
+/// @brief A tag for a BOC message.
+typedef struct boc_tag {
+  /// @brief The UTF-8 string value of the tag
+  char *str;
+  /// @brief The number of bytes in str (not including the NULL)
+  Py_ssize_t size;
+  /// @brief A pointer to the queue that this tag is associated with
+  BOCQueue *queue;
+  /// @brief Reference count; the tag is freed on the 1 -> 0 transition.
+  atomic_int_least64_t rc;
+  /// @brief Non-zero once tag_disable() has retired the tag.
+  atomic_int_least64_t disabled;
+} BOCTag;
+
+/// @brief Creates a new BOCTag object from a Python Unicode string.
+/// @details The result object will not be dependent on the argument in any way
+/// (i.e., it can be safely deallocated). On success the returned tag has
+/// reference count 1; the caller owns one reference and must arrange for
+/// it to be released via @c TAG_DECREF (or @c BOCTag_free for non-rc
+/// owners such as the @c BehaviorCapsule thunk path) when no longer
+/// needed. On failure (non-str argument, UTF-8 encoding error, OOM)
+/// returns NULL with a Python exception set; no partial state is left
+/// behind.
+/// @param unicode A PyUnicode object
+/// @param queue The queue to associate with this tag
+/// @return a new BOCTag object with rc=1, or NULL on failure
+BOCTag *tag_from_PyUnicode(PyObject *unicode, BOCQueue *queue);
+
+/// @brief Converts a BOCTag to a PyUnicode object.
+/// @note This method uses PyUnicode_FromStringAndSize() internally.
+/// @param tag The tag to convert
+/// @return A new reference to a PyUnicode object.
+PyObject *tag_to_PyUnicode(BOCTag *tag);
+
+/// @brief Frees a BOCTag object and any associated memory.
+/// @param tag The tag to free
+void BOCTag_free(BOCTag *tag);
+
+/// @brief Compares a BOCTag with a UTF8 string.
+/// @details -1 if the tag should be placed before, 1 if after, 0 if equivalent
+/// @param lhs The BOCTag to compare
+/// @param rhs_str The string to compare with
+/// @param rhs_size The length of the comparison string
+/// @return -1 if before, 1 if after, 0 if equivalent
+int tag_compare_with_utf8(BOCTag *lhs, const char *rhs_str,
+                          Py_ssize_t rhs_size);
+
+/// @brief Compares a BOCTag with a PyUnicode object.
+/// @details -1 if the tag should be placed before, 1 if after, 0 if equivalent
+/// @param lhs The BOCTag to compare
+/// @param rhs_op The PyUnicode to compare with
+/// @return -1 if before, 1 if after, 0 if equivalent. -2 on error.
+int tag_compare_with_PyUnicode(BOCTag *lhs, PyObject *rhs_op);
+
+// ---------------------------------------------------------------------------
+// Hot-path inlines.
+//
+// These were `static` in `_core.c` and called via the TAG_INCREF /
+// TAG_DECREF macros on the send / receive / set_tags paths. Promoting
+// them to `static inline` in this header preserves the inlining when
+// the macros are used from any including TU (and matches CPython's
+// `Py_INCREF` / `Py_DECREF` header-inline pattern).
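+//
+// A hedged usage sketch of the rc = 1 ownership convention (the
+// `publish` step is hypothetical; real callers live in `_core.c`):
+//
+//   BOCTag *tag = tag_from_PyUnicode(name, queue);  // rc == 1, caller owns
+//   if (tag == NULL) {
+//     return NULL;          // exception already set by the constructor
+//   }
+//   TAG_INCREF(tag);        // one extra ref for the published copy...
+//   publish(tag);           // ...taken BEFORE peers can TAG_DECREF it
+//   TAG_DECREF(tag);        // drop the creation reference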
+// --------------------------------------------------------------------------- + +static inline int_least64_t tag_decref(BOCTag *tag) { + int_least64_t rc = atomic_fetch_add(&tag->rc, -1) - 1; + if (rc == 0) { + BOCTag_free(tag); + } + + return rc; +} + +#define TAG_DECREF(t) tag_decref(t) + +static inline int_least64_t tag_incref(BOCTag *tag) { + return atomic_fetch_add(&tag->rc, 1) + 1; +} + +#define TAG_INCREF(t) tag_incref(t) + +static inline bool tag_is_disabled(BOCTag *tag) { + return atomic_load(&tag->disabled); +} + +static inline void tag_disable(BOCTag *tag) { atomic_store(&tag->disabled, 1); } + +#endif // BOCPY_TAGS_H diff --git a/src/bocpy/terminator.c b/src/bocpy/terminator.c new file mode 100644 index 0000000..c005990 --- /dev/null +++ b/src/bocpy/terminator.c @@ -0,0 +1,120 @@ +/// @file terminator.c +/// @brief Implementation of the process-global rundown counter. +/// +/// All state lives in file-scope statics so that every sub-interpreter +/// in the same process shares one counter, mutex, and condvar. See +/// `terminator.h` for the public API and lifecycle contract. + +#include "terminator.h" + +#include "compat.h" + +/// @brief Active behavior count + the Pyrona seed. +static atomic_int_least64_t TERMINATOR_COUNT = 0; + +/// @brief Set to 1 by terminator_close() to refuse further increments. +static atomic_int_least64_t TERMINATOR_CLOSED = 0; + +/// @brief One-shot guard for the Pyrona seed: 1 = seed still present. +static atomic_int_least64_t TERMINATOR_SEEDED = 0; + +/// @brief Mutex protecting TERMINATOR_COND. +static BOCMutex TERMINATOR_MUTEX; + +/// @brief Condition variable signalled when TERMINATOR_COUNT reaches 0. +static BOCCond TERMINATOR_COND; + +void terminator_init(void) { + // The Pyrona seed (count=1, seeded=1) is set by terminator_reset() + // when the runtime starts; here we only initialize the kernel + // objects. + boc_mtx_init(&TERMINATOR_MUTEX); + cnd_init(&TERMINATOR_COND); +} + +int_least64_t terminator_inc(void) { + if (atomic_load(&TERMINATOR_CLOSED)) { + return -1; + } + int_least64_t newval = atomic_fetch_add(&TERMINATOR_COUNT, 1) + 1; + if (atomic_load(&TERMINATOR_CLOSED)) { + int_least64_t after = atomic_fetch_add(&TERMINATOR_COUNT, -1) - 1; + if (after == 0) { + mtx_lock(&TERMINATOR_MUTEX); + cnd_broadcast(&TERMINATOR_COND); + mtx_unlock(&TERMINATOR_MUTEX); + } + return -1; + } + return newval; +} + +int_least64_t terminator_dec(void) { + int_least64_t newval = atomic_fetch_add(&TERMINATOR_COUNT, -1) - 1; + if (newval == 0) { + mtx_lock(&TERMINATOR_MUTEX); + cnd_broadcast(&TERMINATOR_COND); + mtx_unlock(&TERMINATOR_MUTEX); + } + return newval; +} + +void terminator_close(void) { atomic_store(&TERMINATOR_CLOSED, 1); } + +bool terminator_wait(double timeout, bool wait_forever) { + bool ok = true; + double end_time = wait_forever ? 
0.0 : boc_now_s() + timeout;
+  mtx_lock(&TERMINATOR_MUTEX);
+  while (atomic_load(&TERMINATOR_COUNT) != 0) {
+    if (!wait_forever) {
+      double now = boc_now_s();
+      if (now >= end_time) {
+        ok = false;
+        break;
+      }
+      cnd_timedwait_s(&TERMINATOR_COND, &TERMINATOR_MUTEX, end_time - now);
+    } else {
+      cnd_wait(&TERMINATOR_COND, &TERMINATOR_MUTEX);
+    }
+  }
+  mtx_unlock(&TERMINATOR_MUTEX);
+  return ok;
+}
+
+bool terminator_seed_dec(void) {
+  int_least64_t prev = atomic_exchange(&TERMINATOR_SEEDED, 0);
+  if (prev == 1) {
+    int_least64_t newval = atomic_fetch_add(&TERMINATOR_COUNT, -1) - 1;
+    if (newval == 0) {
+      mtx_lock(&TERMINATOR_MUTEX);
+      cnd_broadcast(&TERMINATOR_COND);
+      mtx_unlock(&TERMINATOR_MUTEX);
+    }
+    return true;
+  }
+  return false;
+}
+
+void terminator_reset(int_least64_t *prior_count, int_least64_t *prior_seeded) {
+  // Fence: raise the closed bit before we touch anything else so any
+  // stray thread still holding a reference to the previous runtime
+  // (e.g. a late whencall call) is refused by terminator_inc rather
+  // than slipping a new behavior past the reset boundary. We clear
+  // the bit again at the end, once the new COUNT/SEEDED values have
+  // been published, so a fresh start() sees closed=0.
+  atomic_store(&TERMINATOR_CLOSED, 1);
+  mtx_lock(&TERMINATOR_MUTEX);
+  *prior_count = atomic_load(&TERMINATOR_COUNT);
+  *prior_seeded = atomic_load(&TERMINATOR_SEEDED);
+  atomic_store(&TERMINATOR_COUNT, 1);
+  atomic_store(&TERMINATOR_SEEDED, 1);
+  atomic_store(&TERMINATOR_CLOSED, 0);
+  cnd_broadcast(&TERMINATOR_COND);
+  mtx_unlock(&TERMINATOR_MUTEX);
+}
+
+int_least64_t terminator_seeded(void) {
+  return atomic_load(&TERMINATOR_SEEDED);
+}
+
+int_least64_t terminator_count(void) { return atomic_load(&TERMINATOR_COUNT); }
diff --git a/src/bocpy/terminator.h b/src/bocpy/terminator.h
new file mode 100644
index 0000000..b4ad4f6
--- /dev/null
+++ b/src/bocpy/terminator.h
@@ -0,0 +1,84 @@
+/// @file terminator.h
+/// @brief Process-global rundown counter API shared between TUs.
+///
+/// The terminator is the C-level barrier that gates `Behaviors.wait()` /
+/// `stop()`. Increment from caller threads in `whencall` (before the
+/// schedule call) and decrement from worker threads after
+/// `behavior_release_all` completes. A one-shot "Pyrona seed" of 1 keeps
+/// the count positive between the runtime starting and `stop()` taking
+/// it down via @ref terminator_seed_dec.
+///
+/// State is process-global (file-scope statics in `terminator.c`, NOT
+/// per-interpreter) so every sub-interpreter sees the same counter,
+/// mutex, and condvar.
+///
+/// Lifecycle:
+/// - @ref terminator_reset arms a fresh runtime: count = 1 (the seed),
+///   seeded = 1, closed = 0. Returns the prior `(count, seeded)` so
+///   `Behaviors.start` can detect drift carried over from a previous
+///   run that died without reconciliation.
+/// - @ref terminator_inc returns -1 once @ref terminator_close has
+///   been called, so the `whencall` fast path can refuse new work
+///   without racing teardown.
+/// - @ref terminator_seed_dec is the idempotent one-shot that drops
+///   the seed; subsequent calls are no-ops.
+/// - @ref terminator_wait blocks on the condvar until count reaches 0.
+/// - @ref terminator_close raises the closed bit so any straggler
+///   @ref terminator_inc returns -1.
+
+#ifndef BOCPY_TERMINATOR_H
+#define BOCPY_TERMINATOR_H
+
+#include <stdbool.h>
+#include <stdint.h>
+
+/// @brief Initialize the terminator mutex and condvar.
+/// @details Called once from `_core_module_exec` on first interpreter
+/// load.
+/// The kernel objects intentionally outlive module unload (no
+/// matching destroy), matching the original behaviour in `_core.c`.
+void terminator_init(void);
+
+/// @brief Increment the counter, refusing if closed.
+/// @return Post-increment count on success, or -1 if the terminator is
+/// closed (runtime is shutting down).
+int_least64_t terminator_inc(void);
+
+/// @brief Decrement the counter. Wakes @ref terminator_wait on
+/// 0-transition.
+/// @return The new count.
+int_least64_t terminator_dec(void);
+
+/// @brief Set the closed bit. Future @ref terminator_inc calls return
+/// -1.
+void terminator_close(void);
+
+/// @brief Block until the counter reaches 0.
+/// @details Caller MUST release the GIL before invoking. Pass
+/// @p wait_forever = true to wait indefinitely; otherwise the wait is
+/// capped at @p timeout seconds (a non-positive @p timeout times out
+/// immediately).
+/// @param timeout Maximum wait in seconds. Ignored if @p wait_forever.
+/// @param wait_forever If true, ignore @p timeout and wait until
+/// signalled.
+/// @return true on success, false on timeout.
+bool terminator_wait(double timeout, bool wait_forever);
+
+/// @brief Idempotent one-shot decrement of the Pyrona seed.
+/// @return true if this call removed the seed, false if it was already
+/// removed.
+bool terminator_seed_dec(void);
+
+/// @brief Restore terminator state for a fresh runtime start.
+/// @details Sets count=1 (seed), clears the closed bit, and re-arms the
+/// seed one-shot. Returns the prior `(count, seeded)` via the out
+/// parameters so callers can detect drift from a previous run that
+/// died without reaching its reconciliation point.
+/// @param prior_count Out param for the prior count.
+/// @param prior_seeded Out param for the prior seeded flag.
+void terminator_reset(int_least64_t *prior_count, int_least64_t *prior_seeded);
+
+/// @brief Read the current seeded flag.
+int_least64_t terminator_seeded(void);
+
+/// @brief Read the current counter.
+int_least64_t terminator_count(void);
+
+#endif // BOCPY_TERMINATOR_H
diff --git a/src/bocpy/transpiler.py b/src/bocpy/transpiler.py
index a84b32c..4608f69 100644
--- a/src/bocpy/transpiler.py
+++ b/src/bocpy/transpiler.py
@@ -7,6 +7,15 @@
 from typing import Mapping, NamedTuple, Set
 
 
+def _has_when_decorator(node: ast.FunctionDef) -> bool:
+    """Return True if the function carries an ``@when(...)`` decorator."""
+    for dec in node.decorator_list:
+        if (isinstance(dec, ast.Call) and isinstance(dec.func, ast.Name)
+                and dec.func.id == "when"):
+            return True
+    return False
+
+
 class CapturedVariableFinder(ast.NodeVisitor):
     """Finds captured variables in a FunctionDef."""
 
@@ -41,6 +50,21 @@
     def visit_FunctionDef(self, node: ast.FunctionDef):  # noqa: N802
         for stmt in node.body:
             if isinstance(stmt, ast.FunctionDef):
                 self.local_vars.add(stmt.name)
+                # A nested @when is rewritten by WhenTransformer into a
+                # whencall(...) at this position. The cown arguments and the
+                # capture tuple are evaluated in *this* (outer) frame, so any
+                # free names they reference must appear in the outer
+                # behavior's captures. Plain nested def's keep their normal
+                # opaque treatment because Python's own closure handles them.
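+                # For example (hypothetical user code, where ``b`` and
+                # ``step`` are free names from the frame enclosing
+                # ``outer``):
+                #
+                #     @when(a)
+                #     def outer(a):
+                #         @when(b)
+                #         def inner(inner_b):
+                #             inner_b.value += step
+                #
+                # ``b`` (a cown argument) and ``step`` (a captured free
+                # name) are evaluated in outer's frame at dispatch, so
+                # both must join outer's captures.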
+                if _has_when_decorator(stmt):
+                    inner = CapturedVariableFinder(self.known_vars)
+                    inner.visit(stmt)
+                    self.used_vars |= inner.captured_vars
+                    for dec in stmt.decorator_list:
+                        if (isinstance(dec, ast.Call)
+                                and isinstance(dec.func, ast.Name)
+                                and dec.func.id == "when"):
+                            for arg in dec.args:
+                                self.visit(arg)
                 continue
             self.generic_visit(stmt)
diff --git a/src/bocpy/worker.py b/src/bocpy/worker.py
index 0b6185d..2db90a4 100644
--- a/src/bocpy/worker.py
+++ b/src/bocpy/worker.py
@@ -1,6 +1,5 @@
 """Worker process that runs exported behaviors in subinterpreters."""
 
-import importlib.util
 import logging
 import sys
 
@@ -12,89 +11,91 @@
 logger = logging.getLogger(f"worker{index}")
 
 
-def load_boc_module(module_name, file_path):
-    """Load a bocpy-exported module into this interpreter."""
-    logger.debug(f"Loading bocpy export {module_name} from {file_path}")
-    # Create a module specification from the file location
-    spec = importlib.util.spec_from_file_location(module_name, file_path)
-
-    # Create a new module based on the spec
-    module = importlib.util.module_from_spec(spec)
-
-    # Register the module in sys.modules
-    sys.modules[module_name] = module
-
-    # Execute the module
-    spec.loader.exec_module(module)
-
-
 boc_export = None
-# The boc_export module and any of its classes which are needed for unpickling
-# are loaded and aliased within these tags when the worker script is generated.
+# The boc_export module and any of its classes which are needed for
+# unpickling are loaded and aliased within these tags when the worker
+# script is generated. The transpiled source is embedded as a Python
+# string literal (via ``repr()``) and exec'd into a fresh
+# ``types.ModuleType``; a ``linecache`` entry under a synthetic
+# filename ``<boc_export>`` keeps tracebacks pointing at the
+# transpiled source line. No on-disk artifact is created.
 # BEGIN boc_export
 # END boc_export
 
 
 def run_behavior(behavior):
-    """Execute a single behavior and release its requests inline."""
+    """Execute a single behavior and release its requests inline.
+
+    Layered ``try/finally`` blocks guarantee the MCS unlink and the
+    terminator decrement run even when the body raises a non-``Exception``
+    ``BaseException`` (``KeyboardInterrupt``, ``SystemExit``,
+    ``PythonFinalizationError`` since 3.13). Such an exception is **not**
+    caught here — it propagates upward through the finallies, which is
+    exactly what we want: every cleanup step still runs, and the outer
+    worker loop in :func:`do_work` re-raises so the worker exits cleanly.
+    Only ``Exception`` (user-code errors, transient C failures) is
+    explicitly handled and logged.
+    """
     try:
+        acquired = False
         try:
-            _core.noticeboard_cache_clear()
-            behavior.acquire()
-        except Exception as ex:
-            # acquire() / cache_clear() failed before the body ran. The
-            # MCS chain for this behavior is still linked (behavior_schedule
-            # established the links on the caller thread), so we must
-            # unwind it here or every successor blocks forever. Mark
-            # the result Cown with the exception so any caller awaiting
-            # it sees a diagnostic instead of a permanent None.
-            logger.exception(ex)
             try:
-                behavior.set_exception(ex)
-            except Exception as inner:
-                logger.exception(inner)
+                _core.noticeboard_cache_clear()
+                behavior.acquire()
+                acquired = True
+            except Exception as ex:
+                # acquire() / cache_clear() failed before the body ran.
+ # The MCS chain is still linked (behavior_schedule + # established the links on the caller thread), so the + # outer finally below MUST run release/release_all to + # unwind it -- otherwise every successor blocks forever. + # Mark the result Cown so a caller awaiting it sees a + # diagnostic instead of a permanent None. + logger.exception(ex) + try: + behavior.set_exception(ex) + except Exception as inner: + logger.exception(inner) + # Fall through: `acquired` is False, so we skip execute() + # but still run the release pair in the outer finally. + + if acquired: + try: + behavior.execute(boc_export) + except Exception as ex: + logger.exception(ex) + behavior.set_exception(ex) + finally: + # Runs on every path: clean acquire, failed acquire, normal + # body return, body Exception, OR body KI/SystemExit (which + # propagates after this finally completes). + # # acquire() is sequential (result -> args -> captures) and # bails on first failure, so on a partial-success raise some # cowns are owned by this worker and some are not. release() - # is similarly tolerant (it short-circuits NO_OWNER cowns), - # so calling it here releases the ones we did acquire before - # release_all hands the request to a successor. Without this - # the successor's cown_acquire fails with "already acquired - # by " and every behavior on that cown strands. + # is tolerant (it short-circuits NO_OWNER cowns), so calling + # it here releases the ones we did acquire before + # release_all hands the request to a successor. try: behavior.release() - except Exception as inner: - logger.exception(inner) + except Exception as ex: + logger.exception(ex) + # Release the request array on the worker thread instead of + # round-tripping ("release", capsule) through the (now-gone) + # central scheduler thread. try: behavior.release_all() - except Exception as inner: - logger.exception(inner) - return - - try: - behavior.execute(boc_export) - except Exception as ex: - logger.exception(ex) - behavior.set_exception(ex) - - try: - behavior.release() - except Exception as ex: - logger.exception(ex) - # Release the request array on the worker thread instead of - # round-tripping ("release", capsule) through the (now-gone) - # central scheduler thread. - try: - behavior.release_all() - except Exception as ex: - logger.exception(ex) + except Exception as ex: + logger.exception(ex) finally: # Drop the terminator hold unconditionally. If anything above - # raised, failing to decrement here would leave wait() hung - # forever. Log and swallow so a single misbehaving worker step - # cannot strand the runtime. + # raised (Exception or BaseException), failing to decrement + # here would leave wait() hung forever. Log and swallow + # Exception so a single misbehaving step cannot strand the + # runtime; KI/SystemExit from terminator_dec itself is + # extraordinarily unlikely (pure C atomic) and would propagate. try: _core.terminator_dec() except Exception as ex: @@ -104,24 +105,43 @@ def run_behavior(behavior): def do_work(): """Main worker loop receiving behaviors or shutdown messages.""" try: - running = True logger.debug("worker starting") + # Claim a scheduler slot and stamp the per-thread TLS handle + # before announcing readiness. Subsequent dispatch / pop paths + # rely on this slot being installed. If registration fails + # (over-spawn vs. scheduler_runtime_start), surface the error + # so start_workers stops waiting. 
+ try: + slot = _core.scheduler_worker_register() + logger.debug("registered scheduler slot %d", slot) + except Exception as ex: + logger.exception(ex) + send("boc_behavior", f"register failed: {ex}") + return send("boc_behavior", "started") - while running: + while True: try: - match receive("boc_worker"): - case ["boc_worker", "shutdown"]: - logger.debug("boc_worker/shutdown") - running = False - - case ["boc_worker", behavior]: - run_behavior(behavior) - behavior = None + # scheduler_worker_pop blocks on the worker's own + # condvar (with the GIL released). It returns None + # only when scheduler_request_stop_all has been + # called by stop_workers. + behavior = _core.scheduler_worker_pop() + if behavior is None: + logger.debug("scheduler stop signal received") + break + run_behavior(behavior) + behavior = None + except (KeyboardInterrupt, SystemExit): + # Propagate so the worker can wind down: the outer + # try/finally still sends "shutdown" before the + # interpreter exits, so stop_workers does not hang. + raise except Exception as ex: - # A failure inside run_behavior or receive must not - # break the loop -- if it did, this worker would exit - # without sending its "shutdown" reply and stop_workers - # would block forever waiting for it. + # A regular Exception inside run_behavior or + # scheduler_worker_pop must not break the loop -- if + # it did, this worker would exit without sending its + # "shutdown" reply and stop_workers would block forever + # waiting for it. logger.exception(ex) logger.debug("worker stopped") @@ -163,12 +183,30 @@ def cleanup(): logger.exception(ex) -do_work() -cleanup() - -logger = None - -# in Python 3.12 and prior, the threading module can cause issues with -# subinterpreter destruction -del sys.modules["logging"] -del sys.modules["threading"] +try: + do_work() +finally: + # Always run cleanup, even if do_work() bubbled out a + # KeyboardInterrupt / SystemExit / PythonFinalizationError. + # Skipping cleanup leaves XIData objects live inside this + # sub-interpreter; subsequent destruction then fails with + # "interpreter has live cross-interpreter data" and the + # worker pool teardown blocks. + # + # The post-cleanup `sys.modules` clears below are also + # destruction-critical on Python 3.12 and prior, so they live in + # an inner `finally` that runs even if `cleanup()` itself raises + # a BaseException (e.g. KeyboardInterrupt parking inside + # `receive("boc_cleanup")`, or PythonFinalizationError out of + # `_core.recycle()`). Skipping them re-introduces the + # subinterpreter-destruction wedge in mirror image. + try: + cleanup() + finally: + logger = None + # in Python 3.12 and prior, the threading module can cause + # issues with subinterpreter destruction. `pop(..., None)` + # is used instead of `del` so a module already removed by + # an earlier failure path does not raise KeyError here. + for _modname in ("logging", "threading"): + sys.modules.pop(_modname, None) diff --git a/src/bocpy/xidata.h b/src/bocpy/xidata.h new file mode 100644 index 0000000..60b0dc7 --- /dev/null +++ b/src/bocpy/xidata.h @@ -0,0 +1,206 @@ +/// @file xidata.h +/// @brief Cross-interpreter data (XIData) compatibility shim for bocpy. 
+///
+/// CPython's cross-interpreter data API has changed names and semantics
+/// across releases:
+/// - 3.14+: `_PyXIData_*` / `_PyXIData_t`
+/// - 3.13: `_PyCrossInterpreterData_*` / `_PyCrossInterpreterData`
+/// - 3.12: `_PyCrossInterpreterData_*`, partial API
+/// - <3.12: no multi-GIL support — provides a stub `xidata_init` so
+///   the code compiles, but `BOC_NO_MULTIGIL` is defined and
+///   features that depend on cross-interpreter sharing are
+///   compiled out at the call site.
+///
+/// Before this TU split, both `_core.c` and `_math.c` carried near-
+/// identical `#if PY_VERSION_HEX` ladders. Centralising them here is a
+/// pure mechanical refactor — the macros expand to the same CPython
+/// internal symbols on every supported version, so behaviour is
+/// unchanged. Helper functions are `static inline` so a TU that does
+/// not call (e.g.) `xidata_supported` does not emit an unused-function
+/// warning.
+
+#ifndef BOCPY_XIDATA_H
+#define BOCPY_XIDATA_H
+
+#define PY_SSIZE_T_CLEAN
+
+#include <Python.h>
+#include <stdbool.h>
+
+#if PY_VERSION_HEX >= 0x030D0000
+#define Py_BUILD_CORE
+#include <internal/pycore_crossinterp.h>
+#endif
+
+#if PY_VERSION_HEX >= 0x030E0000  // 3.14
+
+#define XIDATA_FREE _PyXIData_Free
+#define XIDATA_SET_FREE _PyXIData_SET_FREE
+#define XIDATA_NEW() _PyXIData_New()
+#define XIDATA_NEWOBJECT _PyXIData_NewObject
+#define XIDATA_GETXIDATA(value, xidata) \
+  _PyObject_GetXIDataNoFallback(PyThreadState_GET(), (value), (xidata))
+#define XIDATA_INIT _PyXIData_Init
+#define XIDATA_REGISTERCLASS(type, cb) \
+  _PyXIData_RegisterClass(PyThreadState_GET(), (type), \
+                          (_PyXIData_getdata_t){.basic = (cb)})
+#define XIDATA_T _PyXIData_t
+
+static inline bool xidata_supported(PyObject *op) {
+  _PyXIData_getdata_t getdata = _PyXIData_Lookup(PyThreadState_GET(), op);
+  return getdata.basic != NULL || getdata.fallback != NULL;
+}
+
+#elif PY_VERSION_HEX >= 0x030D0000  // 3.13
+
+#define XIDATA_FREE _PyCrossInterpreterData_Free
+#define XIDATA_NEW() _PyCrossInterpreterData_New()
+#define XIDATA_NEWOBJECT _PyCrossInterpreterData_NewObject
+#define XIDATA_GETXIDATA(value, xidata) \
+  _PyObject_GetCrossInterpreterData((value), (xidata))
+#define XIDATA_INIT _PyCrossInterpreterData_Init
+#define XIDATA_REGISTERCLASS(type, cb) \
+  _PyCrossInterpreterData_RegisterClass((type), (crossinterpdatafunc)(cb))
+#define XIDATA_T _PyCrossInterpreterData
+
+static inline void xidata_set_free(XIDATA_T *xidata, void (*freefunc)(void *)) {
+  xidata->free = freefunc;
+}
+
+static inline bool xidata_supported(PyObject *op) {
+  crossinterpdatafunc getdata = _PyCrossInterpreterData_Lookup(op);
+  return getdata != NULL;
+}
+
+#define XIDATA_SET_FREE xidata_set_free
+
+#elif PY_VERSION_HEX >= 0x030C0000  // 3.12
+
+#define XIDATA_NEWOBJECT _PyCrossInterpreterData_NewObject
+#define XIDATA_INIT _PyCrossInterpreterData_Init
+#define XIDATA_GETXIDATA(value, xidata) \
+  _PyObject_GetCrossInterpreterData((value), (xidata))
+#define XIDATA_REGISTERCLASS(type, cb) \
+  _PyCrossInterpreterData_RegisterClass((type), (crossinterpdatafunc)(cb))
+#define XIDATA_T _PyCrossInterpreterData
+
+static inline XIDATA_T *xidata_new(void) {
+  XIDATA_T *xidata = (XIDATA_T *)PyMem_RawMalloc(sizeof(XIDATA_T));
+  if (xidata == NULL) {
+    PyErr_NoMemory();
+    return NULL;
+  }
+  xidata->data = NULL;
+  xidata->free = NULL;
+  xidata->interp = -1;
+  xidata->new_object = NULL;
+  xidata->obj = NULL;
+  return xidata;
+}
+
+static inline void xidata_set_free(XIDATA_T *xidata, void (*freefunc)(void *)) {
+  xidata->free = freefunc;
+}
+
+static inline bool xidata_supported(PyObject *op) {
+  crossinterpdatafunc getdata =
_PyCrossInterpreterData_Lookup(op);
+  return getdata != NULL;
+}
+
+static inline void xidata_free(void *arg) {
+  XIDATA_T *xidata = (XIDATA_T *)arg;
+  if (xidata->data != NULL) {
+    if (xidata->free != NULL) {
+      xidata->free(xidata->data);
+    }
+    xidata->data = NULL;
+  }
+  Py_CLEAR(xidata->obj);
+  PyMem_RawFree(arg);
+}
+
+#define XIDATA_SET_FREE xidata_set_free
+#define XIDATA_NEW xidata_new
+#define XIDATA_FREE xidata_free
+
+#else
+
+#define BOC_NO_MULTIGIL
+
+#define XIDATA_NEWOBJECT _PyCrossInterpreterData_NewObject
+#define XIDATA_GETXIDATA(value, xidata) \
+  _PyObject_GetCrossInterpreterData((value), (xidata))
+#define XIDATA_REGISTERCLASS(type, cb) \
+  _PyCrossInterpreterData_RegisterClass((type), (crossinterpdatafunc)(cb))
+#define XIDATA_T _PyCrossInterpreterData
+
+static inline void xidata_set_free(XIDATA_T *xidata, void (*freefunc)(void *)) {
+  xidata->free = freefunc;
+}
+
+static inline void xidata_free(void *arg) {
+  XIDATA_T *xidata = (XIDATA_T *)arg;
+  if (xidata->data != NULL) {
+    if (xidata->free != NULL) {
+      xidata->free(xidata->data);
+    }
+    xidata->data = NULL;
+  }
+  Py_CLEAR(xidata->obj);
+  PyMem_RawFree(arg);
+}
+
+static inline XIDATA_T *xidata_new(void) {
+  XIDATA_T *xidata = (XIDATA_T *)PyMem_RawMalloc(sizeof(XIDATA_T));
+  if (xidata == NULL) {
+    PyErr_NoMemory();
+    return NULL;
+  }
+  xidata->data = NULL;
+  xidata->free = NULL;
+  xidata->interp = -1;
+  xidata->new_object = NULL;
+  xidata->obj = NULL;
+  return xidata;
+}
+
+static inline void
+xidata_init(XIDATA_T *data, PyInterpreterState *interp, void *shared,
+            PyObject *obj, PyObject *(*new_object)(_PyCrossInterpreterData *)) {
+  assert(data->data == NULL);
+  assert(data->obj == NULL);
+  *data = (_PyCrossInterpreterData){0};
+  data->interp = -1;
+
+  assert(data != NULL);
+  assert(new_object != NULL);
+  data->data = shared;
+  if (obj != NULL) {
+    assert(interp != NULL);
+    data->obj = Py_NewRef(obj);
+  }
+  data->interp = (interp != NULL) ? PyInterpreterState_GetID(interp) : -1;
+  data->new_object = new_object;
+}
+
+#define XIDATA_SET_FREE xidata_set_free
+#define XIDATA_NEW xidata_new
+#define XIDATA_INIT xidata_init
+#define XIDATA_FREE xidata_free
+
+static inline bool xidata_supported(PyObject *op) {
+  crossinterpdatafunc getdata = _PyCrossInterpreterData_Lookup(op);
+  return getdata != NULL;
+}
+
+static inline PyObject *PyErr_GetRaisedException(void) {
+  PyObject *et = NULL;
+  PyObject *ev = NULL;
+  PyObject *tb = NULL;
+  PyErr_Fetch(&et, &ev, &tb);
+  assert(et);
+  PyErr_NormalizeException(&et, &ev, &tb);
+  if (tb != NULL) {
+    PyException_SetTraceback(ev, tb);
+    Py_DECREF(tb);
+  }
+  Py_XDECREF(et);
+
+  return ev;
+}
+
+#endif
+
+#endif  // BOCPY_XIDATA_H
diff --git a/test/test_boc.py b/test/test_boc.py
index 9cbbda9..17c3747 100644
--- a/test/test_boc.py
+++ b/test/test_boc.py
@@ -3,11 +3,13 @@
 import functools
 import sys
 import threading
+import traceback
 from typing import NamedTuple
 
+import pytest
+
 from bocpy import Cown, drain, receive, send, start, TIMEOUT, wait, when
 from bocpy._core import CownCapsule
-import pytest
 
 RECEIVE_TIMEOUT = 10
 
@@ -99,6 +101,22 @@ def do_div0(x: Cown):
     return do_div0
 
 
+class RaiseOnUnpickle:
+    """Pickles cleanly but raises ZeroDivisionError when unpickled.
+
+    Used to drive the deserialisation-failure path inside
+    ``cown_acquire``. The ``__reduce__`` protocol stores
+    ``(eval, ("1/0",))``; ``eval`` is a builtin so the bytestream is
+    portable across sub-interpreters, and ``eval("1/0")`` raises
+    ``ZeroDivisionError`` when ``pickle.loads`` is called inside the
+    worker's ``cown_acquire``.
+ """ + + def __reduce__(self): + """Return a reduce tuple whose loader raises on unpickle.""" + return (eval, ("1/0",)) + + class Fork: """Simple fork that tracks usage and remaining hunger.""" @@ -602,6 +620,125 @@ def _(r): receive_asserts() +class TestCownAcquireDeserialiseFailure: + """``cown_acquire`` rolls back owner on unpickle failure. + + When ``xidata_to_object`` (which calls ``_PyPickle_Loads``) raises, + ``cown_acquire`` previously returned -1 with the cown left in a + half-acquired ``(owner=worker, value=NULL, xidata!=NULL)`` state. + The worker-side recovery arm in ``run_behavior`` then called + ``behavior.release()``, whose ``cown_release`` aborts on + ``assert(cown->value != NULL)`` (debug build) or NULL-derefs in + ``object_to_xidata`` (release build). + + The fix stores ``NO_OWNER`` back into ``cown->owner`` before + returning -1, so the recovery arm's ``cown_release`` short-circuits + cleanly via the ``owner == NO_OWNER`` branch and the result Cown + surfaces the exception to downstream behaviors. + """ + + @classmethod + def teardown_class(cls): + """Ensure runtime is drained after suite.""" + wait() + + def test_acquire_rollback_surfaces_exception(self): + """Acquire failure produces a result Cown with .exception True. + + The first behavior ``use_bad`` is scheduled against a Cown wrapping + an instance of :class:`RaiseOnUnpickle`. When the worker dequeues + the behavior and calls ``cown_acquire``, ``_PyPickle_Loads`` raises + ``ZeroDivisionError``. The worker's recovery arm marks ``use_bad``'s + result Cown with the exception. The downstream behavior ``check`` + observes ``b.exception is True``. + + Without this rollback, ``cown_release`` aborts before + ``check`` is ever scheduled, the assert messages never + arrive, and the test either segfaults or times out. + """ + bad = Cown(RaiseOnUnpickle()) + + @when(bad) + def use_bad(b): + # This body never runs — acquire fails first. + send("assert", (b.value, "unreachable")) + + @when(use_bad) + def check(b): + send("assert", (b.exception, True)) + send("assert", (isinstance(b.value, ZeroDivisionError), True)) + + receive_asserts(2) + + +class TestBehaviorCapsuleArgsSize: + """``BehaviorCapsule`` ``group_ids`` allocation corner cases. + + ``BehaviorCapsule_init`` allocates ``behavior->group_ids`` via + ``PyMem_RawCalloc(args_size, sizeof(int))``. Two corner cases must + work: + + * ``args_size == 0`` -- ``PyMem_RawCalloc`` may legally return NULL + for a zero-element request, so the NULL check must be guarded + ``args_size > 0``. + * ``args_size > 0`` with a successful allocation -- the standard + path; verifies the gating logic does not regress normal use. + + OOM injection for the failure path requires allocator hooks that + do not exist in the test infrastructure today. + """ + + @classmethod + def teardown_class(cls): + """Ensure runtime is drained after suite.""" + wait() + + def test_zero_args_behavior_capsule(self): + """BehaviorCapsule with empty args list must construct cleanly.""" + from bocpy import start as _start_runtime + from bocpy._core import BehaviorCapsule + try: + _start_runtime() + except RuntimeError: + pass # Runtime already started by a prior test. + + result = Cown(None) + # Empty args list — args_size == 0. The + # ``args_size > 0 && group_ids == NULL`` guard avoids a + # spurious failure if PyMem_RawCalloc(0, ...) returns NULL. 
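+        # Sketch of the C-side guard under test (names as quoted in the
+        # class docstring above; ``behavior`` is the capsule struct):
+        #
+        #     behavior->group_ids = PyMem_RawCalloc(args_size, sizeof(int));
+        #     if (args_size > 0 && behavior->group_ids == NULL) {
+        #         PyErr_NoMemory();
+        #         return -1;
+        #     }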
+        capsule = BehaviorCapsule(
+            "__behavior_zero_args__",
+            result.impl,
+            [],
+            [],
+        )
+        assert capsule is not None
+
+    def test_large_args_behavior_capsule(self):
+        """BehaviorCapsule with many args constructs and group_ids works."""
+        from bocpy import start as _start_runtime
+        from bocpy._core import BehaviorCapsule
+        try:
+            _start_runtime()
+        except RuntimeError:
+            pass  # Runtime already started by a prior test.
+
+        result = Cown(None)
+        # 32 distinct cowns with distinct group_ids. Exercises the
+        # group_ids[i] = group_id loop that NULL-derefs without
+        # the alloc check on OOM.
+        cowns = [Cown(i) for i in range(32)]
+        args = [(i, c.impl) for i, c in enumerate(cowns)]
+
+        capsule = BehaviorCapsule(
+            "__behavior_large_args__",
+            result.impl,
+            args,
+            [],
+        )
+        assert capsule is not None
+
+
 class TestExceptionFlag:
     """Tests for the Cown.exception flag distinguishing thrown vs returned."""
 
@@ -861,3 +998,308 @@ def _probe(c):
         assert isinstance(observed, CownCapsule), (
             f"slot {idx} returned {type(observed).__name__}, "
             "expected CownCapsule")
+
+
+class TestInMemoryExport:
+    """Regression tests for the in-memory transpiler export path.
+
+    Prior to the in-memory export path the transpiled module was
+    written to a temporary directory under ``tempfile.mkdtemp()``
+    and re-read by every worker via
+    ``importlib.util.spec_from_file_location``. That path had three
+    problems: a world-traversable on-disk artifact, a small TOCTOU
+    window between write and per-worker read, and an f-string
+    interpolation of ``module_name`` into ``r"..."`` that re-opened
+    a code-injection vector if a hostile name reached ``start()``.
+
+    The replacement embeds the transpiled source as a Python string
+    literal (via ``repr()``) inside the per-worker bootstrap, exec's
+    it into a fresh ``types.ModuleType``, and registers a
+    ``linecache`` entry under a synthetic filename
+    ``<boc_export>`` so tracebacks still point at the transpiled
+    source line. These tests exercise the surfaces that change.
+    """
+
+    @classmethod
+    def teardown_class(cls):
+        """Ensure the runtime is drained between this and the next class."""
+        wait()
+
+    def test_traceback_resolves_via_linecache(self):
+        """A raising body's traceback shows the transpiled source line.
+
+        The ``linecache`` registration in the worker bootstrap is the
+        only thing keeping tracebacks debuggable now that the source
+        is no longer on disk. We capture a worker-side traceback
+        string via ``traceback.format_exc()`` and assert it references
+        the synthetic ``<boc_export>`` filename — proof the bootstrap
+        registered the cache entry under that name.
+        """
+        c = Cown(0)
+        start(worker_count=2)
+        try:
+            @when(c)
+            def _b(c):  # noqa: B023
+                try:
+                    raise RuntimeError("synthetic-from-test-traceback")
+                except RuntimeError:
+                    send("tb_done", traceback.format_exc())
+            tag, tb_str = receive(["tb_done"], RECEIVE_TIMEOUT)
+            assert tag != TIMEOUT, "traceback probe timed out"
+        finally:
+            drain("tb_done")
+            wait()
+
+        # The traceback must reference the synthetic bootstrap
+        # filename ``<boc_export>`` (the test module is the
+        # worker's __main__ alias).
+        assert "<boc_export>" in tb_str
diff --git a/test/test_internal_mpmcq.py b/test/test_internal_mpmcq.py
new file mode 100644
--- /dev/null
+++ b/test/test_internal_mpmcq.py
+"""Unit tests for the MPMC behaviour queue (``boc_bq_*``).
+
+The queue is a faithful port of Verona's ``MPMCQ`` from
+``mpmcq.h``; we test it in isolation, decoupled from any production
+caller.
+""" + +from __future__ import annotations + +import threading + +import pytest + +bq = pytest.importorskip( + "bocpy._internal_test", + reason="internal test extension not built (set BOCPY_BUILD_INTERNAL_TESTS=1 and reinstall)", +) + + +# --------------------------------------------------------------------------- +# Single-threaded sanity +# --------------------------------------------------------------------------- + + +def test_empty_on_construction_and_after_drain(): + """A fresh queue is empty, and remains empty after a drain cycle.""" + q = bq.bq_make_queue() + assert bq.bq_is_empty(q) + + nodes = [bq.bq_make_node(i) for i in range(8)] + for n in nodes: + bq.bq_enqueue(q, n) + assert not bq.bq_is_empty(q) + + seen = [] + while True: + got = bq.bq_dequeue(q) + if got is None: + break + seen.append(got) + assert seen == list(range(8)) + assert bq.bq_is_empty(q) + + +def test_fifo_single_thread(): + """Single-thread enqueue / dequeue preserves FIFO order.""" + q = bq.bq_make_queue() + nodes = [bq.bq_make_node(i) for i in range(100)] + for n in nodes: + bq.bq_enqueue(q, n) + out = [bq.bq_dequeue(q) for _ in range(100)] + assert out == list(range(100)) + assert bq.bq_dequeue(q) is None + + +def test_dequeue_on_empty_returns_none(): + q = bq.bq_make_queue() + assert bq.bq_dequeue(q) is None + assert bq.bq_dequeue_all(q) == [] + + +def test_enqueue_front_on_empty_then_dequeue(): + """enqueue_front on an empty queue routes to the back path.""" + q = bq.bq_make_queue() + n = bq.bq_make_node(42) + bq.bq_enqueue_front(q, n) + assert not bq.bq_is_empty(q) + assert bq.bq_dequeue(q) == 42 + assert bq.bq_is_empty(q) + + +def test_enqueue_front_orders_before_existing(): + """A node pushed via enqueue_front comes out before existing items.""" + q = bq.bq_make_queue() + keep = [bq.bq_make_node(i) for i in range(3)] + for n in keep: + bq.bq_enqueue(q, n) + head = bq.bq_make_node(99) + bq.bq_enqueue_front(q, head) + out = [] + while True: + v = bq.bq_dequeue(q) + if v is None: + break + out.append(v) + assert out == [99, 0, 1, 2] + + +def test_dequeue_all_returns_fifo_segment(): + """dequeue_all returns every currently-enqueued node in FIFO order.""" + q = bq.bq_make_queue() + nodes = [bq.bq_make_node(i) for i in range(50)] + for n in nodes: + bq.bq_enqueue(q, n) + seg = bq.bq_dequeue_all(q) + assert seg == list(range(50)) + assert bq.bq_is_empty(q) + + +# --------------------------------------------------------------------------- +# Multi-producer stress +# --------------------------------------------------------------------------- + + +@pytest.mark.parametrize("producers,per_producer", [(8, 20_000)]) +def test_mpmc_stress_no_loss_no_dup(producers, per_producer): + """Many producers, two consumers (one dequeue + one dequeue_all loop). + + With ``producers * per_producer`` enqueues split across encoded + producer IDs, every value must appear exactly once on the consumer + side. ``producers=8, per_producer=2000`` already exceeds 10^4 ops; + raise ``per_producer`` to push past 10^6 when stress-bumping + locally. + """ + total = producers * per_producer + q = bq.bq_make_queue() + + # Pre-allocate every node up front (alloc under GIL is not what we + # want to stress). Encode (producer_id, sequence) in a single int + # so the consumer side can verify per-producer FIFO ordering. 
+ nodes = [ + [bq.bq_make_node(p * per_producer + i) for i in range(per_producer)] + for p in range(producers) + ] + + seen: list[int] = [] + seen_lock = threading.Lock() + stop = threading.Event() + expected_total = total + + def producer(pid: int) -> None: + for n in nodes[pid]: + bq.bq_enqueue(q, n) + + def dequeue_consumer() -> None: + while not stop.is_set(): + v = bq.bq_dequeue(q) + if v is None: + continue + with seen_lock: + seen.append(v) + + def dequeue_all_consumer() -> None: + while not stop.is_set(): + chunk = bq.bq_dequeue_all(q) + if chunk: + with seen_lock: + seen.extend(chunk) + + prods = [threading.Thread(target=producer, args=(p,)) + for p in range(producers)] + cons1 = threading.Thread(target=dequeue_consumer) + cons2 = threading.Thread(target=dequeue_all_consumer) + cons1.start() + cons2.start() + for t in prods: + t.start() + for t in prods: + t.join() + + # Drain remainder under stop signal. + # Spin until consumers report all values seen, then stop them. + import time + deadline = time.monotonic() + 30.0 + while time.monotonic() < deadline: + with seen_lock: + if len(seen) >= expected_total: + break + time.sleep(0.005) + stop.set() + cons1.join() + cons2.join() + + # Final mop-up in case the consumer threads exited mid-segment. + while True: + v = bq.bq_dequeue(q) + if v is None: + break + seen.append(v) + + assert len(seen) == expected_total, ( + f"lost or duplicated values: got {len(seen)}, expected {expected_total}" + ) + assert sorted(seen) == list(range(expected_total)), ( + "values do not form 0..N-1 — duplication or corruption" + ) + + # Note: we deliberately do NOT assert per-producer FIFO on `seen`. + # Even though MPMCQ preserves enqueue order at the dequeue point, + # `seen` is appended under a lock by two concurrent consumers, so + # its order reflects lock-acquisition order, not dequeue order. + # The invariant under test is that every value appears exactly + # once — no losses, no duplicates. + + assert bq.bq_is_empty(q) diff --git a/test/test_internal_wsq.py b/test/test_internal_wsq.py new file mode 100644 index 0000000..dbd1cdc --- /dev/null +++ b/test/test_internal_wsq.py @@ -0,0 +1,124 @@ +"""Unit tests for the inline ``boc_wsq_*`` helpers in ``sched.h``. + +These tests exercise the work-stealing-queue cursor arithmetic and +``enqueue_spread`` distribution invariant directly via the +``bocpy._internal_test`` C shim. They stay below the dispatch / +steal layer; full-stack scheduling correctness is covered by the +existing ``test_scheduler_*`` and ``test_boc.py`` suites. +""" + +import pytest + +_it = pytest.importorskip( + "bocpy._internal_test", + reason="internal test extension not built (set BOCPY_BUILD_INTERNAL_TESTS=1 and reinstall)", +) + + +WSQ_N = _it.wsq_n() + + +# --------------------------------------------------------------------------- +# Cursor arithmetic +# --------------------------------------------------------------------------- + + +def test_pre_inc_uniform_over_full_cycles(): + """`boc_wsq_pre_inc` must distribute uniformly over k = N * K calls.""" + K = 1000 # noqa: N806 + counts = _it.wsq_pre_inc_histogram(WSQ_N * K) + assert counts == [K] * WSQ_N, ( + f"non-uniform distribution: {counts}") + + +def test_pre_inc_first_indices(): + """First N pre-increments must visit indices 1, 2, ..., N-1, 0.""" + counts = _it.wsq_pre_inc_histogram(WSQ_N) + # Every index hit exactly once over a full cycle (regardless of order). 
+ assert counts == [1] * WSQ_N + + +def test_pre_inc_partial_cycle_within_bounds(): + """A partial cycle hits a contiguous prefix of indices.""" + # k = N - 1: indices 1..N-1 each receive 1, index 0 receives 0. + counts = _it.wsq_pre_inc_histogram(WSQ_N - 1) + assert counts[0] == 0 + for i in range(1, WSQ_N): + assert counts[i] == 1, f"index {i} received {counts[i]}" + + +def test_post_dec_first_returns_zero_then_wraps(): + """`boc_wsq_post_dec` returns the *pre*-decrement index.""" + seq = _it.wsq_post_dec_sequence(WSQ_N + 2) + # First call: cursor was 0 -> returns 0, advances to N-1. + assert seq[0] == 0 + # Then N-1, N-2, ..., 0 (wrap), N-1, N-2. + expected = [0] + list(range(WSQ_N - 1, -1, -1)) + [WSQ_N - 1] + assert seq == expected[: len(seq)] + + +# --------------------------------------------------------------------------- +# Single-node enqueue distribution +# --------------------------------------------------------------------------- + + +def test_enqueue_round_robin_full_cycles(): + """N*K single pushes hit every sub-queue exactly K times.""" + K = 256 # noqa: N806 + w = _it.wsq_make_worker() + counts = _it.wsq_enqueue_drain_counts(w, WSQ_N * K) + assert counts == [K] * WSQ_N, ( + f"enqueue did not round-robin uniformly: {counts}") + + +def test_enqueue_partial_cycle_distribution(): + """A non-multiple-of-N push count distributes within ±1 across sub-queues.""" + K = 7 # noqa: N806 7 pushes, N=4 -> [1, 2, 2, 2] in some rotation. + w = _it.wsq_make_worker() + counts = _it.wsq_enqueue_drain_counts(w, K) + assert sum(counts) == K + # Max-min must be <= 1: round-robin gives near-uniform. + assert max(counts) - min(counts) <= 1 + + +def test_enqueue_zero_pushes_leaves_all_empty(): + """Zero pushes leaves every sub-queue empty.""" + w = _it.wsq_make_worker() + counts = _it.wsq_enqueue_drain_counts(w, 0) + assert counts == [0] * WSQ_N + + +# --------------------------------------------------------------------------- +# enqueue_spread distribution invariant +# --------------------------------------------------------------------------- + + +def test_spread_preserves_total_count(): + """All L nodes from a stolen segment land somewhere across the WSQ.""" + for length in (1, 2, 3, WSQ_N, WSQ_N + 1, 4 * WSQ_N, 100): + w = _it.wsq_make_worker() + counts = _it.wsq_spread_segment_counts(w, length) + assert sum(counts) == length, ( + f"length={length}: spread lost nodes: {counts}") + + +def test_spread_distributes_long_segment_uniformly(): + """A long segment fills every sub-queue (no sub-queue is starved).""" + length = 4 * WSQ_N + w = _it.wsq_make_worker() + counts = _it.wsq_spread_segment_counts(w, length) + assert sum(counts) == length + # Every sub-queue must receive at least one node. + assert all(c >= 1 for c in counts), ( + f"some sub-queue starved: {counts}") + # Spread is near-uniform: max-min <= 1 for an exact multiple of N. 
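+    # e.g. WSQ_N == 4, length == 16 -> roughly [4, 4, 4, 4].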
+ assert max(counts) - min(counts) <= 1, ( + f"long-segment spread non-uniform: {counts}") + + +def test_spread_singleton_segment_lands_on_one_subqueue(): + """A length-1 segment results in a single sub-queue holding 1 node.""" + w = _it.wsq_make_worker() + counts = _it.wsq_spread_segment_counts(w, 1) + assert sum(counts) == 1 + assert counts.count(1) == 1 diff --git a/test/test_matrix.py b/test/test_matrix.py index 963206d..7a0c07d 100644 --- a/test/test_matrix.py +++ b/test/test_matrix.py @@ -3,9 +3,10 @@ import math import random -from bocpy import Cown, Matrix import pytest +from bocpy import Cown, Matrix + # --------------------------------------------------------------------------- # Fixtures – fuzzed inputs covering a range of matrix sizes diff --git a/test/test_message_queue.py b/test/test_message_queue.py index 0449ec7..01f2d0a 100644 --- a/test/test_message_queue.py +++ b/test/test_message_queue.py @@ -20,9 +20,10 @@ import threading import time -from bocpy import drain, receive, send, set_tags, TIMEOUT import pytest +from bocpy import drain, receive, send, set_tags, TIMEOUT + # --------------------------------------------------------------------------- # Constants @@ -269,6 +270,44 @@ def test_receive_non_str_in_tag_list(self): with pytest.raises(TypeError): receive([123], 0) + def test_send_unpaired_surrogate_tag_no_leak(self): + """send() with a tag containing an unpaired surrogate fails cleanly. + + Regression test: ``tag_from_PyUnicode`` previously leaked + the @c BOCTag struct when @c PyUnicode_AsUTF8AndSize + raised on surrogate input. After the fix the partial + allocation is freed before the function returns NULL, and + the caller (``boc_message_new`` / ``get_queue_for_tag``) + propagates the @c UnicodeEncodeError without wedging the + slot in @c ASSIGNED-with-NULL-tag state. We then prove the + slot is still usable by sending a normal tag through the + queue afterwards. + """ + bad_tag = "\ud800" # lone high surrogate + with pytest.raises(UnicodeEncodeError): + send(bad_tag, "payload") + # Sanity: the queue subsystem is still functional after the + # failed attempt. + send("post_surrogate_ok", "ok") + _, val = receive("post_surrogate_ok", 1) + assert val == "ok" + + def test_set_tags_unpaired_surrogate_no_leak(self): + """set_tags() with a surrogate tag fails cleanly mid-loop. + + Companion to ``test_send_unpaired_surrogate_tag_no_leak``: + exercises the @c _core_set_tags caller of + @c tag_from_PyUnicode, which is the second path that the + leak fix has to cover. We only assert that the + @c UnicodeEncodeError propagates without crashing / + deadlocking — set_tags' partial-failure recovery + semantics are tracked separately. + """ + with pytest.raises(UnicodeEncodeError): + set_tags(["ok_tag", "\ud800"]) + # Restore queues to a usable state for the rest of the suite. 
+ set_tags([]) + # =================================================================== # Queue isolation diff --git a/test/test_noticeboard.py b/test/test_noticeboard.py index 456d334..f85efc3 100644 --- a/test/test_noticeboard.py +++ b/test/test_noticeboard.py @@ -2,12 +2,13 @@ from functools import partial +import pytest + from bocpy import (Cown, drain, notice_delete, notice_read, notice_sync, notice_update, notice_write, noticeboard, noticeboard_version, receive, REMOVED, send, start, TIMEOUT, wait, when) import bocpy._core as _core -import pytest RECEIVE_TIMEOUT = 10 diff --git a/test/test_scheduler_integration.py b/test/test_scheduler_integration.py new file mode 100644 index 0000000..eed5864 --- /dev/null +++ b/test/test_scheduler_integration.py @@ -0,0 +1,199 @@ +"""Integration tests for the per-worker scheduler. + +The data-structure-level coverage of the queue / WSQ primitives +lives in ``test_internal_wsq.py`` and ``test_internal_mpmcq.py`` +and exercises the C primitives directly via ``_internal_test``. +This file covers behaviours that can only be validated end-to-end +through the public ``@when`` surface or through the production +``_core.scheduler_*`` endpoints: + +- **Runtime re-entry**: ``start()`` / ``wait()`` / ``start()`` must + complete two independent workloads without leaks. +- **Paired-release contract**: an uncaught exception inside an + ``@when`` body must still release the cown so a follow-on + ``@when`` on the same cown is scheduled and runs. +- **Over-registration contract**: an extra ``scheduler_worker_register()`` + beyond ``worker_count`` must raise ``RuntimeError`` rather than + silently corrupt state. + +A prior set of timing-dependent tests (per-worker TLS coverage of +the ``pushed_local`` path, parked-peer CPU/wall ratio, parked-worker +wake latency) lived here and were removed: each asserted a property +that depends on OS scheduler behaviour rather than on bocpy code +under test, and each was repeatedly flaky on CI runners. The +underlying mechanisms (pending-eviction, parking, cross-worker +wake) are exercised end-to-end by every benchmark in ``examples/`` +— a regression there would deadlock or starve the benchmark suite +long before any threshold-based test would surface a clean failure. + +All tests use module-level classes/helpers (workers run in +sub-interpreters and import the test module to resolve symbols). +""" + +import pytest + +import bocpy +from bocpy import _core +from bocpy import Cown, drain, receive, send, TIMEOUT, wait, when + + +RECEIVE_TIMEOUT = 30 + + +# --------------------------------------------------------------------------- +# Module-level helpers (must be importable by worker sub-interpreters) +# --------------------------------------------------------------------------- + + +class _Counter: + """Plain counter used as cown payload in chain workloads.""" + + __slots__ = ("count",) + + def __init__(self): + """Initialise the counter at zero.""" + self.count = 0 + + +def _ensure_quiesced(): + """Tear down any prior runtime so the test starts from a clean state. + + ``bocpy.wait()`` is a no-op when ``BEHAVIORS`` is ``None``; if a + previous test left the runtime up it drains and stops it. 
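+
+    A typical bracketing pattern in this file (sketch only; the
+    names are the real public API, the body is illustrative)::
+
+        _ensure_quiesced()             # prior runtime torn down
+        bocpy.start(worker_count=2)    # fresh pool for this test
+        try:
+            ...                        # schedule @when work, receive()
+        finally:
+            wait()                     # drain and stop again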
+    """
+    bocpy.wait()
+
+
+# ---------------------------------------------------------------------------
+# Runtime re-entry
+# ---------------------------------------------------------------------------
+
+
+class TestRuntimeReentry:
+    """``start()`` / ``wait()`` / ``start()`` runs two clean workloads."""
+
+    @classmethod
+    def teardown_class(cls):
+        wait()
+        drain("done")
+
+    def test_start_wait_start_runs_two_workloads(self):
+        """Two independent workloads bracketed by start/wait/start/wait.
+
+        The worker pool, terminator, and per-worker queues all spin
+        up cleanly on a second ``start()`` after a prior ``wait()``
+        tore the runtime down. A workload that hangs or drops
+        messages on the second run indicates state leaked across the
+        cycle.
+        """
+        _ensure_quiesced()
+
+        # First workload.
+        bocpy.start(worker_count=2)
+        try:
+            c = Cown(_Counter())
+            for _ in range(50):
+                @when(c)
+                def _(c):
+                    c.value.count += 1
+                    send("done", c.value.count)
+            for _ in range(50):
+                tag, _payload = receive("done", RECEIVE_TIMEOUT)
+                assert tag != TIMEOUT, "first workload stalled"
+        finally:
+            drain("done")
+            wait()
+
+        assert _core.scheduler_stats() == []
+
+        # Second workload after teardown — must come up clean.
+        bocpy.start(worker_count=2)
+        try:
+            c = Cown(_Counter())
+            for _ in range(50):
+                @when(c)
+                def _(c):
+                    c.value.count += 1
+                    send("done", c.value.count)
+            for _ in range(50):
+                tag, _payload = receive("done", RECEIVE_TIMEOUT)
+                assert tag != TIMEOUT, "second workload stalled"
+        finally:
+            drain("done")
+            wait()
+
+
+# ---------------------------------------------------------------------------
+# Paired-release on uncaught body exception
+# ---------------------------------------------------------------------------
+
+
+def _raising_step(c):
+    """Body that raises ``RuntimeError`` after touching the cown."""
+    @when(c)
+    def _(c):
+        c.value.count += 1
+        raise RuntimeError("intentional failure")
+
+
+def _follow_on(c):
+    """Follow-on behaviour that must observe the cown re-acquirable."""
+    @when(c)
+    def _(c):
+        c.value.count += 1
+        send("done", c.value.count)
+
+
+class TestPairedRelease:
+    """An uncaught body exception must still release the cown."""
+
+    @classmethod
+    def teardown_class(cls):
+        wait()
+        drain("done")
+
+    def test_cown_reacquirable_after_uncaught_exception(self):
+        """A failing behaviour releases its cown so the next one runs.
+
+        ``run_behavior`` in ``worker.py`` catches ``Exception`` and
+        funnels it to ``Cown.set_exception``, then runs the
+        release/release_all pair. If the release path were broken the
+        follow-on ``@when(c)`` would block forever; the test would
+        time out on ``receive`` instead of returning a count of 2.
+        """
+        _ensure_quiesced()
+        bocpy.start(worker_count=2)
+        try:
+            c = Cown(_Counter())
+            _raising_step(c)
+            _follow_on(c)
+
+            tag, payload = receive("done", RECEIVE_TIMEOUT)
+            assert tag != TIMEOUT, (
+                "cown was not re-acquired after an uncaught exception"
+            )
+            assert payload == 2, payload
+        finally:
+            drain("done")
+            wait()
+
+
+# ---------------------------------------------------------------------------
+# Over-registration contract on scheduler_worker_register
+# ---------------------------------------------------------------------------
+
+
+def test_over_registration_raises_runtime_error():
+    """An extra register() beyond worker_count must raise RuntimeError.
+
+    With self-allocating registration, the failure mode is
+    over-registration. Production callers (``worker.py``) trust that
+    this raises rather than silently corrupting state.
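+
+    The guard this pins down, sketched in Python (the real check
+    lives in C; names and shape are assumptions, only the
+    "over-registration" diagnostic is asserted below)::
+
+        if registered >= worker_count:
+            raise RuntimeError("scheduler worker over-registration")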
+ """ + bocpy.start() + try: + # Workers have already registered; one more must fail. + with pytest.raises(RuntimeError, match="over-registration"): + _core.scheduler_worker_register() + finally: + bocpy.wait() diff --git a/test/test_scheduler_stats.py b/test/test_scheduler_stats.py new file mode 100644 index 0000000..29155ef --- /dev/null +++ b/test/test_scheduler_stats.py @@ -0,0 +1,257 @@ +"""Smoke tests for `_core.scheduler_stats()` and `_core.queue_stats()`. + +These tests verify: +- shape of the two snapshots (no crash on empty), +- that ``scheduler_stats()`` is empty when the runtime is down, +- that ``wait(stats=True)`` returns the post-session snapshot, +- that ``queue_stats()`` reflects ``set_tags`` and increments under + ``send`` / ``receive``, +- monotonicity across two consecutive snapshots, +- that calling either accessor has no observable side effects on the + next snapshot's counters. +""" + +import bocpy +from bocpy import _core, Cown, drain, receive, send, set_tags, wait, when + + +SCHEDULER_FIELDS = { + "worker_index", + "pushed_local", + "dispatched_to_pending", + "pushed_remote", + "popped_local", + "popped_via_steal", + "enqueue_cas_retries", + "dequeue_cas_retries", + "batch_resets", + "steal_attempts", + "steal_failures", + "parked", + "last_steal_attempt_ns", + "fairness_arm_fires", +} + +QUEUE_FIELDS = { + "queue_index", + "tag", + "enqueue_cas_retries", + "dequeue_cas_retries", + "pushed_total", + "popped_total", +} + + +def test_scheduler_stats_empty_when_runtime_down(): + """With the runtime down, the snapshot must be an empty list.""" + wait() # ensure runtime is down + stats = _core.scheduler_stats() + assert isinstance(stats, list) + assert stats == [] + + +def test_wait_returns_final_snapshot(): + """`wait(stats=True)` returns the post-session snapshot. + + `_core.scheduler_stats()` after `wait()` is empty because the + per-worker array has been freed; `wait(stats=True)` is the + correct way to read the counters for the session that just + ended. + """ + wait() # baseline + W = 2 # noqa: N806 + bocpy.start(worker_count=W) + c = Cown(0) + + @when(c) + def _(c): + send("swt_done", 1) + + tag, _payload = receive("swt_done", 5.0) + assert tag == "swt_done" + + snapshot = wait(stats=True) + assert isinstance(snapshot, list) + assert len(snapshot) == W, snapshot + for s in snapshot: + assert SCHEDULER_FIELDS == set(s.keys()), s + # At least one push happened across the pool. + assert sum(s["pushed_local"] + s["dispatched_to_pending"] + + s["pushed_remote"] for s in snapshot) >= 1 + # And the per-worker array is gone now. + assert _core.scheduler_stats() == [] + + +def test_wait_stats_default_returns_none(): + """`wait()` without `stats=True` returns ``None`` (back-compat).""" + wait() + assert wait() is None + # Even with a real session, default still returns None. + bocpy.start(worker_count=2) + c = Cown(0) + + @when(c) + def _(c): + send("swt_default_done", 1) + + receive("swt_default_done", 5.0) + assert wait() is None + + +def test_wait_stats_true_when_runtime_never_started(): + """`wait(stats=True)` returns ``[]`` when no runtime exists.""" + wait() + assert wait(stats=True) == [] + + +def test_off_worker_dispatch_bumps_pushed_remote_not_pending(): + """Main-thread `@when` dispatches use the off-worker (remote) arm. + + `boc_sched_dispatch`'s off-worker arm (`current_worker == NULL`) + bumps `pushed_remote` on the round-robin target. 
It never + touches the producer-local `pending` slot, so the resulting + snapshot must show `sum(pushed_remote) >= N` and + `sum(dispatched_to_pending) == 0`. + """ + wait() + W = 4 # noqa: N806 + N = 16 # noqa: N806 + bocpy.start(worker_count=W) + cowns = [Cown(0) for _ in range(N)] + for c in cowns: + @when(c) + def _(c): + send("opp_done", 1) # noqa: B023 + + for _ in range(N): + tag, _payload = receive("opp_done", 5.0) + assert tag == "opp_done" + + snap = wait(stats=True) + total_remote = sum(s["pushed_remote"] for s in snap) + total_pending = sum(s["dispatched_to_pending"] for s in snap) + assert total_remote >= N, snap + # No producer-local arm was ever taken, so pending stays at 0. + assert total_pending == 0, snap + + +def test_dispatched_to_pending_increments_from_worker_dispatch(): + """A worker-side `@when` against a fresh cown bumps `dispatched_to_pending`. + + Inside a behavior body `current_worker != NULL`, so dispatch + enters the producer-local arm. With nothing already in the + worker's `pending` slot the dispatch falls through the + "install into empty pending" branch and bumps + `dispatched_to_pending` (not `pushed_local`). With one chained + dispatch per outer behavior across N outers, the snapshot + must show `sum(dispatched_to_pending) >= N`. + """ + wait() + W = 2 # noqa: N806 + N = 32 # noqa: N806 + bocpy.start(worker_count=W) + outers = [Cown(0) for _ in range(N)] + inners = [Cown(0) for _ in range(N)] + for o, i in zip(outers, inners): + @when(o) + def _(o): + @when(i) # noqa: B023 + def _inner(i): + send("ppi_done", 1) + + for _ in range(N): + tag, _payload = receive("ppi_done", 5.0) + assert tag == "ppi_done" + + snap = wait(stats=True) + total_pending = sum(s["dispatched_to_pending"] for s in snap) + assert total_pending >= N, snap + + +def test_queue_stats_reflects_set_tags_and_traffic(): + """`queue_stats` should expose tagged queues with monotonic counters.""" + set_tags(["t_one", "t_two"]) + # Drain in case a previous test sent on these tags. + drain(["t_one", "t_two"]) + + before = _core.queue_stats() + by_tag_before = {q["tag"]: q for q in before} + assert "t_one" in by_tag_before + assert "t_two" in by_tag_before + for q in before: + assert QUEUE_FIELDS == set(q.keys()) + assert isinstance(q["queue_index"], int) + assert isinstance(q["pushed_total"], int) + assert isinstance(q["popped_total"], int) + assert q["pushed_total"] >= 0 + assert q["popped_total"] >= 0 + + pushed_before = by_tag_before["t_one"]["pushed_total"] + popped_before = by_tag_before["t_one"]["popped_total"] + + send("t_one", "alpha") + send("t_one", "beta") + msg = receive("t_one") + assert msg == ("t_one", "alpha") + + after = _core.queue_stats() + by_tag_after = {q["tag"]: q for q in after} + assert by_tag_after["t_one"]["pushed_total"] == pushed_before + 2 + assert by_tag_after["t_one"]["popped_total"] == popped_before + 1 + # Other tag must not move. + assert (by_tag_after["t_two"]["pushed_total"] + == by_tag_before["t_two"]["pushed_total"]) + assert (by_tag_after["t_two"]["popped_total"] + == by_tag_before["t_two"]["popped_total"]) + + +def test_queue_stats_monotonic_and_no_side_effect(): + """Calling the snapshots must not perturb the counters.""" + set_tags(["t_idle"]) + drain(["t_idle"]) + + snap1 = _core.queue_stats() + snap2 = _core.queue_stats() + snap3 = _core.queue_stats() + + by_tag = lambda snap: {q["tag"]: q for q in snap} # noqa: E731 + s1 = by_tag(snap1) + s2 = by_tag(snap2) + s3 = by_tag(snap3) + + # No traffic between snapshots → counters are stable. 
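+    # Spelled out, the invariant below is: for every tag t and each
+    # counter k in {pushed_total, popped_total},
+    #     s1[t][k] == s2[t][k] == s3[t][k]
+    # i.e. a snapshot is a pure read, never a read-modify-write.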
+ for tag in s1: + assert s2[tag]["pushed_total"] == s1[tag]["pushed_total"] + assert s2[tag]["popped_total"] == s1[tag]["popped_total"] + assert s3[tag]["pushed_total"] == s1[tag]["pushed_total"] + assert s3[tag]["popped_total"] == s1[tag]["popped_total"] + + # And calling scheduler_stats does not perturb queue_stats either. + _ = _core.scheduler_stats() + snap4 = _core.queue_stats() + s4 = by_tag(snap4) + for tag in s1: + assert s4[tag]["pushed_total"] == s1[tag]["pushed_total"] + assert s4[tag]["popped_total"] == s1[tag]["popped_total"] + + +def test_drain_does_not_decrement_pushed_or_popped_total(): + """`drain` must clear messages without decrementing the counters. + + The counters track *cumulative* traffic for the lifetime of the + process; drain is an administrative operation, not a dequeue. + """ + set_tags(["t_drain"]) + drain(["t_drain"]) + + send("t_drain", "x") + send("t_drain", "y") + + before = next(q for q in _core.queue_stats() if q["tag"] == "t_drain") + drain(["t_drain"]) + after = next(q for q in _core.queue_stats() if q["tag"] == "t_drain") + + # Drain pulls the messages out via boc_dequeue, so popped_total + # advances. pushed_total must not retreat. + assert after["pushed_total"] == before["pushed_total"] + assert after["popped_total"] >= before["popped_total"] diff --git a/test/test_scheduler_steal.py b/test/test_scheduler_steal.py new file mode 100644 index 0000000..4afa234 --- /dev/null +++ b/test/test_scheduler_steal.py @@ -0,0 +1,252 @@ +"""Work-stealing tests. + +These tests exercise the work-stealing path end-to-end through the +public ``@when`` surface and the ``_core.scheduler_stats`` accessor. +They are the integration-level coverage for stealing; the C-API +unit coverage of try_steal/steal lives in +``test_scheduler_pertask_queue.py``. + +What the tests assert: + +- **Token-work fairness sanity** — a fan-out workload whose size + comfortably exceeds ``BATCH_SIZE`` must produce at least one + ``steal_attempts`` entry across the worker set, demonstrating the + fairness arm (or the empty-queue arm) of ``pop_slow`` fires under + realistic load. +- **Empty-queue race** — starting the runtime with W workers and no + work must converge to every worker parked and the process CPU/wall + ratio must stay well below 1 (no busy-spinning thieves). +- **Spurious-failure stress** — placeholder; activated when bocpy is + built with ``-DBOC_SCHED_SYSTEMATIC`` (Verona-style fault-injection + in the queue links). The flag is off in default builds, so the + test is skipped here. + +Tests that asserted timing-dependent outcomes (``popped_via_steal > +0`` after a pinned fan-out, ``fairness_arm_fires >= N`` on a busy +worker) were removed because their pass/fail depends on OS scheduler +behaviour rather than on bocpy code under test; the underlying +mechanisms are exercised end-to-end by the benchmarks in +``examples/`` and at the data-structure level by +``test_internal_wsq.py``. + +All tests follow the same module-level helper / receive-pattern +discipline as the other scheduler integration tests (see +``test_scheduler_integration.py``), because behaviours run on +worker sub-interpreters that import this module to resolve symbols. 
+""" + +import time + +import pytest + +import bocpy +from bocpy import _core +from bocpy import Cown, drain, receive, send, TIMEOUT, wait, when + + +RECEIVE_TIMEOUT = 30 + + +# --------------------------------------------------------------------------- +# Module-level helpers (must be importable by worker sub-interpreters) +# --------------------------------------------------------------------------- + + +class _Counter: + """Plain counter used as cown payload in fan-out workloads.""" + + __slots__ = ("count",) + + def __init__(self): + """Initialise the counter at zero.""" + self.count = 0 + + +def _ensure_quiesced(): + """Tear down any prior runtime so the test starts from a clean state.""" + bocpy.wait() + + +def _fanout_done(c_pin, marker): + """Final ``@when`` extracted to a helper. + + Inlining inside ``_fanout_kickoff`` would trigger the transpiler + nested-capture gap (outer ``marker`` not forwarded into the inner + behaviour's capture tuple). + """ + @when(c_pin) + def _(c_pin): + send("done", marker) + + +def _fanout_kickoff(c_pin, work_cowns, marker): + """Fan ``len(work_cowns)`` independent behaviours onto the kickoff worker. + + The kickoff is dispatched from the main thread and lands on + whichever worker the off-worker round-robin cursor points at. + Inside the body the worker dispatches one trivial behaviour per + ``work_cowns`` entry; because every entry is independent (no + MCS contention) each dispatch reaches ``boc_sched_dispatch`` + immediately on the producer-local arm. The first lands in + ``pending``; every subsequent dispatch evicts the prior pending + into the worker's local queue. After roughly ``BATCH_SIZE`` + items the queue is the only source of work, and idle peers can + steal from it. + """ + @when(c_pin) + def _(c_pin): + for wc in work_cowns: + @when(wc) + def _(wc): + wc.value.count += 1 + _fanout_done(c_pin, marker) + + +# --------------------------------------------------------------------------- +# Token-work fairness sanity +# --------------------------------------------------------------------------- + + +class TestStealFairnessSanity: + """A workload bigger than BATCH_SIZE must exercise the steal arm.""" + + @classmethod + def teardown_class(cls): + wait() + drain("done") + + def test_fanout_exceeding_batch_size_provokes_steal_attempts(self): + """K > BATCH_SIZE (=100) must produce non-zero steal_attempts. + + ``BATCH_SIZE`` is the consumer-side budget at which the + fast-path bypasses ``pending`` to take from the queue. With + K=300 fan-out items pinned to one worker, the kickoff worker + cycles through its budget at least three times, and idle + peers pass through ``pop_slow`` repeatedly looking for work. + Every peer entry into the slow path bumps either the + fairness arm or the empty-queue arm of ``boc_sched_steal``, + which in turn calls ``boc_sched_try_steal`` (one attempt per + ring victim per round). The aggregate must be non-zero. 
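+
+        Back-of-envelope with the numbers used below (K = 300,
+        BATCH_SIZE = 100, W = 4)::
+
+            300 / 100 = 3 full consumer budgets on the kickoff
+            worker, so the W - 1 = 3 idle peers re-enter pop_slow
+            many times while its queue is still non-empty; each
+            entry counts at least one steal attempt.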
+        """
+        _ensure_quiesced()
+        W = 4  # noqa: N806
+        K = 300  # > BOC_BQ_BATCH_SIZE (100)  # noqa: N806
+        bocpy.start(worker_count=W)
+        try:
+            c_pin = Cown(_Counter())
+            work_cowns = [Cown(_Counter()) for _ in range(K)]
+            _fanout_kickoff(c_pin, work_cowns, "fairness-done")
+
+            tag, _payload = receive("done", RECEIVE_TIMEOUT)
+            assert tag != TIMEOUT, "kickoff failed to complete"
+
+            stats = _core.scheduler_stats()
+        finally:
+            drain("done")
+            wait()
+
+        assert len(stats) == W, stats
+        total_attempts = sum(s["steal_attempts"] for s in stats)
+        # The exact distribution of attempts across workers depends
+        # on scheduling races; we only assert the aggregate is
+        # non-zero. ``last_steal_attempt_ns`` on at least one worker
+        # must also be non-zero (it's stamped on every try_steal
+        # entry).
+        assert total_attempts > 0, (
+            f"no steal_attempts recorded — fairness/empty-queue arms "
+            f"never fired: {stats}"
+        )
+        nonzero_ts = [s for s in stats if s["last_steal_attempt_ns"] > 0]
+        assert len(nonzero_ts) > 0, (
+            f"no worker's last_steal_attempt_ns was set: {stats}"
+        )
+
+
+# ---------------------------------------------------------------------------
+# Empty-queue race: workers with no work must park
+# ---------------------------------------------------------------------------
+
+
+class TestStealEmptyQueueNoSpin:
+    """W workers, 0 work — every worker must park in cnd_wait."""
+
+    @classmethod
+    def teardown_class(cls):
+        wait()
+
+    @pytest.mark.skipif(
+        not hasattr(time, "process_time"),
+        reason="needs time.process_time for CPU accounting",
+    )
+    def test_empty_queue_does_not_spin(self):
+        """Bring the runtime up with W=4 and dispatch no work.
+
+        Every worker enters ``pop_slow``, finds its own queue empty,
+        loops one round of ``boc_sched_steal`` against peers (also
+        empty), and parks under ``cv_mu``. The process CPU/wall
+        ratio over a fixed window must stay well below 1: a single
+        spinning thief alone would push the ratio toward 1, and four
+        spinning thieves would push it toward W. We assert
+        ``< 0.5`` to tolerate main-thread overhead, the noticeboard
+        thread, and sub-interpreter startup costs.
+
+        The cumulative ``parked`` counter on every worker must be
+        non-zero at the end of the window (each worker reached the
+        ``cnd_wait`` arm at least once).
+        """
+        _ensure_quiesced()
+        W = 4  # noqa: N806
+        bocpy.start(worker_count=W)
+        try:
+            # Brief warm-up so workers actually reach pop_slow and
+            # commit to parking before we start measuring.
+            time.sleep(0.05)
+
+            wall_start = time.monotonic()
+            cpu_start = time.process_time()
+            time.sleep(0.30)
+            wall_elapsed = time.monotonic() - wall_start
+            cpu_elapsed = time.process_time() - cpu_start
+
+            stats = _core.scheduler_stats()
+        finally:
+            wait()
+
+        ratio = cpu_elapsed / wall_elapsed
+        assert ratio < 0.5, (
+            f"CPU/wall ratio = {ratio:.2f} (cpu={cpu_elapsed:.3f}s, "
+            f"wall={wall_elapsed:.3f}s) over an idle window — "
+            f"workers are not parking"
+        )
+
+        assert len(stats) == W, stats
+        for s in stats:
+            assert s["parked"] > 0, (
+                f"worker {s['worker_index']} never reached cnd_wait "
+                f"in an idle runtime: {s}"
+            )
+
+
+# ---------------------------------------------------------------------------
+# Spurious-failure stress (gated on the systematic-test build flag)
+# ---------------------------------------------------------------------------
+
+
+class TestStealSpuriousFailureStress:
+    """Reserved for ``-DBOC_SCHED_SYSTEMATIC`` builds.
+ + Verona's stealing path has three documented spurious-failure + modes (fully empty victim, single-element victim, first link not + yet visible). Verifying convergence under fault-injection + requires building bocpy with the ``BOC_SCHED_SYSTEMATIC`` macro, + which is off in the default editable install. When that build + flavour exists the body of this test should run 100 fan-out + iterations and assert each completes within ``RECEIVE_TIMEOUT``. + """ + + @pytest.mark.skip( + reason="needs -DBOC_SCHED_SYSTEMATIC build flag", + ) + def test_spurious_failure_stress(self): # pragma: no cover + """Placeholder; see class docstring.""" + pass diff --git a/test/test_scheduling_stress.py b/test/test_scheduling_stress.py index 70632cd..3ec0f84 100644 --- a/test/test_scheduling_stress.py +++ b/test/test_scheduling_stress.py @@ -12,10 +12,12 @@ import os from unittest import mock +import pytest + +import bocpy from bocpy import _core from bocpy import Cown, drain, receive, send, TIMEOUT, wait, when import bocpy.behaviors as _behaviors -import pytest RECEIVE_TIMEOUT = 30 @@ -588,3 +590,386 @@ def _(c): _collect_done(1) wait() assert _core.terminator_count() == 0 + + +# --------------------------------------------------------------------------- +# Chain-ring stress, parameterised over worker_count +# --------------------------------------------------------------------------- + + +class TestChainRingPerWorkerCount: + """Long ring of overlapping pair-locks under varied worker counts. + + Schedules ``ring_length`` behaviours each locking an adjacent + ``(c[i], c[(i+1) % ring_length])`` pair against a 64-cown ring. + Two-phase locking over the worker-count parameterisation + ({1, 2, 4, 8}) exercises the dispatch / pop / 2PL-handoff paths + under both serialised and parallel regimes; a regression in the + per-worker queue or MCS handoff would manifest as a leak or a + missing increment. + + Each parameterised run quiesces the runtime first so the + explicit ``worker_count`` actually takes effect — auto-start + would otherwise reuse whatever ``WORKER_COUNT`` defaulted to. + """ + + @classmethod + def teardown_class(cls): + wait() + _drain_done() + + @pytest.mark.parametrize("worker_count", [1, 2, 4, 8]) + def test_chain_ring(self, worker_count: int): + """Ring of pair-locks completes cleanly at every worker count. + + Each behaviour increments both adjacent counters; total sum + across the ring must equal ``2 * ring_length``. After + ``wait()`` the terminator must return to zero — any leaked + hold from the dispatch path (forgotten ``terminator_inc`` + rollback, skipped ``terminator_dec`` on a worker error path, + etc.) would surface here as a non-zero count. + + Also asserts the work-conservation floor on the stats + snapshot: ``sum(popped_local + popped_via_steal) >= + ring_length``. The proportion between local pops and steals + is intentionally *not* asserted — that ratio depends on OS + scheduler behaviour (worker wake-up order, sub-interpreter + startup latency, kernel pre-emption) and was previously a + source of CI flakiness. + """ + wait() + bocpy.start(worker_count=worker_count) + try: + ring_size = 64 + ring_length = 10_000 + cowns = [Cown(Counter()) for _ in range(ring_size)] + + for i in range(ring_length): + a = cowns[i % ring_size] + b = cowns[(i + 1) % ring_size] + + @when(a, b) + def _(a, b): + a.value.count += 1 + b.value.count += 1 + + # Read each counter back through a behaviour so the test + # thread observes the final value after all increments + # have committed. 
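+            # (How the read-back acts as the barrier, under the 2PL
+            #  model described in the class docstring: a @when on c
+            #  cannot start until every earlier @when holding c has
+            #  released it, so the count each behaviour reports below
+            #  is final.)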
+ for idx, c in enumerate(cowns): + @when(c) + def _(c): + send("done", (idx, c.value.count)) # noqa: B023 + + results = _collect_done(ring_size) + total = sum(count for _, count in results) + assert total == 2 * ring_length, (worker_count, results) + finally: + _drain_done() + # `wait(stats=True)` returns the snapshot captured before + # the per-worker array is freed, so we don't need a + # pre-wait `_core.scheduler_stats()` call. + stats = wait(stats=True) + assert _core.terminator_count() == 0 + + assert len(stats) == worker_count, stats + total_local = sum(s["popped_local"] for s in stats) + total_stolen = sum(s["popped_via_steal"] for s in stats) + total_pops = total_local + total_stolen + # Every behaviour that completes was popped exactly once, so + # `total_pops` must reach the dispatched count. We don't need + # an exact equality (last-mile read-back behaviours and the + # warm-up handshake also count), only a sanity floor. + assert total_pops >= ring_length, ( + f"W={worker_count}: only {total_pops} pops recorded " + f"for {ring_length} dispatched behaviours" + ) + + +# --------------------------------------------------------------------------- +# Orphan-drain mitigation: set_drop_exception on stop()-orphaned results +# --------------------------------------------------------------------------- + + +class TestOrphanDropException: + """Verify the orphan-drain mitigation surfaces RuntimeError on result Cowns. + + Behaviors orphaned during ``stop()`` surface a + :class:`RuntimeError` on their result Cown so callers awaiting + ``cown.value`` / ``cown.exception`` after teardown see a + diagnostic instead of a permanent ``None``. + + Two layers of coverage: + + 1. ``test_set_drop_exception_marks_result_cown`` — direct C-method + unit test. Constructs a :class:`_core.BehaviorCapsule` without + scheduling it, calls ``set_drop_exception`` on it, then verifies + the result Cown's value/exception state matches the worker + exception path (``acquire`` → set value → ``exception = True`` + → ``release``). + + 2. ``test_drain_orphan_invokes_set_drop_exception`` — wiring test + for ``Behaviors._drain_orphan_behaviors``. Mocks + ``_core.scheduler_drain_all_queues`` to return a fake capsule + once, then verifies the drain path invokes both + ``set_drop_exception`` and ``release_all`` on it. + """ + + @classmethod + def teardown_class(cls): + wait() + _drain_done() + + def test_set_drop_exception_marks_result_cown(self): + """C-method: ``set_drop_exception`` writes value and flag, leaves cown released.""" + # Drive the runtime to a known state and ensure it is alive + # (BehaviorCapsule construction touches per-module C state). + wait() + from bocpy import start as _start_runtime + _start_runtime() + + result = Cown(None) + arg = Cown(Counter()) + # Construct a BehaviorCapsule without scheduling it. The thunk + # name does not need to resolve because we never call + # ``execute`` — set_drop_exception only touches the result + # cown. + capsule = _core.BehaviorCapsule( + "__behavior_never_called__", + result.impl, + [(1, arg.impl)], + [], + ) + + drop = RuntimeError("orphaned during stop()") + capsule.set_drop_exception(drop) + + # The result Cown must now be in the published-and-released + # state with the exception flag set so a post-stop() consumer + # can acquire it and observe the failure. 
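+        # (Consumer-side mirror of that publish: acquire() blocks
+        #  until the release has completed, after which .exception
+        #  and .value are readable until the paired release().)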
+        result.acquire()
+        try:
+            assert result.exception is True, (
+                "set_drop_exception must mark the result Cown's exception flag"
+            )
+            # Value goes through xidata pickle/unpickle on release/acquire,
+            # so identity is not preserved; check type and message.
+            assert isinstance(result.value, RuntimeError), (
+                f"expected RuntimeError, got {type(result.value).__name__}"
+            )
+            assert "orphaned during stop()" in str(result.value), (
+                f"unexpected message: {result.value!r}"
+            )
+        finally:
+            result.release()
+
+    def test_drain_orphan_invokes_set_drop_exception(self):
+        """``_drain_orphan_behaviors`` calls ``set_drop_exception`` then ``release_all``."""
+        wait()
+        from bocpy import start as _start_runtime
+        _start_runtime()
+
+        # Stand in for an orphaned BehaviorCapsule with a MagicMock so
+        # the orphan-drain path can call the documented methods on it
+        # (``set_drop_exception``, ``release_all``) without a real
+        # capsule being scheduled.
+        fake_capsule = mock.MagicMock()
+        # Single-shot drain: first call returns one fake orphan, the
+        # second call returns [] so the drain loop terminates.
+        drain_results = [[fake_capsule], []]
+        with mock.patch.object(
+            _behaviors._core, "scheduler_drain_all_queues",
+            side_effect=lambda: drain_results.pop(0),
+        ), mock.patch.object(
+            _behaviors._core, "terminator_dec",
+            return_value=0,
+        ):
+            behaviors = bocpy.behaviors.BEHAVIORS
+            assert behaviors is not None, (
+                "runtime must be alive for _drain_orphan_behaviors test"
+            )
+            errors = behaviors._drain_orphan_behaviors()
+
+        assert errors == [], (
+            f"orphan drain reported unexpected errors: {errors!r}"
+        )
+        fake_capsule.set_drop_exception.assert_called_once()
+        # The argument must be a RuntimeError carrying a stop()
+        # diagnostic; the orphan drain UX contract requires the
+        # message reference "stop()" so users can grep for it.
+        sent_arg = fake_capsule.set_drop_exception.call_args[0][0]
+        assert isinstance(sent_arg, RuntimeError), (
+            f"expected RuntimeError, got {type(sent_arg).__name__}"
+        )
+        assert "stop()" in str(sent_arg), (
+            f"drop exception message must mention stop(); got {sent_arg!r}"
+        )
+        fake_capsule.release_all.assert_called_once()
+
+
+# ---------------------------------------------------------------------------
+# Dispatch after runtime stop must surface
+# ---------------------------------------------------------------------------
+
+
+class TestDispatchAfterRuntimeStop:
+    """``boc_sched_dispatch`` must raise once the runtime is torn down.
+
+    Earlier, the off-worker dispatch arm silently dropped the node
+    when ``WORKER_COUNT == 0``, leaving the ``whencall`` caller's
+    ``terminator_inc`` un-rolled-back so a subsequent ``wait()``
+    would hang. The fix:
+
+    * ``boc_sched_dispatch`` now sets a ``RuntimeError`` and returns
+      -1 on the no-runtime path,
+    * ``behavior_resolve_one`` propagates the failure (rolling back
+      the queue-owned ``BEHAVIOR_INCREF``),
+    * ``BehaviorCapsule.schedule`` propagates to ``whencall``, whose
+      ``try/except BaseException`` runs ``terminator_dec``,
+    * ``boc_sched_shutdown`` publishes ``WORKER_COUNT = 0`` with a
+      release fence and bumps ``INCARNATION`` so cached
+      ``rr_nonlocal`` TLS in off-worker producers self-invalidates.
+
+    This test exercises the full chain end-to-end.
+    """
+
+    @classmethod
+    def teardown_class(cls):
+        # Leave the runtime in a clean state for any subsequent
+        # test class.
+        wait()
+        _drain_done()
+
+    def test_schedule_after_runtime_stop_raises(self):
+        """A ``@when`` after ``scheduler_runtime_stop`` raises and rolls back."""
+        # Bring the runtime to a clean post-stop state.
+        wait()
+
+        # We need WORKER_COUNT == 0 at the C level. ``wait()`` ran
+        # ``stop_workers`` which already called ``scheduler_runtime_stop``,
+        # so the runtime is down. ``scheduler_stats()`` returns an
+        # empty list iff the per-worker array has been freed.
+        assert _core.scheduler_stats() == [], (
+            "scheduler_runtime_stop should have left WORKER_COUNT == 0"
+        )
+
+        before_count = _core.terminator_count()
+        before_seeded = _core.terminator_seeded()
+        assert before_count == 0 and before_seeded == 0, (
+            f"runtime should be quiesced; got count={before_count}, "
+            f"seeded={before_seeded}"
+        )
+
+        # Bypass the auto-start in the @when fast path by reaching
+        # whencall directly with a Cown whose runtime has been
+        # explicitly stopped. The start/wait cycle above leaves both
+        # preconditions in place (terminator closed, WORKER_COUNT == 0);
+        # we then drive a behavior through
+        # ``_core.BehaviorCapsule(...).schedule()`` directly so the
+        # auto-start gate in ``behaviors.py`` cannot wake the
+        # runtime back up between our setup and the dispatch.
+        c = Cown(Counter())
+
+        # Build a behavior capsule by hand so the auto-start path
+        # in ``@when`` does not fire. ``_core.BehaviorCapsule``
+        # takes (thunk_name, result_impl, cowns_with_groups,
+        # captures); ``cowns_with_groups`` is a list of
+        # (group_id, cown_impl) tuples mirroring whencall.
+        result = Cown(None)
+        capsule = _core.BehaviorCapsule(
+            "__nonexistent_thunk__",
+            result.impl,
+            [(1, c.impl)],
+            [],
+        )
+
+        # The terminator is closed after wait(); we must arm it for
+        # this single dispatch attempt the same way whencall would,
+        # then prove the dispatch failure rolls our hold back.
+        # terminator_inc would refuse a closed terminator, so we
+        # seed it via terminator_reset (count=1, seeded=1, closed=0)
+        # to mimic an alive runtime, while WORKER_COUNT stays at 0
+        # because we never call start().
+        prior_count, prior_seeded = _core.terminator_reset()
+        # The reset returned the post-wait quiesced state; arm the
+        # terminator for our synthetic schedule attempt.
+        rc = _core.terminator_inc()
+        assert rc >= 0, f"terminator_inc unexpectedly refused: {rc}"
+
+        try:
+            # Direct schedule. With WORKER_COUNT == 0 the off-worker
+            # dispatch arm in boc_sched_dispatch must surface a
+            # RuntimeError rather than silently dropping the node.
+            with pytest.raises(RuntimeError, match="bocpy runtime is not running"):
+                capsule.schedule()
+            # whencall's try/except in behaviors.py would now call
+            # terminator_dec; we mirror that here so the count
+            # returns to its pre-arm state.
+            _core.terminator_dec()
+        finally:
+            # Drop the seed contribution from terminator_reset and
+            # close the terminator so subsequent tests starting
+            # fresh see a clean baseline.
+            _core.terminator_seed_dec()
+            _core.terminator_close()
+
+        # All holds rolled back: count is back to 0 and the
+        # surviving runtime state is clean.
+        assert _core.terminator_count() == 0, (
+            "schedule failure must roll back the synthetic terminator hold"
+        )
+
+    def test_scheduler_runtime_stop_is_idempotent(self):
+        """Calling ``scheduler_runtime_stop`` twice is a no-op the second time.
+
+        ``Behaviors.start()`` includes a defence-in-depth ``except``
+        arm that calls ``_core.scheduler_runtime_stop()`` even when an
+        earlier abort path already called it. This is only safe if the
+        C-side stop is idempotent: a double-free of the per-worker
+        ``WORKERS`` array would corrupt the heap on the second call.
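+
+        The property only holds if the C entry point follows a
+        guarded-free shape, sketched here in Python (names
+        hypothetical; the real code is C)::
+
+            if workers is None:      # second call: nothing to free
+                return
+            free_workers(workers)    # first call: real teardown
+            workers = None
+            worker_count = 0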
+
+        The test must establish its own precondition (a real runtime
+        has run and been torn down) so it does not pass vacuously
+        under ``pytest -k`` or randomised test ordering. A bare
+        ``wait()`` with ``BEHAVIORS is None`` and ``WORKERS == NULL``
+        would short-circuit every assertion below without ever
+        exercising the second-call path the docstring claims to
+        defend.
+        """
+        # Bring the runtime down to a clean baseline.
+        wait()
+        # Force a genuine runtime cycle: schedule one behaviour so
+        # ``Behaviors.start()`` allocates ``WORKERS``, then ``wait()``
+        # again so ``stop_workers`` performs the *first* real
+        # ``scheduler_runtime_stop`` call. Without this step the
+        # idempotency assertions below would all hit the
+        # ``WORKERS == NULL`` early-out and pass vacuously.
+        c = Cown(Counter())
+
+        @when(c)
+        def _(c):
+            send("done", 1)
+
+        _collect_done(1)
+        # While the runtime is still alive, ``scheduler_stats()`` is
+        # non-empty — this proves the runtime really did come up and
+        # the next ``wait()`` will perform a load-bearing
+        # ``scheduler_runtime_stop``.
+        live_stats = _core.scheduler_stats()
+        assert live_stats, (
+            "runtime must be alive before tearing it down so the first "
+            f"scheduler_runtime_stop has work to do; got {live_stats!r}"
+        )
+        wait()
+        # First (real) call already happened inside ``stop_workers()``.
+        # The array is freed and ``scheduler_stats()`` is empty.
+        assert _core.scheduler_stats() == [], (
+            "wait() should have left WORKER_COUNT == 0"
+        )
+        # A second explicit call must be a no-op (no crash, no error).
+        _core.scheduler_runtime_stop()
+        assert _core.scheduler_stats() == [], (
+            "second scheduler_runtime_stop must leave WORKER_COUNT == 0"
+        )
+        # And a third, for good measure.
+        _core.scheduler_runtime_stop()
+        assert _core.scheduler_stats() == []
diff --git a/test/test_stop_retry_composition.py b/test/test_stop_retry_composition.py
new file mode 100644
index 0000000..c8b60df
--- /dev/null
+++ b/test/test_stop_retry_composition.py
@@ -0,0 +1,175 @@
+"""End-to-end integration test for the stop/retry composition.
+
+Scope of this file
+==================
+
+The various stop/teardown failure modes each have a dedicated
+per-link regression test elsewhere in the suite (cown-acquire
+unpickle rollback, finite-timeout stop with a slow noticeboard
+fn, start-abort-path runtime-stop pairing, off-worker dispatch
+after runtime stop, NaN/Inf timeout validation, and BaseException
+discipline in worker / orphan-drain paths).
+
+The single thing none of those tests exercises **as a unit** is
+the abort/retry path: ``stop(timeout=...)`` times out on a busy
+noticeboard thread, raises, and the runtime is then driven
+through to a clean second ``start() / @when / wait()`` cycle.
+That composition is what this file covers.
+
+We deliberately omit a payload with raising ``__setstate__``
+here because ``Behaviors.stop_workers`` walks every Python frame
+on the calling thread and calls ``acquire()`` on every ``Cown``
+it finds. In a pytest environment the test runner retains
+references to test locals, so a payload whose ``__setstate__``
+raises gets re-unpickled during teardown and crashes the second
+stop for reasons unrelated to the abort/retry path. That
+deserialisation rollback is fully exercised by its dedicated
+test.
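+
+Timeline sketch of the composition under test::
+
+    start(); notice_update(key, slow_fn)   # noticeboard thread busy
+    wait(timeout=0.1)                      # raises: join timed out
+    wait()                                 # unbounded join completes
+    start(); @when(fresh); receive(...)    # second cycle must be clean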
+""" + +import time + +import pytest + +import bocpy +from bocpy import _core +from bocpy import Cown, drain, notice_update, receive, send, TIMEOUT, wait, when + + +RECEIVE_TIMEOUT = 10 +# Slow-fn duration: long enough that ``wait(timeout=0.1)`` reliably +# hits the noticeboard-join timeout, short enough that the test does +# not bloat the suite. +SLOW_FN_SECONDS = 0.6 + + +# --------------------------------------------------------------------------- +# Module-level helpers (must be picklable across the boc_noticeboard queue). +# --------------------------------------------------------------------------- + + +def _slow_update_fn(_x): + """Sleep on the noticeboard thread, then return a fresh value. + + Picklable because it is a module-level function. The argument is + ignored -- the helper exists solely to occupy the noticeboard + thread for ``SLOW_FN_SECONDS`` so a subsequent + ``wait(timeout=0.1)`` reliably hits the noticeboard-join timeout. + """ + time.sleep(SLOW_FN_SECONDS) + return 1 + + +# --------------------------------------------------------------------------- +# Stop-timeout and retry composition test +# --------------------------------------------------------------------------- + + +class TestStopTimeoutAndRetry: + """Stop-timeout abort followed by clean retry. + + Drives the abort/retry path that no per-link unit test covers + as a unit: + + 1. ``notice_update`` posts a slow fn to the noticeboard + thread. ``wait(timeout=0.1)`` times out on the noticeboard + join and raises ``RuntimeError``. + 2. The orphan-drain mitigation runs before the raise, so + ``terminator_count`` is 0 even on the failure path. + 3. After the noticeboard fn finishes and a second ``wait()`` + drives teardown to completion, ``start()`` is called + again. The ``scheduler_runtime_stop`` pairing on the abort + paths means the new runtime does not inherit a leaked + ``WORKERS`` array from the timed-out one. + 4. A ``@when`` on the new runtime succeeds. If + ``boc_sched_dispatch`` failure were silent, this would hang + or surface a "scheduler not running" error. + """ + + @classmethod + def teardown_class(cls): + """Drain the runtime and any leftover messages.""" + wait() + drain("retry_done") + + def test_stop_timeout_then_retry(self): + """Time out on a slow noticeboard fn, then retry start() cleanly.""" + # Begin from a known-clean state. + wait() + + # ----- Step 1: schedule a slow notice_update ----- + bocpy.start(worker_count=1) + try: + notice_update("retry_key", _slow_update_fn, default=0) + # Yield long enough for the noticeboard thread to + # dequeue the update and enter ``time.sleep``. Without + # this, on a very fast machine ``wait(timeout=0.1)`` + # could race the message dequeue and the noticeboard + # thread would shut down cleanly inside the 0.1s budget. + time.sleep(0.05) + except BaseException: + try: + wait() + except Exception: + pass + raise + + # ----- Step 2: stop times out, but the orphan drain still ran ----- + with pytest.raises(RuntimeError, match="noticeboard thread did not shut down"): + wait(timeout=0.1) + + # The orphan drain ran before the raise, so the C-side + # terminator_count is back to 0. Without that drain the + # count would still reflect in-flight @when traffic and + # the next start() would diagnose terminator drift. + assert _core.terminator_count() == 0, ( + "terminator_count is non-zero after wait(timeout=0.1) " + "timed out on the noticeboard join. The orphan drain " + "did not run before the RuntimeError." 
+ ) + + # ----- Step 3: drain the slow fn, finish teardown ----- + # The retry path in ``stop()`` calls + # ``noticeboard.join(_remaining())``. We invoke ``wait()`` + # with no timeout here, so ``_remaining()`` returns ``None`` + # and the join is unbounded -- the second ``wait()`` blocks + # deterministically until the slow fn completes and the + # noticeboard thread exits, with no ``time.sleep`` slack + # required. A retry that supplied a finite ``timeout=`` would + # see a bounded join and would still need explicit + # synchronisation to guarantee the slow fn has completed. + wait() + + # ----- Step 4: fresh start + schedule ----- + # If the scheduler_runtime_stop pairing on abort paths or + # the dispatch-failure-observable change were regressed, + # this start() / @when cycle would either crash or hang. + bocpy.start(worker_count=2) + try: + self._run_fresh_when() + finally: + drain("retry_done") + wait() + + def _run_fresh_when(self): + """Schedule a @when on the second runtime and confirm it ran. + + Wrapped in a helper so the ``fresh`` Cown leaves scope + before the final ``wait()``. + """ + fresh = Cown(0) + + @when(fresh) + def _(c): + send("retry_done", ("fresh_ran", c.value)) + + tag, payload = receive("retry_done", RECEIVE_TIMEOUT) + assert tag != TIMEOUT, ( + "@when on a fresh Cown after retry never ran -- the " + "scheduler did not re-arm cleanly after the " + "timed-out stop()" + ) + assert payload == ("fresh_ran", 0), ( + f"unexpected payload {payload!r} from fresh @when; a " + "'cannot acquire cown' error here would indicate a " + "leaked owner from the prior runtime" + ) diff --git a/test/test_transpiler.py b/test/test_transpiler.py index e34791d..2f4d51e 100644 --- a/test/test_transpiler.py +++ b/test/test_transpiler.py @@ -99,6 +99,79 @@ def f(): return known """, known_vars={"known"}) == set() + +class TestCapturedNestedWhen: + """Names referenced only inside a nested @when must propagate outward. + + A nested @when is rewritten by ``WhenTransformer`` into a ``whencall(...)`` + in the outer behavior's frame, so its captures and cown arguments must be + available there. Plain nested ``def``s keep the existing opaque + treatment because Python's own closure handles them. + """ + + @staticmethod + def _captures(source, known_vars=frozenset()): + tree = ast.parse(textwrap.dedent(source)) + finder = CapturedVariableFinder(set(known_vars)) + finder.visit(tree.body[0]) + return finder.captured_vars + + def test_inner_when_capture_propagates(self): + # `marker` is referenced only inside the nested @when body, but must + # be captured by the outer behavior so the inner whencall can see it. + caps = self._captures("""\ + def outer(c): + @when(c) + def _(c): + use(marker) + """, known_vars={"when", "use"}) + assert "marker" in caps + + def test_inner_when_decorator_arg_propagates(self): + # The cown argument to the nested @when is evaluated in the outer + # frame, so it must also be captured. + caps = self._captures("""\ + def outer(): + @when(other_cown) + def _(x): + pass + """, known_vars={"when"}) + assert "other_cown" in caps + + def test_inner_when_locals_not_captured(self): + # Names that are local/params of the inner @when should NOT leak out. + caps = self._captures("""\ + def outer(): + @when(c) + def _(c): + x = 1 + use(x, c) + """, known_vars={"when", "use"}) + assert caps == {"c"} + + def test_plain_nested_def_unchanged(self): + # A plain (non-@when) nested def keeps its opaque treatment: names + # used only inside its body do not surface in the outer's captures. 
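+        # (Rationale, per the class docstring: a plain def stays an
+        #  ordinary closure in the transpiled module, so Python itself
+        #  resolves ``inner_only`` at call time; only @when bodies are
+        #  lifted into whencall(...) and need explicit forwarding.)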
+ caps = self._captures("""\ + def outer(): + def helper(): + return inner_only + """) + assert caps == set() + + def test_deeply_nested_when_propagates(self): + # A name referenced in a doubly-nested @when must propagate all the + # way out to the top-level behavior. + caps = self._captures("""\ + def outer(c): + @when(c) + def _(c): + @when(c) + def _(c): + use(deep_marker) + """, known_vars={"when", "use"}) + assert "deep_marker" in caps + def test_mixed_locals_and_captures(self): caps = self._captures("""\ def f(a):