Merge dev to main by zbigniewsobiecki · Pull Request #1352 · mongrel-intelligence/cascade

zbigniewsobiecki · 2026-05-11T21:57:43Z

Routine dev → main promotion. 15 commits across 3 feature PRs + review-feedback follow-ups.

Feature PRs landing

fix(codex): compute cost from token deltas, not nonexistent cost_usd #1346 — fix(codex): compute cost from token deltas, not nonexistent cost_usd. Fixes silent zero/inflated cost on every Codex run; adds pricing rows for all CODEX_MODEL_IDS; deletes dead cost_usd extraction; folds reasoning tokens into output for billing.
fix(scm): support body-file for update PR comment #1350 — fix(scm): support --body-file for update-pr-comment gadget (MNG-718).
fix: review-trigger reliability + codex late shell-state false-positive #1351 — fix(triggers + codex): three prod bugs from 2026-05-11 — check-suite-success deferred-recheck (PR Merge dev to main #394 / MNG-683), base-branch gate bypass for cascade-authored stacked PRs (PR feat(ci): add manual release workflow to merge dev into main #393 / MNG-691), codex late shell-state false-positive (MNG-718 / PR fix(scm): support body-file for update PR comment #1350).

Review-feedback commits on top

c483f440 codex: treat reasoning_output_tokens as output subset, not extra counter
4de56aa0 codex: return all-zero delta on regression; constrain resolver to priced models
e21c3393 codex: add gpt-5.4-mini to CODEX_MODELS and MODEL_PRICING
5d8e3c30 check-suite: distinguish check-suite rechecks from mergeability rechecks
438748ef triggers: failure gate bypass + respond-to-ci cross-process dedup
76175c02 triggers: failure defer recheck + extend dedup TTL
e4b6929c extend respond-to-ci dedup TTL to cover full worker window + update docs
Plus 2 minor follow-ups (b31dcddc, 26747475)

Risk

Largest change set since 2026-05-09. Mitigations:

All trigger changes hit pre-push (3110+ tests) + manual conformance pass at every step.
Codex changes pin TDD regression tests for both delta-computation and shell-state false-positive paths.
Three active prod monitors armed against the ucho pipeline for review-trigger anomalies.

🤖 Generated with Claude Code

Codex cost logging was silently broken in three compounding ways: (1) `codex exec --json` never emits cost_usd fields (openai/codex#17539), so the parser branches reading total_cost_usd / cost_usd / usage.cost_usd were dead code; (2) `usage` on turn.completed is the CUMULATIVE session total, not per-turn — persisting it per row gave dashboards O(N²) inflated totals when summed; (3) the pricing table had no entry for gpt-5.4 / gpt-5.3-codex / gpt-5.3-codex-spark / codex-mini-latest, so even if cost-from-tokens were wired, calculateCost returned 0. Now: the engine tracks a run-level cumulative high-water mark, computes per-turn deltas on each turn.completed (clamping out-of-order regressions to 0), folds reasoning_output_tokens into output for billing, and computes cost via calculateCost('openai:'+model, delta) — mirroring the working claude-code path. context.cost accumulates additively instead of overwriting. Public OpenAI metered rates added for every Codex model; gpt-5.3-codex-spark proxies to gpt-5.3-codex rates since it has no metered API listing. A coverage test fails CI when a new Codex model is added without a pricing row. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… counter OpenAI/Codex usage reports reasoning_output_tokens as a breakdown of output_tokens (not an additional counter). A turn with output_tokens:47 and reasoning_output_tokens:41 has 47 total output tokens, 41 of which are reasoning. The previous code computed billableOutput = outputTokens + reasoningTokens, inflating a turn with {output:200, reasoning:800} to 1000 billed output tokens instead of the correct 200. Fix: use delta.outputTokens directly for billing and outputForRow. Store delta.reasoningTokens in the response payload for observability only. Update Test 3 in codex-cost.test.ts to use realistic values where reasoning_output_tokens is a subset of output_tokens and assert that outputTokens (not outputTokens+reasoning) is billed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…o priced models Two accounting paths could silently persist wrong or zero Codex cost: 1. computeTurnDelta returned the per-field Math.max(0, curr-prev) delta even when the backwards-cumulative guard tripped. If inputTokens regressed while outputTokens increased, the positive outputTokens delta was still persisted — and double-counted on the next valid event because the high-water mark was not advanced. Fix: return an all-zero delta when any counter regresses, discarding the entire event rather than a mix of zero and positive fields. 2. resolveCodexModel() accepted any openai:* string and any gpt-*codex* pattern, but MODEL_PRICING only covers CODEX_MODEL_IDS. A project configured with e.g. openai:gpt-5-codex would run through the Codex engine and then persist cost=0 because calculateCost('openai:gpt-5-codex', ...) returns 0. Fix: constrain the resolver to CODEX_MODEL_IDS (with optional openai: prefix); remove the wildcard gpt-*codex* branch entirely. Tests added: - codex-cost.test.ts Test 5b: mixed-regression scenario (inputTokens down, outputTokens up) verifies all-zero delta and no double-counting on the following valid event. - codex.test.ts: two new resolveCodexModel tests — openai:gpt-5-codex and gpt-5-turbo-codex both throw rather than silently accepting unpriced models. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Restores support for openai:gpt-5.4-mini which was previously accepted via the old openai:* wildcard in resolveCodexModel but became inaccessible after the resolver was tightened to only allow models in CODEX_MODEL_IDS. Adds gpt-5.4-mini to CODEX_MODELS (models.ts) and a pricing row at input=$0.75, cachedInput=$0.075, output=$4.50 per 1M (from developers.openai.com/api/docs/models/gpt-5.4-mini, 2026-05-11). The existing llmMetrics-codex.test.ts coverage test automatically pins the new row — failing loudly if pricing is ever dropped. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…k-suite completion Two distinct production bugs surfaced on project ucho (2026-05-11) where the review agent failed to auto-dispatch after a PR's CI passed, both requiring manual review requests to unblock. Bug 1 (PR #394, MNG-683): check-suite-success deferred-on-incomplete returned a plain skip and relied on GitHub firing another check_suite event when the final suite finished. When the Actions API lagged webhook delivery, the API still showed the final suite as in_progress at query time AND no further webhook arrived — review never fired. Fix: schedule a deferredRecheck (30s, coalesce-keyed on owner/repo/PR/ head-SHA) so the trigger re-evaluates against fresh API state. Bug 2 (PR #393, MNG-691): stacked PR targeting feature/MNG-690-… was rejected by the base-branch gate even though cascade authored it. The gate exists to filter human or third-party-bot drive-bys, which the upstream persona check already handles. Fix: skip the base-branch gate for cascade-authored PRs; non-cascade authors still get the rejection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…sful runs When codex's persistent bash session breaks with `write_stdin failed: stdin is closed`, the engine used to fail the run unconditionally — even when all real work was already captured. Prod evidence (cascade/MNG-718, run f801342b, 2026-05-11): an implementation agent opened PR #1350, ran the full verification suite, filed a follow-up friction ticket, and was still marked failed because the signal fired during session-close. This cost cascade the post-completion review dispatch and surfaced a misleading "agent failed" comment via aaight42. The discriminator is whether success evidence was captured BEFORE the signal: if `prUrl` is set and `finalOutput` is non-empty, the corruption was late and the work is real → log WARN, fall through to success. Otherwise the corruption may have masked the agent's work → fail loudly so ops retry (preserved behavior from PR #1245). Extracted classifyShellCorruption() to keep execute()'s cognitive complexity unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ent-body-file fix(scm): support body-file for update PR comment

…ging fix(codex): compute cost from token deltas, not nonexistent cost_usd

…rechecks Adds `recheckKind: 'check-suite'` to `TriggerResult.deferredRecheck` so the router stamps `checkSuiteRecheckAttempt` on the job (not `mergeabilityRecheckAttempt`). On the worker side, `processGitHubWebhook` now reschedules another coalesced delayed job when `isCheckSuiteRecheckJob=true` and the trigger is still stale, rather than silently dropping the dispatch via the `mergeability_recheck_exhausted` path. The recheck-handling logic is extracted into a `handleRecheckResult` helper to keep `processGitHubWebhook` within the lint complexity budget. Closes the silent-drop bug where a 30s check-suite recheck that found the Actions API still stale would never trigger review again. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…d-to-ci cross-process dedup Issue 1: CheckSuiteFailureTrigger used `gateCascadePersona ?? gateBaseBranch` which evaluated gateBaseBranch even for cascade-authored PRs. Cascade-authored stacked PRs targeting a non-base branch were incorrectly skipped instead of dispatching respond-to-ci. Fix: replace the `??` chain with an explicit gateCascadePersona check, mirroring the authorIsCascade bypass already present in decideCheckSuiteGates. Issue 2: The success handler's 30s deferred recheck fired in a fresh worker container with an empty in-process fixAttempts Map — it had no memory of what the router had already dispatched. This allowed duplicate respond-to-ci agents for the same PR+SHA. Fix: add Redis-backed dedup (respond-to-ci-dedup.ts) to dispatchRespondToCi, mirroring the review-dispatch-dedup pattern. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Two gaps closed from the 2026-05-11 review (nhopeatall, PR #1351): 1. check-suite-failure.ts — the `defer` branch now returns a deferredRecheck (30s delay, recheckKind: 'check-suite') instead of a plain skip(), matching the API-lag fix already applied to check-suite-success. Repro: final check_suite.completed webhook with conclusion=failure arrives while Actions API still reports a check as in_progress; no follow-up webhook fires, so the plain skip would permanently lose respond-to-ci dispatch. 2. respond-to-ci-dedup.ts — TTL extended from 2 min to 10 min to cover the full duplicate window: 30s deferred-recheck delay + up to 5min worker-slot wait (slotWaitTimeoutMs default) + buffer. The 2-min key could expire before a recheck job even started under backlog, allowing the recheck to reclaim the key and launch a second fix agent for the same PR+SHA. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ate docs Increases DEDUP_TTL_SEC from 10 min to 35 min to cover the full 30-minute workerTimeoutMs window (+ 5-min buffer). The 10-min key could expire while the original respond-to-ci job was still running, allowing a check-suite recheck to claim the same PR+SHA and launch a duplicate fix agent. Also updates AGENTS.md/CLAUDE.md, src/triggers/README.md, and docs/architecture/{01,02,03,10} to document the two deferred-recheck kinds (mergeabilityRecheckAttempt one-shot vs checkSuiteRecheckAttempt safe-rescheduling), replacing stale descriptions that only mentioned the mergeability path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…fer-and-stacked-prs

zbigniewsobiecki and others added 15 commits May 11, 2026 15:40

fix: address feedback

b31dcdd

fix(scm): support body-file for update PR comment

2674747

Merge pull request #1350 from mongrel-intelligence/fix/update-pr-comm…

4f8c6fa

…ent-body-file fix(scm): support body-file for update PR comment

Merge pull request #1346 from mongrel-intelligence/fix/codex-cost-log…

fc563a8

…ging fix(codex): compute cost from token deltas, not nonexistent cost_usd

Merge pull request #1351 from mongrel-intelligence/fix/check-suite-de…

b52675f

…fer-and-stacked-prs

zbigniewsobiecki merged commit 13f3cb5 into main May 11, 2026
9 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Merge dev to main#1352

Merge dev to main#1352
zbigniewsobiecki merged 15 commits into
mainfrom
dev

zbigniewsobiecki commented May 11, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

zbigniewsobiecki commented May 11, 2026

Feature PRs landing

Review-feedback commits on top

Risk

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant