Skip to content

Merge dev to main#1352

Merged
zbigniewsobiecki merged 15 commits into
mainfrom
dev
May 11, 2026
Merged

Merge dev to main#1352
zbigniewsobiecki merged 15 commits into
mainfrom
dev

Conversation

@zbigniewsobiecki
Copy link
Copy Markdown
Member

Routine dev → main promotion. 15 commits across 3 feature PRs + review-feedback follow-ups.

Feature PRs landing

Review-feedback commits on top

  • c483f440 codex: treat reasoning_output_tokens as output subset, not extra counter
  • 4de56aa0 codex: return all-zero delta on regression; constrain resolver to priced models
  • e21c3393 codex: add gpt-5.4-mini to CODEX_MODELS and MODEL_PRICING
  • 5d8e3c30 check-suite: distinguish check-suite rechecks from mergeability rechecks
  • 438748ef triggers: failure gate bypass + respond-to-ci cross-process dedup
  • 76175c02 triggers: failure defer recheck + extend dedup TTL
  • e4b6929c extend respond-to-ci dedup TTL to cover full worker window + update docs
  • Plus 2 minor follow-ups (b31dcddc, 26747475)

Risk

Largest change set since 2026-05-09. Mitigations:

  • All trigger changes hit pre-push (3110+ tests) + manual conformance pass at every step.
  • Codex changes pin TDD regression tests for both delta-computation and shell-state false-positive paths.
  • Three active prod monitors armed against the ucho pipeline for review-trigger anomalies.

🤖 Generated with Claude Code

zbigniewsobiecki and others added 15 commits May 11, 2026 15:40
Codex cost logging was silently broken in three compounding ways: (1)
`codex exec --json` never emits cost_usd fields (openai/codex#17539), so
the parser branches reading total_cost_usd / cost_usd / usage.cost_usd
were dead code; (2) `usage` on turn.completed is the CUMULATIVE session
total, not per-turn — persisting it per row gave dashboards O(N²)
inflated totals when summed; (3) the pricing table had no entry for
gpt-5.4 / gpt-5.3-codex / gpt-5.3-codex-spark / codex-mini-latest, so
even if cost-from-tokens were wired, calculateCost returned 0.

Now: the engine tracks a run-level cumulative high-water mark, computes
per-turn deltas on each turn.completed (clamping out-of-order regressions
to 0), folds reasoning_output_tokens into output for billing, and
computes cost via calculateCost('openai:'+model, delta) — mirroring the
working claude-code path. context.cost accumulates additively instead of
overwriting. Public OpenAI metered rates added for every Codex model;
gpt-5.3-codex-spark proxies to gpt-5.3-codex rates since it has no
metered API listing. A coverage test fails CI when a new Codex model is
added without a pricing row.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… counter

OpenAI/Codex usage reports reasoning_output_tokens as a breakdown of
output_tokens (not an additional counter). A turn with output_tokens:47
and reasoning_output_tokens:41 has 47 total output tokens, 41 of which
are reasoning. The previous code computed billableOutput = outputTokens +
reasoningTokens, inflating a turn with {output:200, reasoning:800} to
1000 billed output tokens instead of the correct 200.

Fix: use delta.outputTokens directly for billing and outputForRow. Store
delta.reasoningTokens in the response payload for observability only.

Update Test 3 in codex-cost.test.ts to use realistic values where
reasoning_output_tokens is a subset of output_tokens and assert that
outputTokens (not outputTokens+reasoning) is billed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…o priced models

Two accounting paths could silently persist wrong or zero Codex cost:

1. computeTurnDelta returned the per-field Math.max(0, curr-prev) delta even when
   the backwards-cumulative guard tripped. If inputTokens regressed while
   outputTokens increased, the positive outputTokens delta was still persisted —
   and double-counted on the next valid event because the high-water mark was not
   advanced. Fix: return an all-zero delta when any counter regresses, discarding
   the entire event rather than a mix of zero and positive fields.

2. resolveCodexModel() accepted any openai:* string and any gpt-*codex* pattern,
   but MODEL_PRICING only covers CODEX_MODEL_IDS. A project configured with e.g.
   openai:gpt-5-codex would run through the Codex engine and then persist cost=0
   because calculateCost('openai:gpt-5-codex', ...) returns 0. Fix: constrain the
   resolver to CODEX_MODEL_IDS (with optional openai: prefix); remove the wildcard
   gpt-*codex* branch entirely.

Tests added:
- codex-cost.test.ts Test 5b: mixed-regression scenario (inputTokens down, outputTokens
  up) verifies all-zero delta and no double-counting on the following valid event.
- codex.test.ts: two new resolveCodexModel tests — openai:gpt-5-codex and
  gpt-5-turbo-codex both throw rather than silently accepting unpriced models.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Restores support for openai:gpt-5.4-mini which was previously accepted
via the old openai:* wildcard in resolveCodexModel but became inaccessible
after the resolver was tightened to only allow models in CODEX_MODEL_IDS.

Adds gpt-5.4-mini to CODEX_MODELS (models.ts) and a pricing row at
input=$0.75, cachedInput=$0.075, output=$4.50 per 1M (from
developers.openai.com/api/docs/models/gpt-5.4-mini, 2026-05-11).

The existing llmMetrics-codex.test.ts coverage test automatically pins
the new row — failing loudly if pricing is ever dropped.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…k-suite completion

Two distinct production bugs surfaced on project ucho (2026-05-11) where
the review agent failed to auto-dispatch after a PR's CI passed, both
requiring manual review requests to unblock.

Bug 1 (PR #394, MNG-683): check-suite-success deferred-on-incomplete
returned a plain skip and relied on GitHub firing another check_suite
event when the final suite finished. When the Actions API lagged
webhook delivery, the API still showed the final suite as in_progress
at query time AND no further webhook arrived — review never fired.
Fix: schedule a deferredRecheck (30s, coalesce-keyed on owner/repo/PR/
head-SHA) so the trigger re-evaluates against fresh API state.

Bug 2 (PR #393, MNG-691): stacked PR targeting feature/MNG-690-… was
rejected by the base-branch gate even though cascade authored it. The
gate exists to filter human or third-party-bot drive-bys, which the
upstream persona check already handles. Fix: skip the base-branch gate
for cascade-authored PRs; non-cascade authors still get the rejection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sful runs

When codex's persistent bash session breaks with `write_stdin failed:
stdin is closed`, the engine used to fail the run unconditionally — even
when all real work was already captured. Prod evidence (cascade/MNG-718,
run f801342b, 2026-05-11): an implementation agent opened PR #1350, ran
the full verification suite, filed a follow-up friction ticket, and was
still marked failed because the signal fired during session-close. This
cost cascade the post-completion review dispatch and surfaced a
misleading "agent failed" comment via aaight42.

The discriminator is whether success evidence was captured BEFORE the
signal: if `prUrl` is set and `finalOutput` is non-empty, the corruption
was late and the work is real → log WARN, fall through to success.
Otherwise the corruption may have masked the agent's work → fail loudly
so ops retry (preserved behavior from PR #1245).

Extracted classifyShellCorruption() to keep execute()'s cognitive
complexity unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ent-body-file

fix(scm): support body-file for update PR comment
…ging

fix(codex): compute cost from token deltas, not nonexistent cost_usd
…rechecks

Adds `recheckKind: 'check-suite'` to `TriggerResult.deferredRecheck` so
the router stamps `checkSuiteRecheckAttempt` on the job (not
`mergeabilityRecheckAttempt`).  On the worker side,
`processGitHubWebhook` now reschedules another coalesced delayed job
when `isCheckSuiteRecheckJob=true` and the trigger is still stale,
rather than silently dropping the dispatch via the
`mergeability_recheck_exhausted` path.

The recheck-handling logic is extracted into a `handleRecheckResult`
helper to keep `processGitHubWebhook` within the lint complexity budget.

Closes the silent-drop bug where a 30s check-suite recheck that found
the Actions API still stale would never trigger review again.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…d-to-ci cross-process dedup

Issue 1: CheckSuiteFailureTrigger used `gateCascadePersona ?? gateBaseBranch` which
evaluated gateBaseBranch even for cascade-authored PRs. Cascade-authored stacked PRs
targeting a non-base branch were incorrectly skipped instead of dispatching respond-to-ci.
Fix: replace the `??` chain with an explicit gateCascadePersona check, mirroring the
authorIsCascade bypass already present in decideCheckSuiteGates.

Issue 2: The success handler's 30s deferred recheck fired in a fresh worker container
with an empty in-process fixAttempts Map — it had no memory of what the router had already
dispatched. This allowed duplicate respond-to-ci agents for the same PR+SHA. Fix: add
Redis-backed dedup (respond-to-ci-dedup.ts) to dispatchRespondToCi, mirroring the
review-dispatch-dedup pattern.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two gaps closed from the 2026-05-11 review (nhopeatall, PR #1351):

1. check-suite-failure.ts — the `defer` branch now returns a deferredRecheck
   (30s delay, recheckKind: 'check-suite') instead of a plain skip(), matching
   the API-lag fix already applied to check-suite-success. Repro: final
   check_suite.completed webhook with conclusion=failure arrives while Actions
   API still reports a check as in_progress; no follow-up webhook fires, so
   the plain skip would permanently lose respond-to-ci dispatch.

2. respond-to-ci-dedup.ts — TTL extended from 2 min to 10 min to cover the
   full duplicate window: 30s deferred-recheck delay + up to 5min worker-slot
   wait (slotWaitTimeoutMs default) + buffer. The 2-min key could expire before
   a recheck job even started under backlog, allowing the recheck to reclaim
   the key and launch a second fix agent for the same PR+SHA.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ate docs

Increases DEDUP_TTL_SEC from 10 min to 35 min to cover the full 30-minute
workerTimeoutMs window (+ 5-min buffer). The 10-min key could expire while
the original respond-to-ci job was still running, allowing a check-suite
recheck to claim the same PR+SHA and launch a duplicate fix agent.

Also updates AGENTS.md/CLAUDE.md, src/triggers/README.md, and
docs/architecture/{01,02,03,10} to document the two deferred-recheck
kinds (mergeabilityRecheckAttempt one-shot vs checkSuiteRecheckAttempt
safe-rescheduling), replacing stale descriptions that only mentioned the
mergeability path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@zbigniewsobiecki zbigniewsobiecki merged commit 13f3cb5 into main May 11, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant