Merge dev to main#1352
Merged
Merged
Conversation
Codex cost logging was silently broken in three compounding ways: (1) `codex exec --json` never emits cost_usd fields (openai/codex#17539), so the parser branches reading total_cost_usd / cost_usd / usage.cost_usd were dead code; (2) `usage` on turn.completed is the CUMULATIVE session total, not per-turn — persisting it per row gave dashboards O(N²) inflated totals when summed; (3) the pricing table had no entry for gpt-5.4 / gpt-5.3-codex / gpt-5.3-codex-spark / codex-mini-latest, so even if cost-from-tokens were wired, calculateCost returned 0. Now: the engine tracks a run-level cumulative high-water mark, computes per-turn deltas on each turn.completed (clamping out-of-order regressions to 0), folds reasoning_output_tokens into output for billing, and computes cost via calculateCost('openai:'+model, delta) — mirroring the working claude-code path. context.cost accumulates additively instead of overwriting. Public OpenAI metered rates added for every Codex model; gpt-5.3-codex-spark proxies to gpt-5.3-codex rates since it has no metered API listing. A coverage test fails CI when a new Codex model is added without a pricing row. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… counter
OpenAI/Codex usage reports reasoning_output_tokens as a breakdown of
output_tokens (not an additional counter). A turn with output_tokens:47
and reasoning_output_tokens:41 has 47 total output tokens, 41 of which
are reasoning. The previous code computed billableOutput = outputTokens +
reasoningTokens, inflating a turn with {output:200, reasoning:800} to
1000 billed output tokens instead of the correct 200.
Fix: use delta.outputTokens directly for billing and outputForRow. Store
delta.reasoningTokens in the response payload for observability only.
Update Test 3 in codex-cost.test.ts to use realistic values where
reasoning_output_tokens is a subset of output_tokens and assert that
outputTokens (not outputTokens+reasoning) is billed.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…o priced models
Two accounting paths could silently persist wrong or zero Codex cost:
1. computeTurnDelta returned the per-field Math.max(0, curr-prev) delta even when
the backwards-cumulative guard tripped. If inputTokens regressed while
outputTokens increased, the positive outputTokens delta was still persisted —
and double-counted on the next valid event because the high-water mark was not
advanced. Fix: return an all-zero delta when any counter regresses, discarding
the entire event rather than a mix of zero and positive fields.
2. resolveCodexModel() accepted any openai:* string and any gpt-*codex* pattern,
but MODEL_PRICING only covers CODEX_MODEL_IDS. A project configured with e.g.
openai:gpt-5-codex would run through the Codex engine and then persist cost=0
because calculateCost('openai:gpt-5-codex', ...) returns 0. Fix: constrain the
resolver to CODEX_MODEL_IDS (with optional openai: prefix); remove the wildcard
gpt-*codex* branch entirely.
Tests added:
- codex-cost.test.ts Test 5b: mixed-regression scenario (inputTokens down, outputTokens
up) verifies all-zero delta and no double-counting on the following valid event.
- codex.test.ts: two new resolveCodexModel tests — openai:gpt-5-codex and
gpt-5-turbo-codex both throw rather than silently accepting unpriced models.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Restores support for openai:gpt-5.4-mini which was previously accepted via the old openai:* wildcard in resolveCodexModel but became inaccessible after the resolver was tightened to only allow models in CODEX_MODEL_IDS. Adds gpt-5.4-mini to CODEX_MODELS (models.ts) and a pricing row at input=$0.75, cachedInput=$0.075, output=$4.50 per 1M (from developers.openai.com/api/docs/models/gpt-5.4-mini, 2026-05-11). The existing llmMetrics-codex.test.ts coverage test automatically pins the new row — failing loudly if pricing is ever dropped. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…k-suite completion Two distinct production bugs surfaced on project ucho (2026-05-11) where the review agent failed to auto-dispatch after a PR's CI passed, both requiring manual review requests to unblock. Bug 1 (PR #394, MNG-683): check-suite-success deferred-on-incomplete returned a plain skip and relied on GitHub firing another check_suite event when the final suite finished. When the Actions API lagged webhook delivery, the API still showed the final suite as in_progress at query time AND no further webhook arrived — review never fired. Fix: schedule a deferredRecheck (30s, coalesce-keyed on owner/repo/PR/ head-SHA) so the trigger re-evaluates against fresh API state. Bug 2 (PR #393, MNG-691): stacked PR targeting feature/MNG-690-… was rejected by the base-branch gate even though cascade authored it. The gate exists to filter human or third-party-bot drive-bys, which the upstream persona check already handles. Fix: skip the base-branch gate for cascade-authored PRs; non-cascade authors still get the rejection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sful runs When codex's persistent bash session breaks with `write_stdin failed: stdin is closed`, the engine used to fail the run unconditionally — even when all real work was already captured. Prod evidence (cascade/MNG-718, run f801342b, 2026-05-11): an implementation agent opened PR #1350, ran the full verification suite, filed a follow-up friction ticket, and was still marked failed because the signal fired during session-close. This cost cascade the post-completion review dispatch and surfaced a misleading "agent failed" comment via aaight42. The discriminator is whether success evidence was captured BEFORE the signal: if `prUrl` is set and `finalOutput` is non-empty, the corruption was late and the work is real → log WARN, fall through to success. Otherwise the corruption may have masked the agent's work → fail loudly so ops retry (preserved behavior from PR #1245). Extracted classifyShellCorruption() to keep execute()'s cognitive complexity unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ent-body-file fix(scm): support body-file for update PR comment
…ging fix(codex): compute cost from token deltas, not nonexistent cost_usd
…rechecks Adds `recheckKind: 'check-suite'` to `TriggerResult.deferredRecheck` so the router stamps `checkSuiteRecheckAttempt` on the job (not `mergeabilityRecheckAttempt`). On the worker side, `processGitHubWebhook` now reschedules another coalesced delayed job when `isCheckSuiteRecheckJob=true` and the trigger is still stale, rather than silently dropping the dispatch via the `mergeability_recheck_exhausted` path. The recheck-handling logic is extracted into a `handleRecheckResult` helper to keep `processGitHubWebhook` within the lint complexity budget. Closes the silent-drop bug where a 30s check-suite recheck that found the Actions API still stale would never trigger review again. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…d-to-ci cross-process dedup Issue 1: CheckSuiteFailureTrigger used `gateCascadePersona ?? gateBaseBranch` which evaluated gateBaseBranch even for cascade-authored PRs. Cascade-authored stacked PRs targeting a non-base branch were incorrectly skipped instead of dispatching respond-to-ci. Fix: replace the `??` chain with an explicit gateCascadePersona check, mirroring the authorIsCascade bypass already present in decideCheckSuiteGates. Issue 2: The success handler's 30s deferred recheck fired in a fresh worker container with an empty in-process fixAttempts Map — it had no memory of what the router had already dispatched. This allowed duplicate respond-to-ci agents for the same PR+SHA. Fix: add Redis-backed dedup (respond-to-ci-dedup.ts) to dispatchRespondToCi, mirroring the review-dispatch-dedup pattern. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two gaps closed from the 2026-05-11 review (nhopeatall, PR #1351): 1. check-suite-failure.ts — the `defer` branch now returns a deferredRecheck (30s delay, recheckKind: 'check-suite') instead of a plain skip(), matching the API-lag fix already applied to check-suite-success. Repro: final check_suite.completed webhook with conclusion=failure arrives while Actions API still reports a check as in_progress; no follow-up webhook fires, so the plain skip would permanently lose respond-to-ci dispatch. 2. respond-to-ci-dedup.ts — TTL extended from 2 min to 10 min to cover the full duplicate window: 30s deferred-recheck delay + up to 5min worker-slot wait (slotWaitTimeoutMs default) + buffer. The 2-min key could expire before a recheck job even started under backlog, allowing the recheck to reclaim the key and launch a second fix agent for the same PR+SHA. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ate docs
Increases DEDUP_TTL_SEC from 10 min to 35 min to cover the full 30-minute
workerTimeoutMs window (+ 5-min buffer). The 10-min key could expire while
the original respond-to-ci job was still running, allowing a check-suite
recheck to claim the same PR+SHA and launch a duplicate fix agent.
Also updates AGENTS.md/CLAUDE.md, src/triggers/README.md, and
docs/architecture/{01,02,03,10} to document the two deferred-recheck
kinds (mergeabilityRecheckAttempt one-shot vs checkSuiteRecheckAttempt
safe-rescheduling), replacing stale descriptions that only mentioned the
mergeability path.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…fer-and-stacked-prs
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Routine dev → main promotion. 15 commits across 3 feature PRs + review-feedback follow-ups.
Feature PRs landing
cost_usdextraction; folds reasoning tokens into output for billing.--body-fileforupdate-pr-commentgadget (MNG-718).Review-feedback commits on top
c483f440codex: treat reasoning_output_tokens as output subset, not extra counter4de56aa0codex: return all-zero delta on regression; constrain resolver to priced modelse21c3393codex: add gpt-5.4-mini to CODEX_MODELS and MODEL_PRICING5d8e3c30check-suite: distinguish check-suite rechecks from mergeability rechecks438748eftriggers: failure gate bypass + respond-to-ci cross-process dedup76175c02triggers: failure defer recheck + extend dedup TTLe4b6929cextend respond-to-ci dedup TTL to cover full worker window + update docsb31dcddc,26747475)Risk
Largest change set since 2026-05-09. Mitigations:
🤖 Generated with Claude Code