chore: release - merge dev into main#1293
Merged
Merged
Conversation
* test(triggers): expand trigger contract conformance coverage * fix(test): address review feedback on trigger conformance fixtures and guard - Mirror production `allPassing` predicate in `createCheckSuiteStatus`: require at least one check and treat success/skipped/neutral as passing, matching `githubClient.getCheckSuiteStatus()` exactly. The neutral conformance case now correctly yields `allPassing: true`. - Extend the raw-trigger-literal guard regex to catch backtick template literals in addition to single- and double-quoted strings, closing the syntax-level bypass identified in the review. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Cascade Bot <bot@cascade.dev> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…er failure table - src/triggers/README.md: buildDeferredRecheckResult authoring guidance now explains that bare re-dispatch is a GitHub-only contract today. GitHub's buildJob() strips triggerResult and sets mergeabilityRecheckAttempt; non-GitHub adapters (Trello, JIRA, Linear, Sentry) always embed triggerResult, so their workers use resolveTriggerResult() with the pre-resolved result and skip registry dispatch. A non-GitHub handler returning buildDeferredRecheckResult would schedule a job that reuses the same agentType: null result. - docs/architecture/02-webhook-pipeline.md: split the combined Redis failure row into two distinct rows. PM immediate-dispatch and coalesced-dispatch failures call onBlocked and return "Failed to enqueue/schedule job to Redis". Deferred re-check schedule failures capture Sentry under deferred_recheck_schedule_failure, skip onBlocked, and still return "Deferred re-check scheduled" (same as success). Updated the Coalescing and deferred re-check prose to match. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…re docs The "Deferred re-checks" section in 03-trigger-system.md and the "Deferred re-check exhaustion" section in 10-resilience.md both described worker re-dispatch as a generic contract, but only GitHubRouterAdapter.buildJob() strips triggerResult and sets mergeabilityRecheckAttempt. Non-GitHub adapters embed triggerResult so resolveTriggerResult() short-circuits, and the registry is never re-invoked. Updated both paragraphs to mirror the GitHub-only qualification already present in src/triggers/README.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…all active contributor docs The builder summary table, TriggerResult JSDoc, and CLAUDE.md all described deferred re-check as a generic re-dispatch contract. In practice, bare re-dispatch only works for GitHub: GitHubRouterAdapter.buildJob() strips triggerResult and sets mergeabilityRecheckAttempt, while non-GitHub adapters embed the pre-resolved result so resolveTriggerResult() short-circuits without registry dispatch. Update all three surfaces to state this constraint clearly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rately The paragraph added in this PR overstated the compensation invariant by implying all pre-worker failures flow through BullMQ `failed` → `releaseLocksForFailedJob`. That path only applies post-enqueue. Redis enqueue/schedule failures (`addJob` / `scheduleCoalescedJob` throwing before `markImmediateDispatchEnqueued` / `markCoalescedDispatchEnqueued`) have no BullMQ job and therefore no `failed` event. Lock protection on that path comes from the success-first ordering contract: marks happen only after a successful enqueue, so a Redis failure leaves no lock to release. `onBlocked()` is the inline compensation. Split the single paragraph into two distinct boundaries — enqueue/schedule failures and post-enqueue dispatch failures — consistent with the router outcomes table in docs/architecture/02-webhook-pipeline.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ndary - Update trigger-resolution.ts "Used By" column from "Sentry (GitHub and PM use inline logic)" to "GitHub, PM, Sentry" — all three worker handlers call resolveTriggerResult() (PM: webhook-handler.ts:69, GitHub: webhook-handler.ts:230, Sentry: webhook-handler.ts:53). - Replace "[inline] if triggerResult → use it, else dispatchTrigger(...)" in the GitHub flow diagram with the actual call: resolveTriggerResult(registry, ctx, triggerResult) → result. - Amend "Every failed dispatch path flows through the BullMQ failed event" to "Post-enqueue dispatch failures … flow through the BullMQ failed event" with a forward reference to the split below, removing the internal contradiction with the two-boundary section added earlier. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ration surface)
Closes the silent-skip path for Sentry's "Internal Integration" / Custom
Webhook surface. Prod 2026-05-09: a wedged-lock-canary alert fired in the
cascade Sentry project, the alerting agent was enabled, but no agent ran.
Webhook log id `fbdc6d87-b962-444c-8a2a-a9452a74ff71` shows
`processed=false, decisionReason="Event unparseable or not processable"` —
the trigger was never invoked.
Root cause: `src/router/adapters/sentry.ts:31` whitelisted only
`['event_alert', 'metric_alert']` (Sentry Alert Rule surfaces). The webhook
arrived with `Sentry-Hook-Resource: issue` (Internal Integration default
surface — the natural way users wire Sentry → cascade). Spec 019 was
scoped to event_alert; the issue-lifecycle path was deferred and never
landed. Users who configured Sentry the natural way got silent skips for
every issue.
This adds end-to-end support, mirroring the event_alert pattern:
- Router adapter accepts `'issue'` resource + new test asserting it parses.
- New `SentryIssueLifecycleTrigger` (matches `resource: 'issue'` +
`action: 'created'`) fires the alerting agent. Resolved/archived/etc.
actions are deferred (would auto-close the cascade card; out of scope).
- Distinct AlertSource literal `'sentry-issue'` so the
`(project_id, external_source, external_id)` partial-unique index on
`pr_work_items` doesn't collide if the same Sentry issue arrives via
both surfaces (event_alert and issue) — each surface materializes its
own card.
- New `formatSentryIssueLifecycleCardBody` builds AlertHints from
`data.issue.{title, web_url, level, shortId, culprit, metadata.{filename, function}}`.
Mirror of `formatSentryCardBody` adapted to the issue-lifecycle payload shape.
- Worker-side `processSentryWebhook` extends the materializer dispatch with
a third branch keyed on `agentInput.triggerEvent === 'alerting:issue-lifecycle'`.
Same AlertSlotMissingError graceful-skip + transient-PM-error retry semantics
as the existing two branches.
- `SentryIssuePayload` type updated to match the actual Sentry webhook shape
(nested `data.issue.{...}` instead of flattened `data.{...}` — the captured
prod fixture confirmed the existing type was wrong).
Drive-by lint cleanup (per request): refactored `materializeAlertWorkItem`
to extract `reuseOrLazyHealMapping` and `pollForConcurrentWinner` helpers,
bringing the parent function under the cognitive-complexity ceiling.
Verification:
- 9112 unit tests passing (3 new test files: issue-lifecycle-format,
issue-lifecycle handler, plus extensions to sentry-webhook-handler and
the router/adapters/sentry tests). Captured live prod fixture used as
the regression baseline.
- Lint clean (0 errors, 0 warnings).
- Typecheck clean.
Operator notes:
- Cascade project's PM `lists.alerts` (Trello) / `statuses.alerts`
(JIRA, Linear) must be configured for materialisation to actually
create a card. The pre-flight validation rule at
`src/triggers/shared/integration-validation.ts` already emits a
`pm`-category error when alerting is enabled but the slot is unset —
unchanged; same message will fire for the new `'sentry-issue'` source.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tecture-contracts
…inition Adds `alerting:issue-lifecycle` to `src/agents/definitions/alerting.yaml` so `getResolvedTriggerConfig` can find the event. Without this entry the lookup returned `null`, causing `checkTriggerEnabledWithParams` to always see `enabled=false` — leaving the prod incident path effectively unfixed even though the router now accepts `resource: issue` webhooks. Also adds `TRIGGER_EVENTS.ALERTING.ISSUE_LIFECYCLE` to the events catalog and updates the handler + webhook-handler to use the constant (required by the trigger-event-string consistency static guard for new handlers). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…lifecycle-trigger
…+ shell-state corruption Closes the silent-failure path identified by prod run d8e31665 (cascade implementation, 2026-05-09): the agent created PR #1290 successfully but the run was marked failed with "Agent completed but no authoritative PR creation was recorded" — and operators had zero signal of why. Three independent buggy layers compounded; fix all three so any one being robust catches the failure mode. ## Layer 1 — gadget output bloat (the trigger; cascade-specific) Cascade's lefthook pre-push runs `npm run test:fast` (159 files / 2981 tests / ~34 s / ~97 KB of output). That output was captured into the gadget result's `pushOutput` field by `cascade-tools scm create-pr`. Codex's tool-result parser couldn't reliably extract the JSON envelope buried under that volume, retried the call, and the resulting concurrent invocations raced against the same sidecar path leaving prUrl missing. Fix: truncate `pushOutput` and `commitOutput` at 4 KB at the gadget result boundary. Full hook output already streams through `runCommand`'s heartbeat into `LLMIST_LOG_FILE` for operator visibility — only the agent-visible result-stream copy is capped. New head + truncation-marker + tail format preserves both ends of the output for context. ## Layer 2 — sidecar contract gap (the fault; provider-agnostic) `writePRSidecar` accepted `prUrl: string` but never validated. If the caller passed undefined for any reason, `JSON.stringify({prUrl: undefined, ...})` silently omitted the key and produced a half-baked sidecar. Downstream hydration reported "missing required fields { hasPrUrl: false }" with no signal of what the writer actually persisted. Fix: validate at the write boundary. Refuse to write + log ERROR with full input context when prUrl is missing/non-string or prNumber is non-finite. Same hardening for writeReviewSidecar's reviewUrl. Caller bugs now produce a precise log entry instead of a silent contract violation. ## Layer 3 — diagnostic gap (operator visibility) `hydratePrSidecar` logged "PR sidecar missing required fields { hasPrUrl: false }" with no further context — no path, no parsed keys, no actual values. Operators hit Loki archaeology to determine whether the file was empty, malformed, or a valid JSON without prUrl. Fix: hydratePrSidecar now re-reads the raw payload and dumps `sidecarPath`, `rawSidecarKeys`, `rawSidecarPrUrl`, `rawSidecarPrNumber` so a single log line distinguishes empty-file / valid-JSON-without-prUrl / malformed-JSON. postProcessResult emits a Sentry capture under tag `pr_sidecar_invalid` (with engine + agentType + identifier extras) when authoritative PR evidence is missing — gives ops a single dashboard for prod frequency. ## Layer 4 — codex shell-state corruption (the propagation; codex-specific) Codex's persistent bash session breaks with `ERROR codex_core::tools::router: error=write_stdin failed: stdin is closed for this session`. Once that signal fires, every subsequent command in the run inherits a stale stdout buffer (we observed lint output from one command bleeding into a later post-comment call, plus sidecar writes racing). Codex itself exits cleanly (exit=0) — the stderr signal is the only evidence the session corrupted. Fix: detect the signal in stderr at run completion. When present + exit=0, return `success: false` with explicit error "codex shell-state corrupted: write_stdin failed (stdin closed for session)" + log ERROR. Ops retry against a fresh codex session instead of trusting potentially-corrupted state. Conservative: false-positive case (benign stderr) pinned by a "passes through cleanly when stderr is benign" regression test. ## Tests - `tests/unit/gadgets/github/core/createPR.test.ts` (+3 cases) — 97 KB pushOutput → ~4 KB with marker; 50 KB commitOutput same; small output unchanged. - `tests/unit/gadgets/session/core/sidecar.test.ts` (+4 cases) — refuses empty prUrl; refuses non-string prUrl; refuses non-finite prNumber; refuses empty reviewUrl. Each pairs an ERROR log assertion with a no-write filesystem assertion. - `tests/unit/backends/sidecarManager.test.ts` (+2 cases) — diagnostic WARN includes rawSidecarKeys + rawSidecarPrUrl for valid-JSON-without-prUrl case AND for empty-file case. - `tests/unit/backends/postProcess.test.ts` (+2 cases) — Sentry capture fires under tag `pr_sidecar_invalid` when authoritative PR evidence is missing; does NOT fire on the happy path. - `tests/unit/backends/codex.test.ts` (+2 cases) — codex shell-state signal in stderr → success=false + ERROR log; benign stderr does not regress. ## Verification - `npm test` — 9141 / 9141 passing - `npm run lint` — clean - `npm run typecheck` — clean - All 4 fixes are additive; no behavioral change on healthy paths Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`readRawPRSidecar` returned `null` for all three failure modes (missing file, empty/whitespace file, malformed JSON), so every 'PR sidecar missing required fields' WARN emitted the same `rawSidecarKeys: null` shape — operators could not tell the cases apart without Loki archaeology. Replace the nullable return with a discriminated union (`RawSidecarDiagnostic`) with four statuses: 'missing', 'empty', 'malformed', 'parsed'. Each case carries its own supplementary fields: - 'empty': rawByteLength (confirms zero-byte truncation vs. missing) - 'malformed': rawByteLength + parseError + rawPreview (shows how far the write got and what the JSON syntax error was) - 'parsed': full key list + prUrl/prNumber values (the original Layer-3 signal) - 'missing': status alone (file was never written) Update the two existing regression tests to assert the new `rawSidecarStatus` field. Add a third test for the truncated-mid-write (malformed) case to pin that it is distinguishable from the empty-file case. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ailure Extend the 4 KB truncation to the error paths of `stageAndCommit` and `pushBranch`. Previously, only the success-path `commitOutput`/`pushOutput` fields were capped; a pre-commit or pre-push hook failure that emits 97+ KB of test output still embedded the full output in `err.message`, which `createCLICommand` serialises into the JSON error envelope — leaving the same Codex parser/retry bloat path open. Full hook output stays in the worker engine log (LLMIST_LOG_FILE) for operator visibility. Adds two regression tests pinning the failure-path truncation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Automated release PR created by the release workflow.
Commits (17):