[safe-output-health] Safe Output Health Report - 2026-04-10 #25653
Closed
This discussion has been marked as outdated by Safe Output Health Monitor. A newer discussion is available at Discussion #25811.
Executive Summary
Overall Safe Output Success Rate: 85.7% (24/28 runs where the safe_outputs job ran to completion)
Safe Output Job Statistics
add_comment, create_pull_request_review_comment, add_labels, submit_pull_request_review, create_discussion, noop, create_issue, upload_asset, update_pull_request, push_to_pull_request_branch, upload_artifact, dispatch_workflow, set_issue_type, send_slack_message, post_slack_message, create_code_scanning_alert, add_reviewer, create_pull_request, update_issue, resolve_pull_request_review_thread, reply_to_pull_request_review_comment, remove_labels, assign_to_agent, missing_tool

Error Clusters
Cluster 1: upload_artifact - Staging Artifact Not Found
upload_artifact
Sample Error:

Root Cause: The Smoke Claude agent called resolve_pull_request_review_thread with thread_id: "PRRT_kwDOPc1QR856IZIQ" (a thread on PR #25637). The GitHub API returned "Resource not accessible by integration", indicating that the GITHUB_TOKEN or app credential used in the safe_outputs job lacks permission to resolve review threads. This is likely a scope/permission limitation of the token.

Impact: The review thread was not resolved and the safe_outputs job failed. The other safe outputs (create_issue, add_comment, push_to_pull_request_branch, etc.) all executed successfully.
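The failure mode above, where one permission error fails the whole job even though every other output succeeded, can be illustrated with a small sketch. This is not the actual safe_outputs handler code: `process_messages`, `RunResult`, and `soft_fail_types` are hypothetical names, and matching on the error text quoted in this report is an assumption. It shows one way a message loop could downgrade that specific error to a warning and keep processing.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Set

# Error text quoted in the report; matching on it is an assumption of this sketch.
PERMISSION_ERROR = "Resource not accessible by integration"


@dataclass
class RunResult:
    succeeded: List[str] = field(default_factory=list)
    warnings: List[str] = field(default_factory=list)
    errors: List[str] = field(default_factory=list)

    @property
    def conclusion(self) -> str:
        # Only hard errors fail the job; warnings do not.
        return "failure" if self.errors else "success"


def process_messages(
    messages: Iterable[dict],
    handlers: Dict[str, Callable[[dict], None]],
    soft_fail_types: Set[str],
) -> RunResult:
    """Run every message's handler; one failing message never stops the others."""
    result = RunResult()
    for message in messages:
        kind = message["type"]
        try:
            handlers[kind](message)
            result.succeeded.append(kind)
        except Exception as exc:  # a real handler would catch a narrower type
            text = str(exc)
            if kind in soft_fail_types and PERMISSION_ERROR in text:
                # Downgrade: log in GitHub Actions warning syntax and continue.
                print(f"##[warning]{kind}: {text}")
                result.warnings.append(f"{kind}: {text}")
            else:
                result.errors.append(f"{kind}: {text}")
    return result


# Hypothetical stand-ins for real handlers.
def resolve_thread(message: dict) -> None:
    raise RuntimeError(PERMISSION_ERROR)


def add_comment(message: dict) -> None:
    return None


result = process_messages(
    [{"type": "resolve_pull_request_review_thread"}, {"type": "add_comment"}],
    {"resolve_pull_request_review_thread": resolve_thread, "add_comment": add_comment},
    soft_fail_types={"resolve_pull_request_review_thread"},
)
print(result.conclusion)
```

Limiting the downgrade to an explicit set of output types keeps unexpected permission errors on other operations visible as hard failures.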
Additional Observations (Non-Failure)
Handled Git Push Fallback (Q workflow)
Run: §24242379787
The Q workflow's push_to_pull_request_branch hit a GraphQL signed-commit failure (permission error) and fell back to git push, which also failed. The safe output handler then gracefully fell back to creating review issue #25634. The safe_outputs conclusion was success. This is working as designed.

Warning-Level Issues (Not Failures)
Run 24243815112 (Smoke Copilot) also logged:
"Cannot reply to review comment: not running in a pull request context" (reply_to_pull_request_review_comment called from a scheduled run, handled as warning)
"No review context set - cannot submit review" (submit_pull_request_review pending review had no context, handled as warning)

Root Cause Analysis
Infrastructure Issues
upload_artifact staging gap: The two Smoke Copilot failures both involved the agent building the gh-aw binary and attempting to upload it via upload_artifact. The staging mechanism (safe-outputs-upload-artifacts) was absent from both runs. The agent job completed successfully (agent: success) and the binary was built (the smoke test report confirms "Build gh-aw ✅"). The staging upload is likely a separate step that should run after the agent to pre-stage binaries for the safe_outputs job, but it is not running. This may be a missing step in the Smoke Copilot workflow configuration.

Logic/Context Issues
update_issue context resolution bug: The handler rejects update_issue with an explicit issue_number when the event context is not an issue. For scheduled workflows that maintain a persistent "dashboard" issue (like Workflow Health Manager), this is the standard usage pattern. The handler should allow update_issue when an explicit numeric issue_number is provided, regardless of the trigger event. This is a recurring issue (5 total occurrences across March–April).

API Permission Issues
resolve_pull_request_review_thread permission: The resolveReviewThread GraphQL mutation requires specific GitHub App scopes. The integration token used in the safe_outputs job apparently lacks this permission. This safe output type may have been added to the Smoke Claude test coverage recently without verifying that the token supports it.

Recommendations
Critical Issues (Immediate Action Required)
Fix upload_artifact staging pipeline in Smoke Copilot
[...] staging artifact (safe-outputs-upload-artifacts) is produced before safe_outputs runs
[...] gh-aw binary for upload. If missing, add a step to upload the binary to the safe-outputs-upload-artifacts artifact before the safe_outputs job runs.

Fix update_issue explicit-number context handling
The handler interprets update_issue calls as requiring triggering-event context, even when an explicit issue_number is provided
In the update_issue handler, when issue_number is a specific number (not "triggering"), skip the event context check and operate on that number directly.

Bug Fixes Required
resolve_pull_request_review_thread graceful degradation
[...] resolve_pull_request_review_thread safe output handler
[...] "Resource not accessible by integration" error and degrade to a warning instead of a hard failure. Log the limitation and continue processing remaining messages.
[...] resolve_pull_request_review_thread

Process Improvements
Smoke test coverage validation: Before adding a new safe output type to smoke tests, verify the token/app in the safe_outputs job supports the required API permission. Add a permission check step to smoke test setup.
Staging artifact documentation: Document the upload_artifact staging mechanism: which step produces safe-outputs-upload-artifacts, when it runs, and what is required. Add a pre-check in the safe_outputs job that warns clearly when the staging artifact is missing.

Work Item Plans
Work Item 1: Fix upload_artifact staging in Smoke Copilot
Smoke Copilot builds the gh-aw binary during the agent job and requests an upload_artifact safe output. However, the staging artifact (safe-outputs-upload-artifacts) is never produced, causing the safe_outputs job to fail for upload_artifact.
upload_artifact succeeds in next scheduled run
[...] success for Smoke Copilot
[...] gh-aw binary to the safe-outputs-upload-artifacts artifact. If absent, add it. If present, investigate why it's not running.

Work Item 2: Fix update_issue with explicit issue_number on scheduled workflows
The update_issue handler rejects calls where issue_number is an explicit number (e.g., "25470") when the workflow trigger is a schedule event, because it interprets all update_issue calls as requiring triggering-event context. This prevents the Workflow Health Manager from updating its dashboard issue.
When issue_number is a specific number (not "triggering"), the handler resolves directly to that issue
update_issue with target: "triggering" on schedule-triggered runs still correctly fails with a warning
In the update_issue handler, add a check: if issue_number is a numeric ID (not the literal string "triggering"), bypass the event context requirement and operate on the explicit number.

Work Item 3: Graceful degradation for resolve_pull_request_review_thread permission errors
When resolve_pull_request_review_thread receives a "Resource not accessible by integration" API error, it causes a hard failure of the entire safe_outputs job. Letting this single operation block all other safe outputs is disproportionate.
"Resource not accessible by integration" errors in resolve_pull_request_review_thread are treated as warnings, not errors
[...] ("Resource not accessible by integration"), emit a ##[warning] and continue instead of failing.

Historical Context
Trend: Safe Output Failure Rate (Last 12 Days)
Trends:
update_issue/add_comment on non-issue-context runs (5 total occurrences across Mar 29–Apr 10)
upload_artifact staging missing (2×), resolve_pull_request_review_thread permission error (1×)
The push_to_pull_request_branch file-outside-allowed-list pattern from Apr 1–2 has not recurred (Smoke Claude appears to have fixed its file naming)

Metrics and KPIs
add_comment, add_labels, submit_pull_request_review, create_discussion, create_issue: all 100% success today
upload_artifact: 0/2 (0% success today)

Next Steps
[...] upload_artifact workflow
[...] update_issue explicit issue_number handler to bypass event context check
[...] resolve_pull_request_review_thread for permission errors
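The update_issue change described in Work Item 2 can be sketched as a small target-resolution function. This is a minimal illustration, not the actual handler: `resolve_issue_target`, `parse_explicit_number`, `TargetResolutionError`, and the `event_issue_number` parameter are hypothetical names, and treating a missing issue_number the same as "triggering" is an assumption. An explicit number bypasses the event context check, while "triggering" still requires an issue in the triggering event.

```python
from typing import Optional, Union


class TargetResolutionError(Exception):
    """Raised when no issue can be resolved for an update_issue call."""


def parse_explicit_number(value: Union[int, str, None]) -> Optional[int]:
    """Return a positive issue number if value is explicit, else None."""
    if isinstance(value, bool):  # bool is a subclass of int; never a valid number
        return None
    if isinstance(value, int):
        return value if value > 0 else None
    if isinstance(value, str):
        text = value.strip()
        if text.isascii() and text.isdigit() and int(text) > 0:
            return int(text)
    return None


def resolve_issue_target(
    issue_number: Union[int, str, None],
    event_issue_number: Optional[int],
) -> int:
    """Pick the issue to update.

    event_issue_number is the issue from the triggering event, or None when
    the run was started by a schedule or another non-issue event.
    """
    explicit = parse_explicit_number(issue_number)
    if explicit is not None:
        # Explicit number: no event context required.
        return explicit
    if issue_number in (None, "triggering"):
        if event_issue_number is None:
            raise TargetResolutionError(
                "issue_number is 'triggering' but the event has no issue context"
            )
        return event_issue_number
    raise TargetResolutionError(f"unsupported issue_number: {issue_number!r}")


# A scheduled run updating a dashboard issue by explicit number.
print(resolve_issue_target("25470", event_issue_number=None))
```

A caller could catch `TargetResolutionError` and log it as a warning, which keeps the existing behavior for "triggering" targets on schedule-triggered runs.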