ci: shard + cache for faster runs #1199

Open
Anmol1696 wants to merge 3 commits into main from ci/speedup

@Anmol1696

Summary

Three independent improvements to run-tests.yaml (and examples-integration.yaml) aimed at reducing wall-clock CI time and eliminating the 9-23 minute outlier runs.

Investigation findings

Real step-timing extracted from the three slowest recent runs:

| Run | Wall-clock | Slowest job | Root cause |
| --- | --- | --- | --- |
| 26133536845 | 14m17s | pg-tests/graphile-bulk-mutations 9m34s | Setup Node.js=254s + Post Setup Node.js=264s |
| 26138817027 | 7m53s | build 6m28s | Prior run's cache upload failed → cold pnpm install |
| 26134079650 | 10m40s | normal spread | cache service pressure from 27 parallel saves |

Root cause: cache: 'pnpm' on actions/setup-node@v4 in every test job causes all 27 parallel fan-out jobs to simultaneously (a) download the large pnpm store from the cache service, and (b) attempt to save it back after the job. One job in run 26133536845 spent 518 s (8.6 min) purely on cache I/O.
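For context, the pattern described above would look roughly like the following in each fan-out job. This is a sketch reconstructed from the description, not copied from the repository: the job name is taken from the timing table, while the runner, the pnpm and Node.js versions, and the install command are placeholders.

```yaml
# Sketch of the pre-change pattern (assumed layout, not the actual workflow file).
jobs:
  pg-tests:
    runs-on: ubuntu-latest          # assumption
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v2
        with:
          version: 9                # placeholder version
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: 20          # placeholder version
          # Restores the pnpm store during setup; the matching
          # "Post Setup Node.js" step may try to save it back.
          cache: 'pnpm'
      - run: pnpm install --frozen-lockfile   # assumed install command
```

With 27 jobs running this same block in parallel, every job hits the cache service at the start of the run and again at the end, which matches the 254 s and 264 s step timings reported above.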


Change 1 — pnpm/action-setup v2 → v4 (both workflows)

Why: pnpm/action-setup@v2 runs on Node.js 20, which GitHub is deprecating June 2, 2026 (< 2 weeks away). All 30 CI jobs show the deprecation warning today.
Expected improvement: Zero breakage on June 2; deprecation warnings disappear immediately.
Risk: None — drop-in replacement, same inputs.
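As a rough illustration, the bump amounts to a one-line change per job. The step name and the `version` value below are placeholders; the workflow should keep whatever it already pins.

```yaml
# Sketch only: step name and version are assumptions.
- name: Install pnpm
  uses: pnpm/action-setup@v4        # previously pnpm/action-setup@v2
  with:
    version: 9                      # placeholder; keep the existing pin
```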


Change 2 — Build saves pnpm store once; fan-out jobs restore-only

What changed:

  • Build job: actions/cache@v4 with explicit key ${{ runner.os }}-pnpm-store-${{ hashFiles('**/pnpm-lock.yaml') }} and save-always: true. Guarantees a write even if a prior run's upload was interrupted (the case that caused the 6m28s build).
  • unit-tests, pg-tests, integration-tests: actions/cache/restore@v4 (a restore-only action with no post-job save step). Eliminates the Post Setup Node.js save race across all 27 jobs.

Measured savings:

  • pg-tests/graphile-bulk-mutations: Post Setup Node.js 264 s → 0 s (-4m24s per occurrence)
  • integration-tests/graphql-server-test: Post Setup Node.js eliminated
  • Build cache now reliable: cold build 6m28s → ~2-3m on subsequent runs after lockfile change

Expected p99 improvement: ~14m → ~7m (eliminates the cache-save race outlier entirely).
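The save-once, restore-only split described in this change could be sketched as follows. The cache key, `save-always: true`, the job names, and the "Set up pnpm store cache" step name come from this description. Everything else is an assumption for illustration: the runner, the versions, the step that resolves the store path, and the install command. It is also worth checking the current actions/cache documentation for the status of the `save-always` input before relying on it.

```yaml
# Sketch under stated assumptions; not the actual run-tests.yaml.
jobs:
  build:
    runs-on: ubuntu-latest                    # assumption
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
        with:
          version: 9                          # placeholder version
      - uses: actions/setup-node@v4
        with:
          node-version: 20                    # placeholder; note: no `cache: 'pnpm'`
      - name: Get pnpm store path             # hypothetical helper step
        id: pnpm-store
        shell: bash
        run: echo "path=$(pnpm store path --silent)" >> "$GITHUB_OUTPUT"
      - name: Set up pnpm store cache
        uses: actions/cache@v4                # restores now, saves in its post step
        with:
          path: ${{ steps.pnpm-store.outputs.path }}
          key: ${{ runner.os }}-pnpm-store-${{ hashFiles('**/pnpm-lock.yaml') }}
          save-always: true
      - run: pnpm install --frozen-lockfile   # assumed install command

  unit-tests:                                 # same pattern for pg-tests, integration-tests
    needs: build
    runs-on: ubuntu-latest                    # assumption
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
        with:
          version: 9                          # placeholder version
      - uses: actions/setup-node@v4
        with:
          node-version: 20                    # placeholder
      - name: Get pnpm store path             # hypothetical helper step
        id: pnpm-store
        shell: bash
        run: echo "path=$(pnpm store path --silent)" >> "$GITHUB_OUTPUT"
      - name: Restore pnpm store cache
        uses: actions/cache/restore@v4        # restore only; no post-job save
        with:
          path: ${{ steps.pnpm-store.outputs.path }}
          key: ${{ runner.os }}-pnpm-store-${{ hashFiles('**/pnpm-lock.yaml') }}
      - run: pnpm install --frozen-lockfile   # assumed install command
```

Because the fan-out jobs use the same key as the build job, they only read the entry the build job wrote, so a single job is responsible for uploads.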


Change 3 — paths-ignore for documentation-only commits

What changed: Added paths-ignore on both push and pull_request triggers to skip the 30-job matrix when only **.md, docs/**, or GitHub template files change.
Why: Docs-only PRs (e.g., schema update PRs) currently spin up the full test matrix for no reason.
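Note: GitHub's documentation states that when a workflow is skipped by path filtering, its checks stay in a "Pending" state rather than reporting success, so if any job in this workflow is a required status check, docs-only PRs will need separate handling to stay mergeable. workflow_dispatch and workflow_call are unaffected and always run.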
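A trigger block along these lines would implement the filter. The `**.md` and `docs/**` patterns are taken from the description above; the template path is a guess at what "GitHub template files" refers to, and any branch filters the real workflow has are omitted.

```yaml
# Sketch only: the template path is an assumption.
on:
  push:
    paths-ignore:
      - '**.md'
      - 'docs/**'
      - '.github/ISSUE_TEMPLATE/**'   # assumed template location
  pull_request:
    paths-ignore:
      - '**.md'
      - 'docs/**'
      - '.github/ISSUE_TEMPLATE/**'   # assumed template location
  workflow_dispatch:                  # no path filter; manual runs always execute
  workflow_call:                      # no path filter; callers always execute
```

A workflow is skipped only when every changed file matches one of the ignored patterns, so a commit that touches both a README and a source file still runs the full matrix.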


Test plan

  • Verify CI run on this PR completes without the Post Setup Node.js save step in any test job
  • Confirm Set up pnpm store cache step appears only in the build job
  • Confirm no pnpm/action-setup@v2 deprecation warnings in annotations
  • Push a docs-only commit to a branch and verify CI tests is skipped (shown as ✓ Skipped, not ❌ missing)

🤖 Generated with Claude Code

Anmol1696 and others added 3 commits May 20, 2026 08:00
The pnpm/action-setup@v2 action runs on Node.js 20, which GitHub is
deprecating on June 2, 2026. Updating to v4 ensures continued
compatibility and silences the deprecation warnings that appear in
every CI job today.

Affects: build, unit-tests, pg-tests, integration-tests in
run-tests.yaml and the examples-integration.yaml workflow.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Root cause from job timings (run 26133536845):
  - pg-tests/graphile-bulk-mutations: Setup Node.js=254s + Post Setup=264s
  - integration-tests/graphql-server-test: Setup Node.js=161s
  27 parallel jobs all downloading the pnpm store simultaneously
  saturates the Actions cache service; one job also raced to save it.

Changes:
  - Build job uses actions/cache@v4 with save-always: true on an
    explicit key (runner.os + pnpm-lock.yaml hash). Guaranteed save
    after every successful build, even if a prior run's upload failed.
  - unit-tests, pg-tests, integration-tests use actions/cache/restore@v4
    (restore-only, no post-job save). Eliminates the 264s save step and
    the concurrent-save race entirely.
  - setup-node cache: 'pnpm' removed from all jobs; replaced by the
    explicit cache actions above.

Expected improvement: p99 job time drops from ~575s to ~310s for the
worst-case cache-contention jobs; saves ~264s on every fan-out job
that previously raced to write back the pnpm store.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add paths-ignore to push and pull_request triggers so that commits
touching only markdown files, docs/, or GitHub template directories
don't trigger the full 30-job test matrix.

Note that GitHub's documentation says checks from a workflow skipped by
path filtering stay "Pending", so required status checks need separate handling.
workflow_dispatch and workflow_call triggers are untouched — manual
runs and cross-workflow calls always execute unconditionally.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@socket-security

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License
Added | pgpm@​1.4.27510010098100

