docs(examples): add CI-gate reference workflow (#39) (#83)

caohy1988 · web-flow · commit e9485808a8d0 · 2026-04-25T10:02:20.000-07:00
* docs(examples): add CI-gate reference workflow

  Add `examples/ci/evaluate_thresholds.yml`, a drop-in GitHub Actions
  workflow that gates every PR against the last 24 hours of production
  traces using four deterministic budgets (latency, token usage, tool
  error rate, turn count). Each gate runs as its own step, so a red PR
  status tells you which budget regressed.

  Also adds `examples/ci/README.md` with a quick-start checklist (copy
  the file, set two vars + one secret, swap four --agent-id values, tune
  four --threshold numbers) and a note on why the workflow pins
  `bigquery-agent-analytics>=0.2.2`.

  Companion to the Medium post "Your Agent Events Table Is Also a Test
  Suite." Readers who go from the post straight to the SDK repo now land
  on an authoritative copy of the workflow without having to chase a
  Gist.

  Ref: issue #77.

* examples/ci: lower Token budget to 5000 + threshold-tuning comment

  Ship a smaller, reproducible demo number (matches the
  Calendar-Assistant demo in the companion blog post) with an inline
  comment telling readers to tune against their own --last=30d
  distribution. Production agents with longer prompts and multi-turn
  tool chains will want tens of thousands; a two-sentence prompt like
  the demo lands in the low thousands.

  (cherry picked from commit c625f2a)
diff --git a/examples/ci/README.md b/examples/ci/README.md
@@ -0,0 +1,42 @@
+# `examples/ci/`
+
+Reference CI artifacts for agent quality gates backed by
+BigQuery Agent Analytics.
+
+## `evaluate_thresholds.yml`
+
+Drop-in GitHub Actions workflow that runs four deterministic
+budgets (latency, token usage, tool error rate, turn count) on
+every PR, scoring the last 24 hours of production traces from an
+`agent_events` BigQuery table. Each step exits non-zero when any
+session breaches that step's budget, so a regression shows up as a
+red PR status before code ships.
+
+See the companion Medium post, *Your Agent Events Table Is Also a
+Test Suite*, for the narrative, threshold-setting guidance, and
+the companion categorical-eval gate that pairs naturally with
+this workflow.
+
+### Quick start
+
+1. Copy `evaluate_thresholds.yml` to `.github/workflows/` in
+   your agent repo.
+2. Set repository variables `PROJECT_ID` and `DATASET_ID` to the
+   GCP project + BigQuery dataset where your `agent_events` table
+   lives.
+3. Set the repository secret `GCP_SA_KEY` to a service-account JSON key with
+   `roles/bigquery.jobUser` on the project and `roles/bigquery.dataViewer` on the dataset.
+4. Replace `calendar_assistant` with your agent's name in all four
+   `--agent-id` flags inside the workflow.
+5. Tune the four `--threshold` numbers against your own production
+   distribution. A defensible starting point for each is "p95 of
+   the last 30 days + 10% buffer"; revisit after week one of CI
+   gating.
+
+### Requirements
+
+- `bigquery-agent-analytics >= 0.2.2` — earlier releases shipped
+  normalized `1.0 - observed/budget` gate scoring with a `0.5`
+  pass cutoff, which fires every gate at roughly half the budget
+  the user typed. 0.2.2 switched to raw-budget binary gates so
+  the `--threshold` value means what it says.
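
Note on the Requirements entry above: the "half the budget" effect follows directly from the arithmetic. The sketch below is only an illustration of the two scoring rules as the README describes them, not the SDK's actual code. The function names are hypothetical, and two details are assumptions: that the older rule passed when `score >= 0.5`, and that the raw-budget rule treats `observed == budget` as a pass.

```python
def normalized_gate_passes(observed: float, budget: float, cutoff: float = 0.5) -> bool:
    """Older rule as described in the README: score = 1.0 - observed/budget.

    Assumption: a session passes when score >= cutoff.
    """
    score = 1.0 - observed / budget
    return score >= cutoff


def effective_budget(budget: float, cutoff: float = 0.5) -> float:
    """Largest observed value the normalized rule still accepts.

    Solving 1.0 - observed/budget >= cutoff gives observed <= budget * (1 - cutoff).
    """
    return budget * (1.0 - cutoff)


def raw_budget_gate_passes(observed: float, budget: float) -> bool:
    """Raw-budget binary rule: compare the observed value to the budget directly.

    Assumption: the comparison is inclusive.
    """
    return observed <= budget
```

With `--threshold=5000`, a 3000 ms session fails the normalized rule (score 0.4) but passes the raw-budget rule, because the normalized rule only accepts values up to `effective_budget(5000)`, which is 2500.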
diff --git a/examples/ci/evaluate_thresholds.yml b/examples/ci/evaluate_thresholds.yml
@@ -0,0 +1,78 @@
+# .github/workflows/evaluate_thresholds.yml
+#
+# Reference GitHub Actions workflow that gates every PR against the
+# last 24 hours of production traces stored in an `agent_events`
+# BigQuery table. Four deterministic budgets run as separate steps
+# so a red PR status tells you which gate regressed.
+#
+# Companion to the Medium post "Your Agent Events Table Is Also a
+# Test Suite." See the post for the narrative and for the sidebar
+# on picking initial threshold values from 30-day production data.
+#
+# Requires bigquery-agent-analytics >= 0.2.2 — the first release
+# with the raw-budget `--threshold` semantics and the tight
+# `--exit-code` failure output this workflow depends on.
+#
+# To adopt this workflow in your own agent repo:
+#   1. Copy this file to .github/workflows/evaluate_thresholds.yml.
+#   2. Set repo variables PROJECT_ID and DATASET_ID to the GCP
+#      project + BigQuery dataset where your agent_events table
+#      lives.
+#   3. Set the repo secret GCP_SA_KEY to a service-account JSON key with
+#      roles/bigquery.jobUser (project) + roles/bigquery.dataViewer (dataset).
+#   4. Replace `calendar_assistant` with your agent's name in all
+#      four --agent-id flags.
+#   5. Tune the four --threshold numbers against your own
+#      production distribution. A defensible starting point for
+#      each is "p95 of last 30 days + 10% buffer"; revisit after
+#      week one of CI gating.
+
+name: Agent quality gate
+
+on:
+  pull_request:
+    paths:
+      - 'agents/**'
+      - 'prompts/**'
+
+jobs:
+  gate:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with: { python-version: '3.12' }
+      - run: pip install 'bigquery-agent-analytics>=0.2.2,<0.3.0'
+      - uses: google-github-actions/auth@v2
+        with: { credentials_json: '${{ secrets.GCP_SA_KEY }}' }
+      - name: Latency budget
+        run: >
+          bq-agent-sdk evaluate --evaluator=latency --threshold=5000
+          --last=24h --agent-id=calendar_assistant --exit-code
+          --project-id=${{ vars.PROJECT_ID }}
+          --dataset-id=${{ vars.DATASET_ID }}
+      - name: Token budget
+        # Tune this to your agent's real token distribution. A short
+        # system prompt + few-turn sessions will land in the low
+        # thousands; production agents with longer instructions and
+        # multi-turn tool chains typically want tens of thousands.
+        # Run `bq-agent-sdk evaluate --evaluator=token_efficiency
+        # --last=30d` without `--exit-code` once to see your own
+        # baseline before picking a number.
+        run: >
+          bq-agent-sdk evaluate --evaluator=token_efficiency --threshold=5000
+          --last=24h --agent-id=calendar_assistant --exit-code
+          --project-id=${{ vars.PROJECT_ID }}
+          --dataset-id=${{ vars.DATASET_ID }}
+      - name: Tool error rate
+        run: >
+          bq-agent-sdk evaluate --evaluator=error_rate --threshold=0.1
+          --last=24h --agent-id=calendar_assistant --exit-code
+          --project-id=${{ vars.PROJECT_ID }}
+          --dataset-id=${{ vars.DATASET_ID }}
+      - name: Turn count
+        run: >
+          bq-agent-sdk evaluate --evaluator=turn_count --threshold=10
+          --last=24h --agent-id=calendar_assistant --exit-code
+          --project-id=${{ vars.PROJECT_ID }}
+          --dataset-id=${{ vars.DATASET_ID }}
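
Note on threshold tuning: the header comment's "p95 of last 30 days + 10% buffer" starting point can be worked out offline once you have per-session values for a metric. The sketch below is one possible way to do that, under stated assumptions: the helper names are hypothetical, the input is a plain list of numbers you have exported yourself (the SDK's output format is not assumed here), and the nearest-rank percentile is a choice made for this example, since the workflow does not say which percentile method to use.

```python
from typing import Iterable


def p95_nearest_rank(values: Iterable[float]) -> float:
    """Nearest-rank 95th percentile: the smallest value with at least 95% of the data at or below it."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("need at least one observation")
    # ceil(0.95 * n) computed in integer arithmetic to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def suggested_threshold(values: Iterable[float], buffer: float = 0.10) -> float:
    """p95 plus a proportional buffer (10% by default), as a starting --threshold value."""
    return p95_nearest_rank(values) * (1.0 + buffer)
```

For example, with 100 sessions whose latencies are 1..100 ms, the nearest-rank p95 is 95 ms and the suggested starting threshold is about 104.5 ms. Treat the result as a first guess and revisit it after a week of gating, as the comment recommends.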