Commit e948580

docs(examples): add CI-gate reference workflow (#39) (#83)
* docs(examples): add CI-gate reference workflow

  Add `examples/ci/evaluate_thresholds.yml`, a drop-in GitHub Actions workflow that gates every PR against the last 24 hours of production traces using four deterministic budgets (latency, token usage, tool error rate, turn count). Each gate runs as its own step, so a red PR status tells you which budget regressed.

  Also adds `examples/ci/README.md` with a quick-start checklist (copy the file, set two vars + one secret, swap four --agent-id values, tune four --threshold numbers) and a note on why the workflow pins `bigquery-agent-analytics>=0.2.2`.

  Companion to the Medium post "Your Agent Events Table Is Also a Test Suite." Readers who go from the post straight to the SDK repo now land on an authoritative copy of the workflow without having to chase a Gist.

  Ref: issue #77.

* examples/ci: lower Token budget to 5000 + threshold-tuning comment

  Ship a smaller, reproducible demo number (matches the Calendar-Assistant demo in the companion blog post) with an inline comment telling readers to tune against their own --last=30d distribution. Production agents with longer prompts and multi-turn tool chains will want tens of thousands; a two-sentence prompt like the demo lands in the low thousands.

(cherry picked from commit c625f2a)
1 parent 60e832d commit e948580

2 files changed

Lines changed: 120 additions & 0 deletions

examples/ci/README.md

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
# `examples/ci/`

Reference CI artifacts for agent quality gates backed by
BigQuery Agent Analytics.

## `evaluate_thresholds.yml`

Drop-in GitHub Actions workflow that runs four deterministic
budgets (latency, token usage, tool error rate, turn count) on
every PR, scoring the last 24 hours of production traces from an
`agent_events` BigQuery table. Exits non-zero when any session
breaches its budget, so a bad merge lights up the PR status
before code ships.
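
The gate behaviour described above (fail the run as soon as any session is over budget) can be sketched in a few lines. This is an illustration only, not the SDK's implementation: `gate_exit_code` and the sample values are hypothetical, the latency unit is assumed to be milliseconds, and the sketch assumes a session exactly at the budget still passes, which the README does not specify.

```python
def gate_exit_code(observed_by_session: dict[str, float], budget: float) -> int:
    """Return 0 if every session is within budget, 1 if any session breaches it."""
    # Assumption: a value equal to the budget counts as a pass.
    breaches = {sid: v for sid, v in observed_by_session.items() if v > budget}
    for sid, value in sorted(breaches.items()):
        print(f"FAIL {sid}: observed={value} budget={budget}")
    return 1 if breaches else 0


# Hypothetical per-session latencies (assumed milliseconds).
latencies_ms = {"s1": 1200.0, "s2": 4800.0, "s3": 5600.0}
exit_code = gate_exit_code(latencies_ms, budget=5000)
```

A CI step that ends with a non-zero exit code is marked failed, which is what turns the PR status red.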

See the companion Medium post, *Your Agent Events Table Is Also a
Test Suite*, for the narrative, threshold-setting guidance, and
the companion categorical-eval gate that pairs naturally with
this workflow.

### Quick start

1. Copy `evaluate_thresholds.yml` to `.github/workflows/` in
   your agent repo.
2. Set repository variables `PROJECT_ID` and `DATASET_ID` to the
   GCP project + BigQuery dataset where your `agent_events` table
   lives.
3. Set the repository secret `GCP_SA_KEY` to a service-account JSON
   with `bigquery.jobUser` + `bigquery.dataViewer` on the dataset.
4. Replace `calendar_assistant` with your agent's name in all four
   `--agent-id` flags inside the workflow.
5. Tune the four `--threshold` numbers against your own production
   distribution. A defensible starting point for each is "p95 of
   the last 30 days + 10% buffer"; revisit after week one of CI
   gating.
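
The "p95 + 10% buffer" starting point from step 5 is simple arithmetic once the per-session values are exported. The sketch below is a minimal illustration, assuming the 30-day observations are already in a Python list; `p95_with_buffer` is a hypothetical helper, not part of the SDK, and it uses the nearest-rank percentile definition (other definitions interpolate and give slightly different numbers).

```python
def p95_with_buffer(values: list[float], buffer: float = 0.10) -> float:
    """Nearest-rank 95th percentile of `values`, scaled up by `buffer`."""
    if not values:
        raise ValueError("need at least one observation")
    ordered = sorted(values)
    n = len(ordered)
    rank = (95 * n + 99) // 100  # ceil(0.95 * n) in integer arithmetic, 1-based
    return ordered[rank - 1] * (1.0 + buffer)


# Hypothetical per-session token totals from a 30-day window: 100, 200, ..., 2000.
tokens = [float(v) for v in range(100, 2100, 100)]
threshold = p95_with_buffer(tokens)
print(round(threshold))
```

Rounding the result up to a convenient number before putting it in `--threshold` keeps the workflow file readable.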

### Requirements

- `bigquery-agent-analytics >= 0.2.2`: earlier releases shipped
  normalized `1.0 - observed/budget` gate scoring with a `0.5`
  pass cutoff, which fires every gate at roughly half the budget
  the user typed. 0.2.2 switched to raw-budget binary gates so
  the `--threshold` value means what it says.
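
To see why the older scoring fired at half the budget: `1.0 - observed/budget >= 0.5` rearranges to `observed <= 0.5 * budget`. The sketch below works through that arithmetic. Both functions are hypothetical reconstructions from the description above, not SDK code, and the handling of values exactly at the cutoff or budget is an assumption.

```python
def normalized_gate_passes(observed: float, budget: float, cutoff: float = 0.5) -> bool:
    """Older scoring as described above: score = 1 - observed/budget, pass at cutoff."""
    score = 1.0 - observed / budget
    return score >= cutoff  # assumption: a score equal to the cutoff passes


def raw_gate_passes(observed: float, budget: float) -> bool:
    """Raw-budget binary gate: pass while the observed value is within the budget."""
    return observed <= budget  # assumption: a value equal to the budget passes


budget = 5000.0
for observed in (2000.0, 3000.0, 4900.0):
    print(observed,
          normalized_gate_passes(observed, budget),
          raw_gate_passes(observed, budget))
```

With a budget of 5000, an observed value of 3000 scores 0.4 under the normalized rule and fails, even though it is well inside the budget the user typed; the raw gate passes it.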
examples/ci/evaluate_thresholds.yml

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
# .github/workflows/evaluate_thresholds.yml
#
# Reference GitHub Actions workflow that gates every PR against the
# last 24 hours of production traces stored in an `agent_events`
# BigQuery table. Four deterministic budgets run as separate steps
# so a red PR status tells you which gate regressed.
#
# Companion to the Medium post "Your Agent Events Table Is Also a
# Test Suite." See the post for the narrative and for the sidebar
# on picking initial threshold values from 30-day production data.
#
# Requires bigquery-agent-analytics >= 0.2.2, the first release
# with the raw-budget `--threshold` semantics and the tight
# `--exit-code` failure output this workflow depends on.
#
# To adopt this workflow in your own agent repo:
#   1. Copy this file to .github/workflows/evaluate_thresholds.yml.
#   2. Set repo variables PROJECT_ID and DATASET_ID to the GCP
#      project + BigQuery dataset where your agent_events table
#      lives.
#   3. Set the repo secret GCP_SA_KEY to a service account JSON
#      with bigquery.jobUser + bigquery.dataViewer on the dataset.
#   4. Replace `calendar_assistant` with your agent's name in all
#      four --agent-id flags.
#   5. Tune the four --threshold numbers against your own
#      production distribution. A defensible starting point for
#      each is "p95 of last 30 days + 10% buffer"; revisit after
#      week one of CI gating.

name: Agent quality gate

on:
  pull_request:
    paths:
      - 'agents/**'
      - 'prompts/**'

jobs:
  gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }
      - run: pip install 'bigquery-agent-analytics>=0.2.2,<0.3.0'
      - uses: google-github-actions/auth@v2
        with: { credentials_json: '${{ secrets.GCP_SA_KEY }}' }
      - name: Latency budget
        run: >
          bq-agent-sdk evaluate --evaluator=latency --threshold=5000
          --last=24h --agent-id=calendar_assistant --exit-code
          --project-id=${{ vars.PROJECT_ID }}
          --dataset-id=${{ vars.DATASET_ID }}
      - name: Token budget
        # Tune this to your agent's real token distribution. A short
        # system prompt + few-turn sessions will land in the low
        # thousands; production agents with longer instructions and
        # multi-turn tool chains typically want tens of thousands.
        # Run `bq-agent-sdk evaluate --evaluator=token_efficiency
        # --last=30d` without `--exit-code` once to see your own
        # baseline before picking a number.
        run: >
          bq-agent-sdk evaluate --evaluator=token_efficiency --threshold=5000
          --last=24h --agent-id=calendar_assistant --exit-code
          --project-id=${{ vars.PROJECT_ID }}
          --dataset-id=${{ vars.DATASET_ID }}
      - name: Tool error rate
        run: >
          bq-agent-sdk evaluate --evaluator=error_rate --threshold=0.1
          --last=24h --agent-id=calendar_assistant --exit-code
          --project-id=${{ vars.PROJECT_ID }}
          --dataset-id=${{ vars.DATASET_ID }}
      - name: Turn count
        run: >
          bq-agent-sdk evaluate --evaluator=turn_count --threshold=10
          --last=24h --agent-id=calendar_assistant --exit-code
          --project-id=${{ vars.PROJECT_ID }}
          --dataset-id=${{ vars.DATASET_ID }}
