Commit 1185e9b

feat: Add agent improvement cycle demo (#61)
* feat: Add agent improvement cycle demo and quality report enhancements
  Add --app-name and --output-json flags to quality_report.py for filtering sessions by agent and producing structured JSON output. Add a complete demo showing the eval -> analyze -> improve cycle with a company info agent that starts with intentional prompt flaws and self-corrects across multiple cycles using LLM-as-a-judge quality evaluation.

* fix: Vertex AI auth, prompt sanitization, and orchestrator resilience
  Configure improver to use Vertex AI (matching the agent), sanitize Gemini-generated summaries to prevent Python syntax errors from multiline comments, add retry on syntax errors, and reduce quality report limit to avoid stale session accumulation across cycles.

* Redesign v1 prompt for dramatic quality gap, rewrite README
  Make v1 actively discourage tool use ("answer from knowledge above, deflect unknowns to HR") so 7/10 eval cases reliably fail. Redesign eval cases to cover expenses, benefits, holidays, and date handling. Rewrite README to emphasize that the cycle learns from real production sessions logged via BigQueryAgentAnalyticsPlugin, not just static evals, and that new eval cases are derived from actual field failures.

* Improve README clarity, rename env vars, add quickstart examples

* Fix path resolution for standalone SDK checkout
  REPO_ROOT was 3 levels up (../../..) which assumed the demo lived inside agent-operations/src/. In a standalone SDK checkout the demo is at examples/agent_improvement_cycle/, so 2 levels up is correct. Also fixes .env path in agent.py and improve_agent.py and the quality_report.py path in run_cycle.sh.

* Use local .env inside demo directory instead of SDK root

* Remove google-adk[bigquery] extra that does not exist in ADK 1.31+

* Enable required APIs in setup, document IAM roles

* Suppress pip script-location warnings in setup

* Fix .env: add PROJECT_ID and DATASET_LOCATION, source in run_cycle.sh

* Explain how eval runs locally via ADK InMemoryRunner

* Use PROJECT_ID consistently, drop GOOGLE_CLOUD_PROJECT from .env

* Add --threshold flag for configurable unhelpful rate warning

* Address PR review: add LLM prompt validation, schema checks, retry logic
  - Fix --output-json stdout corruption by writing status line to stderr
  - Add --output-json help note clarifying it ignores --samples cap
  - Add --app-name help note about root_agent_name requirement
  - Add LLM-based prompt validation via second Gemini call comparing original vs improved prompt for content preservation and coherence
  - Add eval case schema validation (required: id, question, category, expected_tool) with skip-and-log on malformed cases
  - Replace fixed sleep 5 with retry loop (up to 6 attempts with backoff) for BigQuery write propagation
  - Remove unused UUID generation in run_eval.py
  - Soften reproducibility claim to acknowledge LLM non-determinism
  - Add type annotations to all functions in improve_agent.py and run_eval.py
  - Update README with Guardrails section and expanded Step 3 docs
  - Update DEMO_SCRIPT.md with validation talking points
  - Update run_cycle.sh comments to document validation steps

* Replace LLM validation with golden eval gate and synthetic traffic generation
  Replace the LLM-based prompt sanity check with a behavioral regression gate: candidate prompts are tested against the full golden eval set using a throwaway agent before being accepted. Add Gemini-powered synthetic traffic generation (generate_traffic.py) that produces diverse user questions each cycle, distinct from the golden set. Failed synthetic cases are extracted and added to the golden eval set so regressions are caught in future cycles. Update run_cycle.sh to a 4-step flow (generate → run → evaluate → improve), and rewrite README.md and DEMO_SCRIPT.md accordingly.

* Add --golden eval mode, fix after-improvement measurement, trim golden set to 3
  - Add --golden flag to run_eval.py: runs eval cases through a throwaway agent with LLM judge (no BQ logging) for immediate pass/fail scoring
  - Fix step 5 measurement: use --golden instead of BQ query to avoid evaluating stale sessions from previous runs
  - Re-run same synthetic traffic (not new) for fair before/after comparison
  - Show golden eval set growth after improvement step
  - Trim golden set to 3 V1-passing cases; failures discovered by cycle
  - Default traffic count: 15 -> 10
  - Remove response truncation in eval output

* Fix reset mechanism, add fresh-traffic measurement, tune demo output
  - Replace git-checkout reset with baseline file copies (prompts_v1.py, eval_cases_v1.json) since src/ is gitignored by the parent repo
  - Fix run_eval.py --golden to exit non-zero when cases fail
  - Replace circular same-traffic re-run in Step 5 with fresh synthetic traffic generation for honest generalization testing
  - Start golden eval set at 3 V1-passing cases (grows organically)
  - Tune traffic generation prompt with exact tool data to produce answerable questions
  - Add pre-flight golden eval check to run_cycle.sh
  - Update DEMO_SCRIPT timings from real test runs (~5 min single cycle)

* Fix pre-flight dead code, Step 5 exit-on-fail, and results box alignment
  - Pre-flight: use set +e/set -e to capture exit code (set -e made PREFLIGHT_EXIT always 0, rendering the check dead code)
  - Step 5: wrap eval in set +e so failures don't kill the script (failures here are the "after" score, not errors)
  - Results box: use dynamic width and int-format rates for proper border alignment
  - Clarify --golden flag help text: it's LLM judge mode that works with any eval cases via --eval-cases

* Address PR review: stale files, 0-session guard, graceful degradation
  Review feedback from caohy1988:
  - Delete stale report JSON before retry loop so previous runs can't satisfy the file-existence check (run_cycle.sh Steps 3 and 5)
  - Guard against 0-session quality reports in improve_agent.py to prevent prompt rewrites based on no signal
  - Align setup.sh env vars with README: honor PROJECT_ID if set (fall back to gcloud), rename BQ_LOCATION to DATASET_LOCATION
  Review feedback from PR review:
  - Golden gate graceful degradation: skip improvement instead of crashing with traceback when all 3 candidates fail golden eval. Failed cases are still extracted into the golden set.
  - Add cost notes section to README documenting per-cycle Gemini API call growth as golden set expands
  - Update Guardrails docs to reflect skip-on-failure behavior

* Restructure Step 4/5, fix BQ staleness, revert app-name hack
  - Step 4: extract failed cases FIRST so regression gate validates against full golden set (original + extracted)
  - Step 5: mirror Steps 1-3 (golden eval, fresh BQ traffic, quality report from BQ) instead of LLM judge mode
  - Fix BQ staleness: increase propagation wait to 30s, add session ID guard to detect and retry when stale Step 2 sessions are returned
  - Revert --app-name parameter from run_eval.py (not needed)
  - Replace "throwaway agent" wording with "local agent"
  - Update DEMO_SCRIPT.md and README.md to match new flow

* Fix reset.sh to use git checkout, remove redundant baseline files
  reset.sh now uses git checkout to restore prompts.py and eval_cases.json to V1 state. Removed prompts_v1.py and eval_cases_v1.json that were unnecessary copies. Restored committed state to V1 so git checkout works correctly.

* Parallelize eval cases, remove duplicate regression check, suppress warnings
  - Run eval cases concurrently with asyncio.gather (both run_all_cases and run_golden_eval) for ~5-8x speedup
  - Remove redundant Step 5 regression check (Step 4 already validates the candidate against all golden cases)
  - Add PYTHONWARNINGS=ignore to suppress authlib deprecation warnings
  - Remove per-call -W flags (env var covers all python invocations)

* Restore Step 5 regression check (run all golden cases PASS/FAIL)

* Fix BQ retry: validate session count before accepting quality report

* Fix dotenv override, remove duplicate Step 5 eval, update demo numbers
  - Fix quality_report.py dotenv override=True that overwrote demo env vars (DATASET_ID, TABLE_ID) with repo root .env values, causing 0-session results
  - Remove duplicate regression check from Step 5 (already validated in Step 4)
  - Update DEMO_SCRIPT.md with actual demo results (40% → 90%)

* Add setup step to DEMO_SCRIPT.md

* Throttle parallel eval to 3 concurrent requests to avoid Vertex AI 429

* Revert "Throttle parallel eval to 3 concurrent requests to avoid Vertex AI 429"
  This reverts commit 6591599.

* Add HttpRetryOptions for 429 retries, auto-fix pre-flight failures
  - Use HttpRetryOptions(attempts=3) on all genai.Client and Gemini model instances instead of manual retry loops
  - When pre-flight golden eval fails, automatically run the improver to fix the prompt instead of just printing an error
  - Add --from-eval-results flag to improve_agent.py to build a synthetic quality report from golden eval results

* Add reusable LoopAgent-based prompt improver module
  Replace the manual retry loop in improve_agent.py with an ADK LoopAgent that wraps a single LlmAgent with six tools: read_quality_report, read_current_prompt, generate_candidate, test_candidate, write_prompt, and exit_loop. The LLM decides its own workflow — when to retry, when to exit. The agent_improvement/ module is reusable for any ADK agent via ImprovementConfig (agent_factory, tools, prompt_adapter, eval set). Key changes:
  - agent_improvement/: new reusable module with PromptAdapter ABC, EvalRunner, TrafficGenerator, tool introspection, LoopAgent
  - run_improvement.py: demo entry point wiring company_info_agent
  - run_cycle.sh: calls run_improvement.py, handles failed improvements
  - README.md, DEMO_SCRIPT.md: updated architecture and Step 4 docs
  Tested: full cycle V1→V2 with 20%→100% quality on fresh traffic.

* Deduplicate agent creation, remove old improver, fix domain leaks in reusable module
  - Add create_agent(prompt) factory + AGENT_TOOLS to agent/agent.py as single source of truth for agent creation
  - Refactor eval/run_eval.py golden mode to use EvalRunner from agent_improvement instead of duplicating judge prompt and eval logic
  - Refactor run_improvement.py to import create_agent + AGENT_TOOLS instead of duplicating the agent factory
  - Delete improver/ directory (fully replaced by agent_improvement/)
  - Fix hardcoded "lookup_company_policy" in extract_failed_cases()
  - Fix domain-specific "policy question" / "defers to HR" in JUDGE_PROMPT
  - Add judge_prompt field to ImprovementConfig for customization

* Replace inline Python JSON parsing with jq in run_cycle.sh
  7 of 10 python3 -c calls were just reading JSON fields. jq is cleaner, faster, and doesn't need a Python interpreter for simple field access. The 3 remaining python3 -c calls import CURRENT_VERSION from the agent's prompts.py module, which requires Python.

* Remove unused TrafficGenerator ABC and GenericTrafficGenerator
  Never wired up — the demo uses eval/generate_traffic.py (domain-specific) which produces better traffic because it knows the actual policy data.

* Make improvement cycle configurable via --agent-config flag
  Add per-agent improvement_config.py convention with build_config(), get_root_agent(), get_bq_plugin(), and metadata constants. All orchestration scripts (run_cycle.sh, run_improvement.py, run_eval.py) accept --agent-config to point at any agent. Default falls back to demo's company_info_agent. Extend PythonFilePromptAdapter with prompt_variable and version_variable params to support multi-prompt agents (e.g. knowledge_supervisor).

* Revert "Make improvement cycle configurable via --agent-config flag"
  This reverts commit 1244b80.

* Make improvement cycle configurable via JSON config
  Replace Python config module with declarative improve/config.json. Shell script reads metadata with jq and version with grep (zero Python calls for config). Python scripts use config_loader.py which reads JSON, imports agent module, and builds ImprovementConfig. Extend PythonFilePromptAdapter with prompt_variable and version_variable params for multi-prompt agents.

* Add Vertex AI prompt storage, optimizer, and teacher model ground truth
  - Store prompts in Vertex AI Prompt Registry instead of local files. Agent reads prompt from VERTEX_PROMPT_ID env var on startup, falls back to local prompts.py if not set.
  - Add VertexPromptAdapter for cloud prompt read/write/versioning. Delete + recreate on reset for clean v1 state.
  - Add Vertex AI Prompt Optimizer integration (target_response mode). Teacher agent generates synthetic ground truth for failed sessions using the same tools, then feeds (question, bad_response, ground_truth) triples to the optimizer.
  - Fix async bug: generate_candidate and _generate_via_vertex_optimizer are now async, using await instead of run_until_complete().
  - Move config.json from improve/ to project root, remove improve/ dir. Update config_loader agent_root from grandparent to parent.
  - setup.sh now creates Vertex AI prompt automatically (step 6/6). run_cycle.sh auto-creates if vertex_prompt_id is empty.
  - Update README and DEMO_SCRIPT to reflect Vertex AI architecture, config.json, optimizer, teacher model, and future/next steps.

* Fix Vertex AI optimizer: tool-use directive, dependency, and cleanup
  - Fix dependency: vertexai>=1.148.0 -> google-cloud-aiplatform>=1.148.0 (standalone vertexai PyPI package caps at 1.71.1)
  - setup.sh auto-removes conflicting standalone vertexai package
  - Add tool-use directive to optimizer input and re-append after optimizer strips it (fixes 40%->100% instead of 40%->70%)
  - Save ground truth to reports/ground_truth_latest.json for inspection
  - Remove redundant pre-flight re-run (already validated in test_candidate)
  - Clear ImportError when google-cloud-aiplatform not installed
  - Add [improvement] extra to pyproject.toml

* Mirror Vertex AI prompt updates to local prompts.py for git tracking

* Classify extracted eval cases and display prompt in cycle output
  - Infer category and expected_tool for extracted eval cases using keyword matching against known tool topics (pto, sick_leave, remote_work, expenses, benefits, holidays, date_handling)
  - Fix existing 6 extracted cases: unknown -> proper categories
  - Display full prompt text at start and end of improvement cycle
  - Show version, char count, and inspect command for Vertex AI prompts

* Add show_prompt.sh, use it in run_cycle.sh, reset prompts.py on reset

* Add configurable teacher_model_id and update docs for accuracy
  Add teacher_model_id config field so the teacher agent can use a stronger model (e.g. gemini-2.5-pro) for ground truth generation. Defaults to null (same model as target agent). Update README with teacher agent explanation, fix percentages (40%->100%), document prompt_variable/version_variable/show_prompt.sh/local mirroring. Fix DEMO_SCRIPT dependency reference and result percentages.

* Improve cycle UX: progress output, timeouts, and ground truth display
  - Add 120s timeout per eval case so one stuck request doesn't hang the run
  - Show agent vs teacher comparison after ground truth generation
  - Print status before optimizer call and before golden eval test
  - Fill BQ wait times with useful info (questions, golden set, ground truth count)
  - step_end now prints descriptive label and elapsed time
  - Display total wall time in minutes:seconds at end of run
  - Use SDK native HttpRetryOptions for 429 handling on optimizer

* Fix malformed JSON retry in traffic generator, update cycle timing
  - Retry up to 3 times when Gemini returns invalid JSON in traffic gen
  - Update DEMO_SCRIPT timing: 5min->6min per cycle, 15min->18min for 3
  - User-facing run_cycle.sh tweaks from prior session

* Fix setup hanging on stale .env: always recreate, increase timeouts
  - setup.sh now always recreates .env with current project (was skipping if file existed, leaving stale VERTEX_PROMPT_ID from other projects)
  - Increase Vertex AI prompt create/delete timeout from 90s to 300s
  - Add progress prints in setup_vertex.py so user sees where it's stuck

* Fix setup_vertex.py hanging: defer imports, add progress prints
  Move vertexai imports from module-level to inside main() so the user sees "Loading Vertex AI SDK..." before the slow import. On fresh projects, the SDK initialization can take minutes.

* Address PR review: tool-use judge, traffic dedup, cost docs, timeouts
  - Judge prompt now checks expected_tool: fails responses that claim specifics without evidence the tool was called (catches hallucination)
  - Traffic generator deduplicates against golden eval set and retries if fewer than count/2 cases generated
  - README Cost section documents golden eval set growth and its impact
  - setup_vertex.py: defer imports, add progress prints, increase timeout to 300s for fresh project provisioning

* feat: Add overview image and links to Vertex AI docs

* Update README: add cycle diagram, fix config defaults, improve docs
  - Add ASCII visualization diagram showing the full improvement cycle flow
  - Add quick-links navigation bar at the top
  - Fix config table defaults (prompt_storage -> python_file, use_vertex_optimizer -> false)
  - Add missing config fields (vertex_project, vertex_location)
  - Document pre-flight check, per-case timeouts, traffic deduplication
  - Add gcloud command for granting IAM roles
  - Highlight hero moment as blockquote

* feat: Add overview image and links to Vertex AI docs (upd)

* feat: README(upd)

* Fix PR review issues: tool-call evidence, freshness guard, domain leaks
  HIGH:
  - eval_runner: capture actual tool-call events (function_call parts), give judge objective data instead of asking it to guess from text
  - run_cycle.sh: surface failing cases and version before auto-fixing in pre-flight, show V_old -> V_new after fix
  MEDIUM:
  - run_cycle.sh: use actual traffic count (after dedup) for --limit instead of requested count, preventing stale session pollution
  - run_cycle.sh: stricter freshness guard -- zero overlap with Step 3 session IDs, not just != check
  - improver_agent: remove hardcoded HR-domain keywords from _classify_question, derive categories from tool names/docstrings; remove hardcoded lookup_company_policy from optimizer re-append
  LOW:
  - Consistent vertex_location from config.json in run_cycle.sh, show_prompt.sh (was hardcoded us-central1 or using DATASET_LOCATION)
  - Document _state single-process assumption in improver_agent.py
  - Add drift warning in VertexPromptAdapter.read_prompt() when local mirror version differs from Vertex AI

* Remove remaining domain leaks and location hardcode from reusable module
  - Teacher prompt: "contact HR" -> "defer the user elsewhere" (generic)
  - Optimizer tool_use_directive: "policy information/question" -> generic
  - Optimizer Client: hardcoded "us-central1" -> config.vertex_location
  - Add vertex_location field to ImprovementConfig, pass through loader

* Fix README markdown links

* Suppress INFO log spam in run_cycle.sh via LOGLEVEL env var

* Improve _classify_question: word matching, plural normalization, scoring
  Replace substring matching with word-boundary splitting, basic suffix stripping (-s, -ing, -ly, -ed), stop-word filtering, and score-based tool selection. Fixes (unknown) classification for PTO, expense, remote-work, and date questions.

* Suppress authlib deprecation warnings in all Python entry points
  Add warnings.filterwarnings('ignore') before imports in run_eval.py, run_improvement.py, and quality_report.py. Use python3 -W ignore via $PY variable in run_cycle.sh for belt-and-suspenders coverage.

* Add progress indicator for Vertex AI Prompt Optimizer, suppress genai warnings
  The optimizer is a server-side job that takes 2-4 minutes. Previously it blocked silently with no output. Now prints elapsed time every 15s via asyncio.to_thread + progress task. Also suppresses genai SDK "non-text parts" logger warnings across all entry points.

* img update

* Align README step labels with run_cycle.sh output
  Drop the 6th "REPEAT" pseudo-step and match step names to what the script actually prints (e.g. GENERATE SYNTHETIC TRAFFIC).

* Filter authlib deprecation warning from stderr in run_cycle.sh
  The warning comes from google-adk's transitive authlib dependency and bypasses Python warning filters when a newer authlib version is installed in ~/.local/lib. Shell-level stderr filter strips the four lines of noise.

* Address PR review: golden set tool-call conflict, session verification, traffic failures
  High:
  - Set expected_tool to "unknown" for 3 committed baseline cases. V1 answers from inline knowledge without calling tools, so the tool-call evidence judge was failing them on every pre-flight.
  - Add prominent WARNING banner when pre-flight auto-improve runs.
  Medium:
  - Traffic-mode eval now fails if any case has empty session_id or ERROR response, since those cases never reach BigQuery.
  - Step 3/5 save expected session IDs from eval results and verify the quality report covers the same sessions.
  Low:
  - Add -ies plural normalization to _word_forms (policies -> policy).
  - Document all four judge prompt placeholders in config.py.
  - Fix README link to quality_report.py (scripts/ -> ../../scripts/).
  - Remove trailing blank line at EOF in run_cycle.sh.

* Fix README: config defaults, missing fields, teacher prompt, broken sentence
  - prompt_storage default: vertex -> python_file (matches code)
  - use_vertex_optimizer default: true -> false (matches code)
  - Add vertex_project and vertex_location to config table
  - Teacher prompt: "contact HR" -> "defer the user elsewhere" (matches code)
  - Fix incomplete sentence in The Agent section

* Fix setup_vertex.py: use typed config objects for Vertex AI prompt API
  prompts.delete() and prompts.create() expect DeletePromptConfig and CreatePromptConfig objects, not plain dicts. The dict caused "'dict' object has no attribute 'timeout'" on delete.

* Display golden eval set after starting prompt in run_cycle.sh

* Add table of contents to agent improvement cycle README

* Remove bash 3.2-incompatible stderr filter, fix classifier tie-breaking
  Remove the case/esac stderr filter inside process substitution that broke on macOS default bash. Add lexicographic tie-breaking to _classify_question so tool-order no longer affects results.

---------

Co-authored-by: Haiyuan Cao <haiyuan@google.com>
1 parent 7c34289 commit 1185e9b

27 files changed: 4754 additions and 9 deletions

New file shown below (255 additions, 0 deletions):
# Agent Improvement Cycle - Demo Script

**Duration:** ~6 minutes (single cycle), ~18 minutes (3 cycles)
**Format:** Live terminal walkthrough

---
## Introduction (30s)

Agents break in production. You write eval cases, you ship, and then users ask questions you never thought of. The eval suite goes stale. Failures pile up silently.

This demo shows a way to fix that: a closed-loop improvement cycle using the [BigQuery Agent Analytics SDK](https://github.com/GoogleCloudPlatform/BigQuery-Agent-Analytics-SDK). The agent runs, logs sessions to BigQuery using the [BigQuery Agent Analytics plugin for ADK](https://adk.dev/integrations/bigquery-agent-analytics/), evaluates its own quality, and uses the **Vertex AI Prompt Optimizer** to fix what failed. Prompts are stored in the **Vertex AI Prompt Registry**, versioned automatically with every improvement.

---
## Setup (1 min)

**Command:**
```shell
./setup.sh
```

The setup script performs six steps:

1. **Python version:** Verifies Python 3.10+ is installed.
2. **Google Cloud auth:** Confirms `gcloud` is authenticated and a project is set.
3. **APIs:** Enables the BigQuery and Vertex AI APIs if not already active.
4. **Dependencies:** Installs Python packages (`google-cloud-aiplatform`, `google-adk`, `google-genai`, `pandas`, `python-dotenv`).
5. **Configuration:** Creates the `.env` file with your project ID, BigQuery dataset, and table name. Creates the BigQuery dataset if needed.
6. **Vertex AI prompt:** Creates the V1 prompt in the Vertex AI Prompt Registry and writes the prompt ID to `.env` and `config.json`.

---
## Show the Config (30s)

**Command:**
```shell
cat config.json
```

This is the declarative config that drives the entire cycle. Key fields:

- `prompt_storage: "vertex"` -- the prompt lives in Vertex AI, not a local file
- `use_vertex_optimizer: true` -- improvements use the Vertex AI Prompt Optimizer with synthetic ground truth from a teacher model
- `vertex_prompt_id` -- the Vertex AI prompt resource (auto-filled by setup)

The same config works for any ADK agent. Change the `agent_module` and paths to point at a different agent.
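
For background, `config_loader.py` reads this JSON, imports the agent module, and builds an `ImprovementConfig`. A minimal sketch of that idea -- the field names mirror the config keys above, but this is illustrative code, not the demo's actual loader:

```python
# Illustrative sketch only -- the demo's real config_loader.py will differ.
import importlib
import json
from dataclasses import dataclass
from types import ModuleType


@dataclass
class ImprovementConfig:
    prompt_storage: str            # "vertex" or "python_file"
    use_vertex_optimizer: bool
    vertex_prompt_id: str | None
    agent_module: str              # e.g. "agent.agent"
    vertex_project: str | None = None
    vertex_location: str | None = None
    teacher_model_id: str | None = None


def load_config(path: str = "config.json") -> tuple[ImprovementConfig, ModuleType]:
    """Read config.json, import the agent module, and build the config."""
    with open(path) as f:
        raw = json.load(f)
    module = importlib.import_module(raw["agent_module"])
    config = ImprovementConfig(
        prompt_storage=raw.get("prompt_storage", "python_file"),
        use_vertex_optimizer=raw.get("use_vertex_optimizer", False),
        vertex_prompt_id=raw.get("vertex_prompt_id") or None,
        agent_module=raw["agent_module"],
        vertex_project=raw.get("vertex_project"),
        vertex_location=raw.get("vertex_location"),
        teacher_model_id=raw.get("teacher_model_id"),
    )
    return config, module
```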

---
## Show the V1 Prompt (30s)

**Command:**
```shell
cat agent/prompts.py
```

This is the V1 seed prompt. It has intentional flaws that mirror common real-world mistakes:

- It tells the agent to "answer from the knowledge above" instead of calling its tools.
- It covers PTO, sick leave, and remote work, but says nothing about expenses or holidays. Those tools exist, but the prompt ignores them.
- Benefits are described as "competitive" with no details. The agent will guess or deflect.
- There is no mention of the `get_current_date` tool, so date-related questions like "Is next Friday a holiday?" will fail.

The tools can answer all of these questions. The prompt simply does not guide the agent to use them. The agent reads this prompt from the Vertex AI Prompt Registry at startup (not from the file directly).

---
## Show the Golden Eval Set (30s)

**Command:**
```shell
cat eval/eval_cases.json
```

This is the golden eval set -- the regression gate. Three cases that the V1 prompt already handles correctly: PTO days, sick leave, and remote work. The golden set starts small and grows each cycle as failed synthetic cases are extracted into it.

These cases are the floor. No prompt change is accepted unless every golden case still passes. As the cycle runs, failed synthetic cases get added here, raising the bar each iteration.
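
Every case carries four required fields (`id`, `question`, `category`, `expected_tool`), and malformed cases are skipped with a log line rather than failing the run. A minimal sketch of that schema check -- illustrative code, not the demo's loader:

```python
# Illustrative sketch: validate golden eval cases against the required schema
# (id, question, category, expected_tool) with skip-and-log on bad entries.
import json
import logging

REQUIRED_KEYS = {"id", "question", "category", "expected_tool"}


def load_golden_cases(path: str = "eval/eval_cases.json") -> list[dict]:
    with open(path) as f:
        cases = json.load(f)
    valid = []
    for case in cases:
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            # Skip malformed cases instead of aborting the whole eval run.
            logging.warning("Skipping malformed case %s: missing %s",
                            case.get("id"), sorted(missing))
            continue
        valid.append(case)
    return valid
```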

---
## Run One Cycle (~6 min)

**Command:**
```shell
./run_cycle.sh
```

### Pre-flight: Golden Eval (~25s)

The script starts by running the golden eval set against the current prompt. This verifies the starting point: all 3 cases should pass with V1. If any fail, the script auto-improves to fix them before proceeding.
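
The golden eval runs all cases concurrently with a per-case timeout (the demo uses `asyncio.gather` and a 120-second limit per case). A simplified sketch of that pattern -- `run_case` is a stand-in for the real "run the question through a local agent, then judge it" step:

```python
# Simplified sketch of the concurrent golden eval; not the demo's actual code.
import asyncio

CASE_TIMEOUT_S = 120  # one stuck request should not hang the whole run


async def run_case(case: dict) -> bool:
    """Stand-in: run the question through a local agent, then LLM-judge it."""
    raise NotImplementedError


async def run_golden_eval(cases: list[dict]) -> bool:
    async def guarded(case: dict) -> bool:
        try:
            return await asyncio.wait_for(run_case(case), timeout=CASE_TIMEOUT_S)
        except asyncio.TimeoutError:
            return False  # a timed-out case counts as a failure

    results = await asyncio.gather(*(guarded(c) for c in cases))
    for case, passed in zip(cases, results):
        print(f"{'PASS' if passed else 'FAIL'}  {case['id']}: {case['question']}")
    return all(results)  # the script exits non-zero if any golden case fails
```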

### Step 1: Generate Synthetic Traffic (~20s)

The script calls Gemini to generate 10 diverse user questions. These are intentionally different from the golden set -- varied phrasing, situational questions, covering all six policy topics. They simulate real-world traffic the agent has not been tuned for.
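
Under the hood this is a single Gemini call that returns JSON, retried up to 3 times if the JSON is malformed, with anything that duplicates a golden question dropped. A condensed sketch -- model name and prompt wording are illustrative, not `generate_traffic.py` verbatim:

```python
# Condensed sketch of the traffic generator; not generate_traffic.py verbatim.
import json
import os

from google import genai

client = genai.Client(
    vertexai=True,
    project=os.environ["PROJECT_ID"],
    location="us-central1",  # the demo reads vertex_location from config.json
)


def generate_traffic(count: int, golden_questions: set[str]) -> list[str]:
    prompt = (
        f"Generate {count} diverse employee questions about company policies "
        "(PTO, sick leave, remote work, expenses, benefits, holidays, dates). "
        "Return only a JSON array of strings."
    )
    for _ in range(3):  # retry when the model returns invalid JSON
        response = client.models.generate_content(
            model="gemini-2.5-flash", contents=prompt
        )
        try:
            questions = json.loads(response.text)
            break
        except json.JSONDecodeError:
            continue
    else:
        raise RuntimeError("Traffic generation returned invalid JSON 3 times")
    # Deduplicate against the golden eval set so Step 1 stays distinct from it.
    return [q for q in questions if q not in golden_questions]
```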

### Step 2: Run Traffic Through Agent (~30-40s)

The generated questions are sent to the agent using ADK's `InMemoryRunner`. The agent runs locally and executes its tools against local policy data.

Every session is automatically logged to BigQuery by the [BigQuery Agent Analytics plugin](https://adk.dev/integrations/bigquery-agent-analytics/). The full trace is captured: user question, tool calls, LLM responses. No extra logging code required.
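
In outline, the replay loop looks roughly like this. The agent import path and app name are assumptions for the sketch, and the BigQuery Agent Analytics plugin wiring that does the actual logging is omitted:

```python
# Rough outline of the Step 2 replay loop (plugin wiring omitted).
# The agent import path and app name are assumptions for this sketch.
import asyncio

from google.adk.runners import InMemoryRunner
from google.genai import types

from agent.agent import root_agent  # the demo's company info agent


async def replay(questions: list[str]) -> None:
    runner = InMemoryRunner(agent=root_agent, app_name="company_info_agent")
    for question in questions:
        session = await runner.session_service.create_session(
            app_name=runner.app_name, user_id="demo-user"
        )
        message = types.Content(role="user", parts=[types.Part(text=question)])
        async for event in runner.run_async(
            user_id="demo-user", session_id=session.id, new_message=message
        ):
            if event.is_final_response() and event.content and event.content.parts:
                print(f"Q: {question}\nA: {event.content.parts[0].text}\n")


if __name__ == "__main__":
    asyncio.run(replay(["How many PTO days do I get per year?"]))
```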

*(as output scrolls)* Some questions get proper answers, others get "I don't have that information, contact HR."

### Step 3: Evaluate Quality (~25s)

The SDK's quality report reads those sessions back from BigQuery and scores each one on two dimensions:

- **Response usefulness:** Was the answer meaningful, partial, or unhelpful?
- **Task grounding:** Was the answer based on tool output, or did the agent make something up?
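
The scoring is LLM-as-a-judge: for each session the judge model sees the question, the captured tool calls, and the final response, and labels both dimensions. A hypothetical sketch of that idea -- the SDK's real judge prompt, model, and output schema differ:

```python
# Hypothetical sketch of the LLM-as-a-judge idea; the SDK's quality report
# uses its own judge prompt, model, and output schema.
import json
import os

from google import genai

client = genai.Client(
    vertexai=True, project=os.environ["PROJECT_ID"], location="us-central1"
)

JUDGE_PROMPT = """Rate this agent session.
Question: {question}
Tool calls made: {tool_calls}
Final response: {response}
Return JSON: {{"usefulness": "meaningful|partial|unhelpful", "grounded_in_tools": true|false}}"""


def judge_session(question: str, tool_calls: list[str], response: str) -> dict:
    result = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=JUDGE_PROMPT.format(
            question=question, tool_calls=tool_calls, response=response
        ),
    )
    return json.loads(result.text)
```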

*(point to the quality summary)* Around 40% meaningful. The agent had the right tools all along -- the prompt just did not let it use them.

### Step 4: Improve Prompt (~1-2 min)

This step uses an ADK **LoopAgent** -- an agent that improves the agent. It has six tools and decides its own workflow. A typical run looks like this:

1. **Extract failures:** Failed synthetic cases are pulled from the quality report and added to the golden eval set. The golden set grows from 3 to ~10 cases.

2. **Generate ground truth:** A "teacher agent" (same tools, better prompt) re-answers each failed question to produce what the correct response should have been. This is the synthetic ground truth.

3. **Optimize prompt:** The current prompt + (question, bad_response, ground_truth) triples are sent to the **Vertex AI Prompt Optimizer** in `target_response` mode. The optimizer generates a structurally improved prompt.

4. **Regression gate:** The optimized prompt is tested against the FULL golden set (original 3 + extracted failures). All cases must pass. If any fail, the LLM analyzes why and generates a better candidate. The loop exits when all cases pass (via `exit_loop`) or after 3 iterations.

5. **Write to Vertex AI:** The validated prompt is written to the Vertex AI Prompt Registry as a new version.

The improvement module is reusable -- it works with any ADK agent, not just this demo. You provide a `config.json` with your agent module, tools, and eval set.
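
For background, the improver wires a single `LlmAgent` that owns six tools (`read_quality_report`, `read_current_prompt`, `generate_candidate`, `test_candidate`, `write_prompt`, `exit_loop`) inside an ADK `LoopAgent` capped at 3 iterations. A skeletal sketch -- tool bodies, instruction text, and model name are placeholders, not the `agent_improvement` module's actual code:

```python
# Skeletal sketch of the improver wiring; placeholders, not the real module.
from google.adk.agents import LlmAgent, LoopAgent


def read_quality_report() -> dict: ...
def read_current_prompt() -> str: ...
def generate_candidate(analysis: str) -> str: ...
def test_candidate(candidate_prompt: str) -> dict: ...   # runs the golden eval set
def write_prompt(candidate_prompt: str) -> str: ...      # new Prompt Registry version
def exit_loop() -> None: ...                              # tells the LoopAgent to stop


improver = LoopAgent(
    name="prompt_improver",
    max_iterations=3,  # give up after three failed candidates
    sub_agents=[
        LlmAgent(
            name="improver_worker",
            model="gemini-2.5-flash",
            instruction=(
                "Read the quality report and current prompt, generate an improved "
                "candidate, test it against the golden eval set, and only write it "
                "when every golden case passes; then call exit_loop."
            ),
            tools=[
                read_quality_report,
                read_current_prompt,
                generate_candidate,
                test_candidate,
                write_prompt,
                exit_loop,
            ],
        )
    ],
)
```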

*(point to output)* V1 becomes V2. The candidate passed all 10 cases.

### Step 5: Measure Improvement (~2-3 min)

Step 5 mirrors Steps 1-3 but with the improved prompt:

1. **Fresh traffic:** Gemini generates a NEW batch of 10 questions. Re-running the Step 1 traffic would be circular -- the prompt was specifically fixed to handle those questions.

2. **Run through agent:** The fresh questions are sent to the V2 agent and logged to BigQuery -- exactly like Step 2.

3. **Score from BigQuery:** The SDK's quality report reads the new sessions from BigQuery and scores them -- exactly like Step 3.

*(point to the results box)*

```
Before (V1): 40% meaningful (4/10 sessions)
After (V2):  100% meaningful (10/10 sessions)
```

From 40% to 100% in one automated cycle, scored from BigQuery on entirely new questions.

---
## Multi-Cycle Run (optional, ~18 min)

To show the full loop with prompt refinement across cycles:

```shell
./reset.sh
./run_cycle.sh --cycles 3
```

Each cycle generates fresh synthetic traffic, evaluates, improves, and measures. The golden eval set grows with each cycle as new edge cases are discovered and locked in.

---
## Wrap-Up (30s)

**Command:**
```shell
cat eval/eval_cases.json
```

The prompt evolved from V1 to V2 (or V4 with 3 cycles), each version stored in the Vertex AI Prompt Registry. The golden eval set grew from 3 cases to 10+ cases, each new case extracted from a real failure.

The key idea: the golden eval set is the regression gate. Synthetic traffic discovers new failures. The Vertex AI Prompt Optimizer fixes the prompt using teacher-generated ground truth. The golden eval gate ensures nothing breaks, and failed cases are extracted into the golden set so they never recur. Over time, your tests reflect what users actually ask -- not what you guessed they would.

To reset and run again:
```shell
./reset.sh
```
