
Commit 372662d

fix(cycle): timestamped outputs, log cleanup, quality threshold (#102)
* Safe defaults for run_cycle.sh: --auto opt-in, N=100 docs
  - Flip AUTO_CONTINUE default from true to false (single cycle by default)
  - Replace --no-auto flag with --auto (opt-in to multi-cycle)
  - Update README with N=100 experiment findings and cost notes
  - Update DEMO_NARRATION with new flag names

* feat(improvement-cycle): add category-aware failure extraction, rename config fields

  Add max_failure_extract config for controlling how many failed cases are
  extracted into the golden eval set each cycle. At low traffic (N=10),
  extracting all 3-5 failures works fine. At higher volumes (N>=100), 30-40+
  failures overwhelm the optimizer's regression gate. Three modes: null
  (extract all — small-N demo default), "auto" (two-tier category-aware
  selection capped at 2x categories), or an integer hard cap.

  Rename config fields for clarity:
  - max_extract -> max_failure_extract (indicates failed-case scope)
  - max_attempts -> optimizer_max_iterations (indicates Vertex AI Prompt
    Optimizer iteration budget, not a generic retry count)

  Config loader preserves backwards compatibility with the old names via
  fallback lookups.

* fix(cycle): filter quality report by session IDs, tolerate traffic timeouts

  Two issues broke run_cycle.sh at high traffic volumes (N>=100):

  1. Quality report session contamination: quality_report.py used --limit N
     --time-period 24h, which pulled sessions from prior runs, pre-flight
     checks, and earlier cycles sharing the same app_name. A 150-question run
     scored 315 sessions. Fix: add --session-ids-file flag to
     quality_report.py that filters to exactly the session IDs produced by the
     current traffic run. run_cycle.sh now passes the saved eval results file
     (Steps 3 and 5) instead of relying on time-based queries.

  2. Traffic timeouts abort the cycle: run_eval.py exits non-zero when any
     case times out or errors. With set -euo pipefail, this killed the entire
     cycle at Step 2. Fix: wrap both traffic calls (Steps 2 and 5) in set +e
     so timeouts are logged but don't abort the cycle. Timed-out cases have no
     session_id and are naturally excluded from quality scoring.

* fix(improvement-cycle): throttle concurrent LLM calls to prevent 429 rate-limit crashes

  Add asyncio.Semaphore(5) to _generate_ground_truth and run_golden_eval to
  cap concurrent Vertex AI requests. Previously all failed sessions fired in
  parallel (100+), overwhelming the quota. Also catch per-session errors in
  the teacher agent so a single 429 skips that session instead of crashing the
  entire improvement step.

* small visual fix

* docs(scripts): add session filtering flags to quality_report README

  Document --app-name, --session-ids-file, --output-json, and --threshold
  flags. Add Filtering section explaining session ID and app name filtering.

* fix(cycle): truncate traffic to requested count, throttle eval concurrency

  generate_traffic.py: Gemini can return more cases than requested (e.g. 201
  instead of 100). Truncate to the requested count.

  run_eval.py: add Semaphore(5) to cap concurrent agent calls. With 100+
  cases firing simultaneously, requests queue behind rate limits and many
  exceed the 120s per-case timeout.

* fix(metrics): display turn count as integer instead of float

  Turn count showed "1.0 turns" instead of "1 turns". Add fmt=int config to
  the turn_count metric and apply integer formatting in both baseline and
  comparison table output.

* fix(cycle): limit optimizer ground truth to extracted cases, fix turn_count display

  improver_agent.py: filter report sessions to only the extracted cases
  before passing to the LoopAgent. Previously max_failure_extract="auto"
  extracted 12 cases into the golden set but the optimizer generated ground
  truth for all 42 failures -- wasteful and rate-limit prone.

  operational_metrics.py: round turn_count to int at the data level so it
  displays as "1 turns" instead of "1.0 turns".

* refactor(cycle): replace print statements with structured logging, add timestamps to shell steps

  Centralizes logging config in __init__.py, suppresses noisy third-party
  loggers, and adds HH:MM:SS timestamps to run_cycle.sh for easier debugging
  of long runs. Updates docs to reflect auto failure extraction.

* fix(cycle): route outputs to timestamped dirs, suppress SDK log noise

  - Add --output-dir to run_eval.py and run_improvement.py so all artifacts
    (eval results, quality reports, ground truth) land in the per-run
    timestamped reports/ subdirectory instead of scattered locations
  - Suppress noisy SDK loggers (google_genai, google_adk, httpx, authlib)
  - Strip ANSI escape codes from run.log
  - Make quality threshold configurable
  - Update README DEMO_NARRATION to match

* style: autoformat __init__.py and run_eval.py
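The throttling fix above recurs in three places (`_generate_ground_truth`, `run_golden_eval`, `run_eval.py`). A minimal sketch of the pattern, assuming a hypothetical `agent_call` coroutine and case shape; the actual code may differ:

```python
import asyncio

async def call_with_limit(sem: asyncio.Semaphore, agent_call, case: dict):
    """Cap in-flight Vertex AI requests; a per-case error (e.g. a 429)
    skips that case instead of crashing the whole step."""
    async with sem:
        try:
            return await agent_call(case)
        except Exception as exc:
            print(f"skipping case {case.get('id')}: {exc}")
            return None

async def run_all(agent_call, cases: list[dict]) -> list:
    sem = asyncio.Semaphore(5)  # at most 5 concurrent requests
    return await asyncio.gather(
        *(call_with_limit(sem, agent_call, c) for c in cases)
    )
```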
1 parent c0bc97b commit 372662d

17 files changed

Lines changed: 781 additions & 251 deletions

examples/agent_improvement_cycle/DEMO_NARRATION.md

Lines changed: 8 additions & 7 deletions
@@ -144,7 +144,7 @@ The agent uses its tools and gives direct, grounded answers. No more "contact HR
 Ten out of ten sessions scored as helpful and grounded — in a single cycle.
 
 And now the operational comparison — the same deterministic metrics we captured as a baseline in Step 3, run on the V2 sessions and shown side by side.
-Latency stayed flat or improved — the V2 prompt routes to tools directly instead of deliberating. Token usage is comparable.
+Latency consistently drops with V2 — the V1 agent spends time deliberating before refusing, while V2 routes to tools immediately and responds. Token usage increases because V2 calls tools for every question instead of refusing some, but stays well within budget.
 Turn count unchanged. The improved prompt didn't trade quality for cost.
 
 
@@ -184,8 +184,8 @@ To reset everything back to V1 and start over:
 ./reset.sh
 ```
 
-This reverts the prompt in the Vertex AI Prompt Registry to V1, restores the original three golden eval cases,
-and clears reports.
+This reverts the prompt in the Vertex AI Prompt Registry to V1 and restores the original three golden eval cases.
+Previous run reports (under `reports/run_*/`) are preserved.
 
 There are multiple options of how to run the flow:
 ```shell
@@ -197,19 +197,20 @@ Options:
   --agent-config F    Path to agent's config.json
                       (default: config.json)
   --cycles N          Run N improvement cycles (default: 1)
-  --no-auto           Always run all N cycles even if 100%
-                      meaningful (default: stop early)
+  --auto              Enable auto-cycling: run up to N cycles,
+                      stop early when quality meets threshold
   --eval-only         Only run evaluation (Steps 1-3), skip improvement
   --app-name X        Override agent app name for BQ filtering
   --traffic-count N   Number of synthetic questions per cycle (default: 10)
+  --threshold N       Override quality_threshold (0-100, default: from config)
   -h, --help          Show this help message
 ```
 
 You can run again with multiple cycles to see iterative refinement:
 ```shell
-./run_cycle.sh --cycles 3
+./run_cycle.sh --auto --cycles 3
 ```
-By default, the cycle stops early once all synthetic traffic scores 100% meaningful. Each cycle generates fresh traffic, evaluates, improves, and measures. The golden eval set grows with each cycle as new edge cases are discovered.
+By default, the script runs a single cycle and stops. The `--auto` flag enables auto-cycling, which runs up to N cycles and stops early once quality meets `quality_threshold` from `config.json` (default: 0.95 = 95%). The threshold is set below 100% because LLM output is non-deterministic -- at N=100, ~1% variance is noise, not a systematic gap worth another optimizer cycle. Each cycle generates fresh traffic, evaluates, improves, and measures. The golden eval set grows with each cycle as new edge cases are discovered.
 
 ---
 
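The `--auto` early-stop behavior described above reduces to a short loop. A sketch, assuming a hypothetical `run_one_cycle` callback that returns the fraction of meaningful sessions:

```python
def run_auto_cycles(run_one_cycle, max_cycles: int,
                    quality_threshold: float = 0.95) -> float:
    """--auto mode: run up to max_cycles, stop once quality meets the
    threshold rather than chasing ~1% stochastic noise at N=100."""
    quality = 0.0
    for cycle in range(1, max_cycles + 1):
        quality = run_one_cycle(cycle)  # e.g. 0.93 = 93% meaningful
        if quality >= quality_threshold:
            break
    return quality
```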
examples/agent_improvement_cycle/README.md

Lines changed: 155 additions & 27 deletions
@@ -136,12 +136,34 @@ before/after comparison to verify the improved prompt didn't regress
 on cost or performance. No extra agent runs — just math on the session
 data already in BigQuery.
 
-Run multiple cycles with `--cycles N`. By default, the cycle stops
-early once all synthetic traffic scores 100% meaningful. Use
-`--no-auto` to force all N cycles to run regardless.
-
-The hero moment: quality typically climbs from ~40% to ~100% in a single cycle
-(results vary due to non-deterministic LLM output).
+By default, the script runs a **single cycle** and stops. This is the
+safe default -- each cycle makes dozens of Gemini API calls, and
+running multiple cycles unintentionally can lead to unexpected costs.
+
+To run multiple improvement cycles, use `--auto --cycles N`. The
+`--auto` flag enables auto-cycling, which runs up to N cycles and
+stops early once quality meets the `quality_threshold` setting in
+`config.json` (default: `0.95`, i.e. 95% meaningful).
+
+**Why 95% and not 100%?** LLM output is non-deterministic. At N=100
+traffic, a single stochastic misfire causes a 1% drop. Setting the
+threshold to 100% leads to cycles that fight random variance rather
+than fix systematic gaps. The 95% default means: stop when real
+failures are gone, don't chase noise. If the improvement step finds
+quality already at or above the threshold, it skips the optimizer
+entirely and the cycle moves on. If no new prompt version is produced,
+the measurement step (Step 5) is also skipped -- there is nothing to
+compare.
+
+The hero moment: quality typically climbs from ~60% to ~100% in a single cycle
+(results vary due to non-deterministic LLM output). With the default
+N=10 traffic, the improvement step typically succeeds on the first
+optimizer attempt. At higher traffic volumes (`--traffic-count 100`),
+the system discovers more failures but `max_failure_extract: "auto"`
+applies category-aware selection to extract a representative subset
+(~12 cases from ~42 failures in a typical run), keeping the regression
+gate strict but manageable. Use `--auto --cycles 3` for higher-N runs
+to give the optimizer multiple cycles to converge if needed.
 
 ### Why This Matters
 
@@ -208,7 +230,7 @@ All agent-specific settings live in a single declarative config file:
   "eval_cases_path": "eval/eval_cases.json",
   "traffic_generator": "eval/generate_traffic.py",
   "model_id": "gemini-2.5-flash",
-  "max_attempts": 3,
+  "optimizer_max_iterations": 3,
   "prompt_storage": "vertex",
   "vertex_prompt_id": "1234567890",
   "use_vertex_optimizer": true,
@@ -269,6 +291,17 @@ When the cycle identifies failed sessions, it uses the
 4. **Validate**: Test the optimized prompt against the full golden
    eval set before accepting it.
 
+The optimizer also receives the agent's **tool signatures**, auto-extracted
+from the Python functions by `tool_introspection.py` using `inspect` --
+function name, parameter types, and full docstrings. These are appended
+to the prompt as plain text so the optimizer knows what tools exist and
+what arguments they accept. This is how the V2 prompt ends up with
+explicit topic-to-tool mappings: the optimizer saw the tool's signature,
+saw the teacher successfully calling it with specific arguments, and
+generated routing instructions accordingly. If the optimizer's output
+strips the tool references (which it tends to do), they are
+re-appended automatically.
+
 This replaces raw "ask Gemini to rewrite the prompt" with a
 structured optimization pipeline backed by real failure data.
 
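A minimal sketch of the `inspect`-based extraction described in this hunk; the exact formatting in `tool_introspection.py` may differ, and `lookup_policy` is a hypothetical tool:

```python
import inspect

def describe_tool(fn) -> str:
    """Render one tool as plain text: name, signature, full docstring."""
    sig = inspect.signature(fn)
    doc = inspect.getdoc(fn) or "(no docstring)"
    return f"{fn.__name__}{sig}\n{doc}"

def lookup_policy(topic: str, region: str = "us") -> str:
    """Return the company policy text for a topic."""  # hypothetical tool
    ...

print(describe_tool(lookup_policy))
# lookup_policy(topic: str, region: str = 'us') -> str
# Return the company policy text for a topic.
```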
@@ -332,10 +365,10 @@ Failed sessions from quality report
 toward tool-grounded answers
 ```
 
-The teacher's answers are saved to `reports/ground_truth_latest.json`
-for inspection. Each entry contains the original question, the bad
-response from the target agent, and the teacher's ground truth
-answer.
+The teacher's answers are saved to
+`reports/run_YYYYMMDD_HHMMSS/ground_truth_latest.json` for inspection.
+Each entry contains the original question, the bad response from the
+target agent, and the teacher's ground truth answer.
 
 #### Why not just use the teacher prompt directly?
 
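For orientation, one ground-truth entry might look like the following; field names and values here are hypothetical, assumed from the description above rather than taken from the actual schema:

```python
# Hypothetical shape of one ground_truth_latest.json entry.
entry = {
    "question": "How many vacation days do new employees get?",
    "bad_response": "I'm not sure, please contact HR.",  # target agent (V1)
    "ground_truth": "New employees accrue 15 vacation days per year.",  # teacher
}
```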
@@ -437,13 +470,44 @@ BigQuery from Steps 2 and 5; no additional agent runs are needed. See
 - **Golden eval gate**: Candidate prompts must pass ALL golden cases.
   Rejected if any fail, retried up to 3 times.
 - **Eval case extraction**: Failed synthetic cases are added to the
-  golden set before improvement, raising the bar each cycle.
+  golden set before improvement, raising the bar each cycle. The
+  `max_failure_extract` config controls how many cases are extracted (see
+  [Scaling extraction](#scaling-extraction) below).
 - **Question deduplication**: Extracted cases are deduplicated by both
   ID and question text.
 - **Length check**: Prompts shorter than 50 characters are rejected.
 - **Retry with backoff**: Quality report step retries for BigQuery
   write propagation delays.
 
+### Scaling extraction
+
+At the default traffic volume (N=10), the system typically discovers
+3-5 failures, all of which are extracted into the golden eval set.
+The regression gate (3 original + 3-5 extracted = ~8 cases) is
+manageable for the optimizer to satisfy in one pass.
+
+At higher volumes (`--traffic-count 100`), the system discovers
+30-43 failures. Extracting all of them creates a regression gate of
+40+ cases, which is often too strict for the optimizer to satisfy
+in a single attempt. Many of these failures are redundant — 15
+might be "benefits" questions that all fail the same way.
+
+The `max_failure_extract` config field controls this:
+
+| Value | Behavior |
+|-------|----------|
+| `null` (default) | Extract **all** failures — every unhelpful or partial session becomes a golden eval case. This is the right choice for the small-N demo (N=10) where there are only 3-5 failures. At higher traffic volumes it can overwhelm the optimizer (see below). |
+| `"auto"` | Two-tier category-aware selection. Tier 1: one failure per category (breadth). Tier 2: fill proportionally from heaviest categories. Budget = 2 × number of failing categories. For 6 categories, that's ~12 cases. |
+| Integer (e.g. `10`) | Hard cap with category-aware selection. Same two-tier logic. |
+
+Example config for high-traffic runs:
+
+```json
+{
+  "max_failure_extract": "auto"
+}
+```
+
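A sketch of the two-tier selection behind `"auto"` mode, assuming each failure record carries a `category` field; the real weighting may differ:

```python
from collections import defaultdict

def select_failures(failures: list[dict], budget: int | None = None) -> list[dict]:
    """Category-aware extraction sketch: breadth first, then depth."""
    by_cat: dict[str, list[dict]] = defaultdict(list)
    for f in failures:
        by_cat[f["category"]].append(f)
    if budget is None:
        budget = 2 * len(by_cat)  # "auto": 2 x number of failing categories
    # Tier 1 (breadth): one case per failing category.
    picked = [cases[0] for cases in by_cat.values()]
    # Tier 2 (depth): fill the remaining budget from the heaviest categories.
    leftovers = sorted(
        (f for cases in by_cat.values() for f in cases[1:]),
        key=lambda f: -len(by_cat[f["category"]]),
    )
    picked.extend(leftovers[: max(0, budget - len(picked))])
    return picked
```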
 ## Quick Start
 
 ### Prerequisites
@@ -487,50 +551,113 @@ git tracking.
 ### 3. Run the demo
 
 ```bash
-# Single improvement cycle
+# Single improvement cycle (default — safe for experimentation)
 ./run_cycle.sh
 
-# Up to 3 cycles, stops early when 100% meaningful (default behavior)
-./run_cycle.sh --cycles 3
+# Auto-cycle: run up to 3 cycles, stop early when quality meets threshold (95%)
+./run_cycle.sh --auto --cycles 3
 
-# Force all 3 cycles even if 100% is reached early
-./run_cycle.sh --cycles 3 --no-auto
+# Exactly 3 cycles (no early stopping)
+./run_cycle.sh --cycles 3
 
 # Eval only (no improvement step)
 ./run_cycle.sh --eval-only
 
 # Customize traffic volume
-./run_cycle.sh --cycles 3 --traffic-count 20
+./run_cycle.sh --auto --cycles 3 --traffic-count 20
+
+# Scaled run (N=100)
+./run_cycle.sh --auto --cycles 5 --traffic-count 100
 
 # Use a different agent's config
 ./run_cycle.sh --agent-config /path/to/other/config.json
 ```
 
+The scaled run combines all the flags:
+
+| Flag | What it does |
+|------|--------------|
+| `--auto` | Stop early when quality meets `quality_threshold` (default 95%) |
+| `--cycles 5` | Run up to 5 improvement cycles |
+| `--traffic-count 100` | Generate 100 synthetic questions per cycle (default: 10) |
+
+All output is automatically logged to `reports/run_YYYYMMDD_HHMMSS/run.log`
+(ANSI colour codes stripped for readability). Each run gets its own
+timestamped subdirectory under `reports/`, so previous runs are preserved.
+
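The per-run directory and log cleanup can be pictured as follows. The script itself is shell; this is an illustrative Python equivalent, with names mirroring the layout shown under "Inspect results" below:

```python
import datetime
import pathlib
import re

ANSI_RE = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")  # CSI sequences, e.g. "\x1b[31m"

def make_run_dir(base: str = "reports") -> pathlib.Path:
    """Create the per-run output directory, e.g. reports/run_20260430_174920."""
    stamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    run_dir = pathlib.Path(base) / f"run_{stamp}"
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir

def append_log(run_dir: pathlib.Path, console_output: str) -> None:
    """Strip ANSI colour codes before appending to run.log."""
    clean = ANSI_RE.sub("", console_output)
    with open(run_dir / "run.log", "a", encoding="utf-8") as f:
        f.write(clean)
```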
+> **Cost note:** Each cycle makes ~50-80 Gemini API calls (more with
+> higher `--traffic-count`). Running `./run_cycle.sh` with no flags is
+> always safe (1 cycle). Use `--auto --cycles N` only when you
+> intentionally want multiple iterations.
+
+#### Standalone quality report
+
+The `quality_report.sh` wrapper can be run independently. Use
+`--env` to point at the right `.env` file (otherwise it loads the
+repo root `.env` which may target a different dataset):
+
+```bash
+# From anywhere — explicit .env
+../../scripts/quality_report.sh \
+  --env .env \
+  --app-name company_info_agent \
+  --time-period all --limit 100
+```
+
+The `--env` flag is also available on `quality_report.py` directly.
+
509609
### 4. Inspect results
510610

511-
After a run, check the `reports/` directory:
611+
Each run creates a timestamped subdirectory under `reports/`:
612+
613+
```
614+
reports/
615+
run_20260430_174920/ # one directory per run
616+
run.log # full console output (ANSI stripped)
617+
synthetic_traffic_cycle_1.json # generated questions (Step 1)
618+
latest_eval_results.json # session IDs + responses (Step 2)
619+
expected_session_ids_cycle_1.json # copy of eval results for BQ lookup
620+
quality_report_cycle_1.json # LLM judge scores (Step 3)
621+
operational_metrics_cycle_1_baseline.json # latency/tokens/turns (Step 3)
622+
ground_truth_latest.json # teacher agent answers (Step 4)
623+
synthetic_traffic_cycle_1_fresh.json # fresh questions (Step 5)
624+
expected_session_ids_cycle_1_fresh.json # fresh session IDs (Step 5)
625+
quality_report_cycle_1_after.json # post-improvement scores (Step 5)
626+
operational_metrics_cycle_1.json # before/after comparison (Step 5)
627+
run_20260430_183045/ # next run — previous runs are preserved
628+
...
629+
```
630+
631+
Previous runs are never deleted. `reset.sh` only resets the prompt
632+
and golden eval set, not the reports directory.
512633

513634
```bash
635+
# Browse runs
636+
ls reports/
637+
514638
# Quality report JSON (consumed by the improver)
515-
cat reports/quality_report_cycle_1.json | python3 -m json.tool | head -20
639+
cat reports/run_*/quality_report_cycle_1.json | python3 -m json.tool | head -20
516640

517-
# Synthetic traffic that was generated
518-
cat reports/synthetic_traffic_cycle_1.json | python3 -m json.tool | head -20
641+
# Full console log
642+
less reports/run_20260430_174920/run.log
519643

520644
# See new eval cases extracted from failures
521645
cat eval/eval_cases.json
522646
```
523647

524648
### Reset to V1
525649

526-
To start over, reset everything to the initial state (fresh V1
527-
prompt in Vertex AI, 3 golden eval cases, no reports):
650+
To start over, reset the prompt and golden eval set to their initial
651+
state. Previous run reports are preserved.
528652

529653
```bash
530654
./reset.sh
531655
```
532656

533-
This deletes the old Vertex AI prompt and creates a fresh one at V1.
657+
This restores the V1 prompt in Vertex AI, resets `eval_cases.json` to
658+
the original 3 golden cases, and removes generated synthetic traffic
659+
files. The `reports/` directory (with timestamped run subdirectories)
660+
is not deleted.
534661

535662
## Using with Other Agents
536663

@@ -556,13 +683,14 @@ agent:
 | `eval_cases_path` | required | Path to golden eval set JSON |
 | `traffic_generator` | required | Path to traffic generation script |
 | `model_id` | `gemini-2.5-flash` | Gemini model for agent and judge |
-| `max_attempts` | `3` | Max prompt improvement attempts per cycle |
+| `optimizer_max_iterations` | `3` | Max Vertex AI Prompt Optimizer iterations per improvement step |
 | `prompt_storage` | `python_file` | `vertex` or `python_file` |
 | `vertex_prompt_id` | `""` | Vertex AI prompt ID (auto-filled by setup) |
 | `vertex_project` | from `gcloud` | GCP project for Vertex AI (defaults to env) |
 | `vertex_location` | `us-central1` | Vertex AI region |
 | `use_vertex_optimizer` | `false` | Use Vertex AI Prompt Optimizer |
 | `teacher_model_id` | `null` | Model for teacher agent (null = same as `model_id`) |
+| `max_failure_extract` | `null` | Max failed cases to extract per cycle. `null` = extract **all** failures (best for the small-N demo where N<=20). `"auto"` = two-tier category-aware selection (~2x categories). Integer = hard cap with category-aware selection. See [Scaling extraction](#scaling-extraction). |
 
 ### Environment variables (.env)
 
@@ -586,7 +714,7 @@ Gemini calls; a 3-cycle run uses ~200-300.
 **Golden eval set growth:** The golden eval set grows each cycle as
 failed synthetic cases are extracted into it. Each improvement attempt
 validates the candidate prompt against the full golden set (N agent
-calls + N judge calls per attempt, up to `max_attempts` retries).
+calls + N judge calls per attempt, up to `optimizer_max_iterations` retries).
 After several cycles, the golden set can reach 20+ cases, increasing
 both cost and runtime of the validation step. For long-running
 deployments, consider periodically pruning redundant golden cases.

examples/agent_improvement_cycle/agent_improvement/__init__.py

Lines changed: 37 additions & 0 deletions
@@ -19,6 +19,43 @@
 cases pass.
 """
 
+import logging
+import warnings
+
+warnings.filterwarnings("ignore")
+
+# authlib forces simplefilter("always") at import time; neutralise it
+# by importing the module early and overriding the filter.
+try:
+    import authlib.deprecate
+
+    warnings.filterwarnings(
+        "ignore", category=authlib.deprecate.AuthlibDeprecationWarning
+    )
+except ImportError:
+    pass
+
+logging.basicConfig(
+    level=logging.INFO,
+    format="[%(asctime)s] %(message)s",
+    datefmt="%H:%M:%S",
+)
+# Suppress noisy third-party loggers.
+# google.genai / google_genai — "AFC is enabled", "will take precedence"
+# google_adk — "Sending out request", "Response received"
+# httpx / httpcore — "HTTP Request: POST ..."
+for _noisy in (
+    "google.genai",
+    "google_genai",
+    "google.adk",
+    "google_adk",
+    "google.auth",
+    "google_auth",
+    "httpx",
+    "httpcore",
+):
+    logging.getLogger(_noisy).setLevel(logging.ERROR)
+
 from agent_improvement.config import ImprovementConfig
 from agent_improvement.config_loader import load_agent_module
 from agent_improvement.config_loader import load_config

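With this configuration in place, any module-level logger picks up the timestamped format; a quick usage example:

```python
import logging

log = logging.getLogger(__name__)
log.info("Step 2: generating traffic")
# Output: [14:32:07] Step 2: generating traffic
```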
examples/agent_improvement_cycle/agent_improvement/config.py

Lines changed: 6 additions & 4 deletions
@@ -40,8 +40,9 @@ class ImprovementConfig:
         (e.g. a ``prompts.py`` file).
     eval_cases_path: Path to the golden eval set JSON file.
     model_id: Gemini model used by the improver and judge LLMs.
-    max_attempts: Maximum number of candidate prompts to try before
-        giving up.
+    optimizer_max_iterations: Maximum number of prompt optimizer
+        iterations per improvement step (Vertex AI Prompt Optimizer
+        retry budget).
     quality_threshold: Fraction of golden cases that must pass
         (1.0 = all cases).
     teacher_model_id: Optional Gemini model for the teacher agent that
@@ -63,9 +64,10 @@ class ImprovementConfig:
     prompt_adapter: PromptAdapter
     eval_cases_path: str
     model_id: str = "gemini-2.5-flash"
-    max_attempts: int = 3
-    quality_threshold: float = 1.0
+    optimizer_max_iterations: int = 3
+    quality_threshold: float = 0.95
     judge_prompt: str | None = None
     teacher_model_id: str | None = None
     use_vertex_optimizer: bool = False
     vertex_location: str = "us-central1"
+    max_failure_extract: int | str | None = None

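The commit message notes that the config loader keeps the old field names working via fallback lookups. A minimal sketch of that pattern; the helper name is hypothetical:

```python
def _get_renamed(raw: dict, new_key: str, old_key: str, default):
    """Prefer the renamed config field, fall back to the legacy name."""
    if new_key in raw:
        return raw[new_key]
    return raw.get(old_key, default)

# e.g. in the loader:
# optimizer_max_iterations = _get_renamed(raw, "optimizer_max_iterations", "max_attempts", 3)
# max_failure_extract = _get_renamed(raw, "max_failure_extract", "max_extract", None)
```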