Commit 372662d
fix(cycle): timestamped outputs, log cleanup, quality threshold (#102)
* Safe defaults for run_cycle.sh: --auto opt-in, N=100 docs
- Flip AUTO_CONTINUE default from true to false (single cycle by default)
- Replace --no-auto flag with --auto (opt-in to multi-cycle)
- Update README with N=100 experiment findings and cost notes
- Update DEMO_NARRATION with new flag names
* feat(improvement-cycle): add category-aware failure extraction, rename config fields
Add max_failure_extract config for controlling how many failed cases
are extracted into the golden eval set each cycle. At low traffic
(N=10), extracting all 3-5 failures works fine. At higher volumes
(N>=100), 30-40+ failures overwhelm the optimizer's regression gate.
Three modes: null (extract all — small-N demo default), "auto"
(two-tier category-aware selection capped at 2x categories), or an
integer hard cap.
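The three modes can be sketched as follows. This is a minimal illustration, not the code from this commit: the function name `select_failures`, the `category` key on each case, and the reading of "two-tier" as "one case per category first, then a second case per category" are all assumptions.

```python
from collections import defaultdict
from typing import Optional, Union


def select_failures(
    failures: list[dict],
    max_failure_extract: Optional[Union[int, str]] = None,
) -> list[dict]:
    """Pick which failed cases are extracted into the golden eval set."""
    # Mode 1: null/None extracts everything (the small-N demo default).
    if max_failure_extract is None:
        return list(failures)

    # Mode 3: an integer is a hard cap (bool is excluded on purpose).
    if isinstance(max_failure_extract, int) and not isinstance(max_failure_extract, bool):
        if max_failure_extract < 0:
            raise ValueError("max_failure_extract must be >= 0")
        return failures[:max_failure_extract]

    if max_failure_extract != "auto":
        raise ValueError(f"unsupported max_failure_extract: {max_failure_extract!r}")

    # Mode 2: "auto". Group by category (hypothetical key), cap at 2x categories.
    by_category: dict[str, list[dict]] = defaultdict(list)
    for case in failures:
        by_category[case.get("category", "unknown")].append(case)

    cap = 2 * len(by_category)
    selected: list[dict] = []
    # Tier 0 takes the first case of every category, tier 1 the second.
    for tier in range(2):
        for cases in by_category.values():
            if tier < len(cases) and len(selected) < cap:
                selected.append(cases[tier])
    return selected
```

Under these assumptions, every category is represented before any category contributes a second case, so a run with many failures in one category does not crowd out the others.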
Rename config fields for clarity:
- max_extract -> max_failure_extract (indicates failed-case scope)
- max_attempts -> optimizer_max_iterations (indicates Vertex AI
Prompt Optimizer iteration budget, not a generic retry count)
Config loader preserves backwards compatibility with the old names
via fallback lookups.
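A fallback lookup of this kind might look like the sketch below. The helper names `get_with_fallback` and `load_cycle_config` are hypothetical; only the old and new field names come from the commit message. The lookup checks for key presence rather than truthiness, because `null` is a meaningful value for `max_failure_extract`.

```python
from typing import Any

# New field name -> legacy field name, as listed in the commit message.
RENAMED_FIELDS = {
    "max_failure_extract": "max_extract",
    "optimizer_max_iterations": "max_attempts",
}


def get_with_fallback(raw: dict, new_key: str, old_key: str, default: Any = None) -> Any:
    """Prefer the new key, fall back to the legacy key, then to the default."""
    if new_key in raw:
        return raw[new_key]
    if old_key in raw:
        return raw[old_key]
    return default


def load_cycle_config(raw: dict) -> dict:
    """Return a config dict that only uses the new field names."""
    return {
        new_key: get_with_fallback(raw, new_key, old_key)
        for new_key, old_key in RENAMED_FIELDS.items()
    }
```

For example, `load_cycle_config({"max_attempts": 3})` yields `optimizer_max_iterations` set to 3 and `max_failure_extract` set to `None`.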
* fix(cycle): filter quality report by session IDs, tolerate traffic timeouts
Two issues broke run_cycle.sh at high traffic volumes (N>=100):
1. Quality report session contamination: quality_report.py used
--limit N --time-period 24h, which pulled sessions from prior
runs, pre-flight checks, and earlier cycles sharing the same
app_name. A 150-question run scored 315 sessions. Fix: add
--session-ids-file flag to quality_report.py that filters to
exactly the session IDs produced by the current traffic run.
run_cycle.sh now passes the saved eval results file (Steps 3
and 5) instead of relying on time-based queries.
2. Traffic timeouts abort the cycle: run_eval.py exits non-zero
when any case times out or errors. With set -euo pipefail,
this killed the entire cycle at Step 2. Fix: wrap both traffic
calls (Steps 2 and 5) in set +e so timeouts are logged but
don't abort the cycle. Timed-out cases have no session_id and
are naturally excluded from quality scoring.
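The session filter from item 1 can be illustrated roughly as below. The layout of the eval results file is an assumption (a JSON list of objects with an optional `session_id` field), and the function names are hypothetical; the real `quality_report.py` may read and filter differently.

```python
import json
from pathlib import Path


def session_ids_from_results(results: list[dict]) -> set[str]:
    """Collect session IDs, skipping cases that timed out and have none."""
    return {r["session_id"] for r in results if r.get("session_id")}


def load_session_ids(path: str) -> set[str]:
    """Read an eval results file (assumed to be a JSON list of case results)."""
    return session_ids_from_results(json.loads(Path(path).read_text()))


def filter_sessions(sessions: list[dict], allowed_ids: set[str]) -> list[dict]:
    """Keep only sessions produced by the current traffic run."""
    return [s for s in sessions if s.get("session_id") in allowed_ids]
```

Because timed-out cases carry no `session_id`, they never enter the allowed set, which matches the behaviour described in item 2.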
* fix(improvement-cycle): throttle concurrent LLM calls to prevent 429 rate-limit crashes
Add asyncio.Semaphore(5) to _generate_ground_truth and run_golden_eval
to cap concurrent Vertex AI requests. Previously all failed sessions
fired in parallel (100+), overwhelming the quota. Also catch per-session
errors in the teacher agent so a single 429 skips that session instead
of crashing the entire improvement step.
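The throttling pattern described here is roughly the following. It is a sketch of the general technique using `asyncio.Semaphore`, not the body of `_generate_ground_truth`; `run_throttled` and the `worker` callable are hypothetical names.

```python
import asyncio
import logging
from typing import Any, Awaitable, Callable, Iterable

logger = logging.getLogger(__name__)

MAX_CONCURRENT_CALLS = 5  # matches the Semaphore(5) in the commit message


async def run_throttled(
    items: Iterable[Any],
    worker: Callable[[Any], Awaitable[Any]],
    limit: int = MAX_CONCURRENT_CALLS,
) -> list[Any]:
    """Run worker(item) for every item with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item: Any) -> Any:
        async with semaphore:
            try:
                return await worker(item)
            except Exception as exc:  # e.g. a 429 from the API
                # Skip this item instead of failing the whole batch.
                logger.warning("skipping %r: %s", item, exc)
                return None

    results = await asyncio.gather(*(guarded(item) for item in items))
    return [r for r in results if r is not None]
```

All tasks are still created up front, but only five hold the semaphore at a time, so the remaining ones wait locally instead of hitting the quota.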
* small visual fix
* docs(scripts): add session filtering flags to quality_report README
Document --app-name, --session-ids-file, --output-json, and --threshold
flags. Add Filtering section explaining session ID and app name filtering.
* fix(cycle): truncate traffic to requested count, throttle eval concurrency
generate_traffic.py: Gemini can return more cases than requested (e.g.
201 instead of 100). Truncate to the requested count.
run_eval.py: add Semaphore(5) to cap concurrent agent calls. With 100+
cases firing simultaneously, requests queue behind rate limits and many
exceed the 120s per-case timeout.
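The truncation step amounts to a guarded slice. A minimal sketch, assuming the generated cases arrive as a list; `truncate_cases` is a hypothetical name, not a function from `generate_traffic.py`.

```python
import logging

logger = logging.getLogger(__name__)


def truncate_cases(cases: list, requested: int) -> list:
    """Drop any cases beyond the requested count (the model may over-generate)."""
    if requested < 0:
        raise ValueError("requested must be >= 0")
    if len(cases) > requested:
        logger.info("model returned %d cases, truncating to %d", len(cases), requested)
    return cases[:requested]
```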
* fix(metrics): display turn count as integer instead of float
Turn count showed "1.0 turns" instead of "1 turns". Add fmt=int config
to the turn_count metric and apply integer formatting in both baseline
and comparison table output.
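One way a per-metric `fmt=int` setting could drive the display is sketched below. The `METRICS` table, the `latency_s` entry, and `format_metric` are illustrative assumptions; only the `turn_count` metric and the `fmt=int` idea come from the commit message.

```python
# Hypothetical metric registry: "fmt" optionally names a type to coerce to.
METRICS = {
    "turn_count": {"unit": "turns", "fmt": int},
    "latency_s": {"unit": "s", "fmt": None},
}


def format_metric(name: str, value: float) -> str:
    """Render a metric value, applying integer formatting when configured."""
    spec = METRICS.get(name, {})
    if spec.get("fmt") is int:
        value = int(round(value))
    unit = spec.get("unit", "")
    return f"{value} {unit}".strip()


print(format_metric("turn_count", 1.0))  # prints "1 turns"
```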
* fix(cycle): limit optimizer ground truth to extracted cases, fix turn_count display
improver_agent.py: filter report sessions to only the extracted cases
before passing to the LoopAgent. Previously max_failure_extract="auto"
extracted 12 cases into the golden set but the optimizer generated
ground truth for all 42 failures -- wasteful and rate-limit prone.
operational_metrics.py: round turn_count to int at the data level so
it displays as "1 turns" instead of "1.0 turns".
* refactor(cycle): replace print statements with structured logging, add timestamps to shell steps
Centralizes logging config in __init__.py, suppresses noisy third-party
loggers, and adds HH:MM:SS timestamps to run_cycle.sh for easier
debugging of long runs. Updates docs to reflect auto failure extraction.
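A centralized logging setup of this kind might look like the sketch below, using only the standard `logging` module. The function name `configure_logging` and the exact format string are assumptions; the logger names are the ones this commit lists as noisy.

```python
import logging

# Third-party loggers named in this commit as noisy.
NOISY_LOGGERS = ("google_genai", "google_adk", "httpx", "authlib")


def configure_logging(level: int = logging.INFO) -> None:
    """Configure root logging once and quiet chatty third-party loggers."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        datefmt="%H:%M:%S",  # HH:MM:SS, like the shell step timestamps
        force=True,  # replace any handlers installed earlier
    )
    for name in NOISY_LOGGERS:
        logging.getLogger(name).setLevel(logging.WARNING)
```

Modules then call `logging.getLogger(__name__)` and use `logger.info(...)` where they previously used `print`.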
* fix(cycle): route outputs to timestamped dirs, suppress SDK log noise
- Add --output-dir to run_eval.py and run_improvement.py so all artifacts
(eval results, quality reports, ground truth) land in the per-run
timestamped reports/ subdirectory instead of scattered locations
- Suppress noisy SDK loggers (google_genai, google_adk, httpx, authlib)
- Strip ANSI escape codes from run.log
- Make quality threshold configurable
- Update README and DEMO_NARRATION to match
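Two of the items above, the per-run timestamped directory and the ANSI stripping, can be sketched as follows. The directory name format and all function names are assumptions; the commit message only states that artifacts land in a timestamped `reports/` subdirectory and that escape codes are removed from `run.log`.

```python
import re
from datetime import datetime
from pathlib import Path
from typing import Optional

# Matches ANSI CSI sequences such as "\x1b[32m" (colour) or "\x1b[0m" (reset).
ANSI_CSI_RE = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def strip_ansi(text: str) -> str:
    """Remove ANSI escape sequences so log files stay readable."""
    return ANSI_CSI_RE.sub("", text)


def run_dir_path(base: str, now: datetime) -> Path:
    """Build the per-run directory path (the name format is an assumption)."""
    return Path(base) / now.strftime("%Y%m%d_%H%M%S")


def make_run_dir(base: str = "reports", now: Optional[datetime] = None) -> Path:
    """Create and return the directory that receives all artifacts of one run."""
    run_dir = run_dir_path(base, now or datetime.now())
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir
```

A script that accepts `--output-dir` would then write eval results, quality reports, and ground truth under the path returned by `make_run_dir()`.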
* style: autoformat __init__.py and run_eval.py
1 parent c0bc97b commit 372662d
17 files changed
Lines changed: 781 additions & 251 deletions
File tree
- examples/agent_improvement_cycle
- agent_improvement
- eval
- scripts
Lines changed: 37 additions & 0 deletions
Lines changed: 6 additions & 4 deletions