Commit 5db449c
feat(evaluate): LLM-judge FAIL output with bounded feedback + --strict docs (#86)
* fix(llm-judge): full prompt parity + execution_mode/fallback_reason (#42)
* fix(llm-judge): full prompt parity + execution_mode/fallback_reason
Two coupled fixes for LLM-as-Judge: the BQ-native paths now see the
full Python prompt template, and the resulting report stamps which
path actually ran (and why earlier tiers fell back when applicable).
## Prompt parity (F1)
`_ai_generate_judge` and `_bqml_judge` previously sent only
`prompt_template.split("{trace_text}")[0]` to BigQuery — i.e., the
prefix up to the first placeholder. Everything after `{trace_text}`
in the Python template (including the per-criterion JSON output
spec the judge model needs to score consistently) was silently
dropped on the SQL paths. That made AI.GENERATE / ML.GENERATE_TEXT
score against a different prompt than the API-fallback path, which
uses the whole template via `str.format(...)`.
Fix:
- New helper `evaluators.split_judge_prompt_template(template)` that
format()s the template with `\x00`-bracketed sentinels for both
placeholders, then partitions the result into
`(prefix, middle, suffix)`. Sentinels avoid clashing with literal
template content; running the format pass ensures `{{...}}`
escapes are correctly un-escaped before partitioning, so the SQL
CONCAT sees the same string the API path produces.
- `AI_GENERATE_JUDGE_BATCH_QUERY` and `_LEGACY_LLM_JUDGE_BATCH_QUERY`
now `CONCAT(@judge_prompt_prefix, trace_text, @judge_prompt_middle,
COALESCE(final_response, 'N/A'), @judge_prompt_suffix)` — three
parameters instead of one.
- Both judge methods in `client.py` swap `judge_prompt` for the
three new parameters.
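The sentinel-split approach above can be sketched as follows. This is an illustrative reconstruction, not the shipped implementation; the helper name comes from the commit text but the body is an assumption:

```python
def split_judge_prompt_template(template: str) -> tuple[str, str, str]:
    """Split a judge prompt template into (prefix, middle, suffix).

    Running str.format() first un-escapes any literal ``{{...}}`` braces,
    so the three parts, concatenated around the real trace/response
    values, reproduce exactly what the API-fallback path produces.
    """
    # \x00 sentinels cannot clash with legitimate template content.
    TRACE = "\x00TRACE\x00"
    RESPONSE = "\x00RESPONSE\x00"
    rendered = template.format(trace_text=TRACE, final_response=RESPONSE)
    prefix, _, rest = rendered.partition(TRACE)
    middle, _, suffix = rest.partition(RESPONSE)
    return prefix, middle, suffix
```

The SQL side then rebuilds the full prompt as `CONCAT(prefix, trace_text, middle, final_response, suffix)`, so both paths score against the same string.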
## execution_mode + fallback_reason (F2/F3)
`_evaluate_llm_judge` now stamps `report.details["execution_mode"]`
with one of `ai_generate`, `ml_generate_text`, `api_fallback`, or
`no_op` — matching the value-space the categorical evaluator
already uses. When an earlier tier raises before a later tier
succeeds, `report.details["fallback_reason"]` carries the chained
exception messages in attempt order so CI gates and dashboards can
audit which path actually ran. Categorical-style underscore naming
is intentional — anyone reading both LLM-judge and categorical
reports sees the same vocabulary.
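The tiered cascade can be sketched like this. Tier names match the commit text; the function shape and error handling are assumptions for illustration:

```python
def run_judge_cascade(tiers):
    """Try each (name, fn) tier in order; stamp which tier ran and
    why earlier tiers fell back, in attempt order."""
    reasons = []
    for name, fn in tiers:
        try:
            result = fn()
        except Exception as exc:  # a real implementation would narrow this
            reasons.append(f"{name}: {exc}")
            continue
        details = {"execution_mode": name}
        if reasons:
            details["fallback_reason"] = "; ".join(reasons)
        return result, details
    # Every tier failed: report no_op with the full failure chain.
    return None, {"execution_mode": "no_op",
                  "fallback_reason": "; ".join(reasons)}
```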
## Tests
- Prompt parity: assert AI.GENERATE and ML.GENERATE_TEXT receive
three `judge_prompt_{prefix,middle,suffix}` params instead of the
single `judge_prompt`, and that concatenation reproduces the
full Python template (including the JSON output spec).
- Execution mode: assert each of ai_generate / ml_generate_text /
api_fallback fires under the right cascade conditions, and that
`fallback_reason` names the prior tiers in attempt order.
- `split_judge_prompt_template`: round-trip, missing-placeholder
fallback paths, full-template-as-prefix when neither placeholder
is present.
CHANGELOG entry added under `[Unreleased]`.
Required publish blocker for blog post #3 (#82). PR #2 in this
series will tighten the LLM-judge `evaluate --exit-code` FAIL
output to surface criterion + threshold + bounded `llm_feedback`
snippet.
Ref: #82, #51.
* fix(llm-judge): keep synthesized labels next to their values
Reviewer flagged that ``split_judge_prompt_template`` mishandles
custom templates with one placeholder. The SQL CONCAT runs
prefix ++ trace_text ++ middle ++ final_response ++ suffix
so a synthesized label for an absent placeholder must end up
*immediately before* the value it labels. Earlier fallback
branches placed the labels on the wrong side:
- ``{trace_text}`` only — ``Response:`` label landed AFTER the
injected response value.
- ``{final_response}`` only — ``Trace:`` label landed AFTER the
injected trace value, and the user's prompt prose ended up
AFTER the trace instead of before it.
- No placeholders — labels appended after their values for both
trace and response.
Built-in correctness/hallucination/sentiment templates were
unaffected because they declare both placeholders explicitly, so
the dual-placeholder branch (which was correct) handled them.
Fix: rewrite the three fallback branches so the SQL CONCAT yields
``...<user prompt>...\nTrace:\n<TRACE>\nResponse:\n<RESPONSE>...``
in every case. Updated the docstring to document the rebuild
contract precisely.
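A sketch of the rebuild contract for the fallback branches. The branch structure mirrors the description above, but the code itself is an illustrative assumption, not the shipped helper:

```python
def split_with_synthesized_labels(template: str) -> tuple[str, str, str]:
    """Return (prefix, middle, suffix) for the SQL CONCAT
    prefix + trace + middle + response + suffix, synthesizing a label
    immediately BEFORE any value whose placeholder is absent."""
    has_trace = "{trace_text}" in template
    has_response = "{final_response}" in template
    if has_trace and has_response:
        # Dual-placeholder branch: already correct before the fix.
        prefix, _, rest = template.partition("{trace_text}")
        middle, _, suffix = rest.partition("{final_response}")
        return prefix, middle, suffix
    if has_trace:
        # Only {trace_text}: Response label goes at the END of middle,
        # immediately before the injected response value.
        prefix, _, rest = template.partition("{trace_text}")
        return prefix, rest + "\nResponse:\n", ""
    if has_response:
        # Only {final_response}: user prose first, then Trace label +
        # value, then Response label immediately before its value.
        prefix, _, suffix = template.partition("{final_response}")
        return prefix + "\nTrace:\n", "\nResponse:\n", suffix
    # No placeholders: whole template is prose; labels precede values.
    return template + "\nTrace:\n", "\nResponse:\n", ""
```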
Tests: replaced the two-segment "label is somewhere in suffix"
assertions with full-rebuild assertions that verify ordering of
user prose, ``Trace:`` label, trace value, ``Response:`` label,
and response value. Three new regression tests cover all three
fallback branches.
Ref: PR #42 review feedback.
* feat(evaluate): LLM-judge FAIL output with bounded feedback + --strict docs (#44)
* feat(evaluate): LLM-judge FAIL output with bounded feedback + --strict docs
Builds on the runtime-behavior fixes from #42 (now merged on main):
the report shape this PR depends on — ``SessionScore.llm_feedback``
populated by both BQ-native and API-fallback judge paths, and
``EvaluationReport.details["execution_mode"]`` distinguishing the
three tiers — is stable as of 0.2.2 + #42.
## F5 — feedback snippet on LLM-judge FAIL lines
``_emit_evaluate_failures`` now appends a bounded
``feedback="…"`` field after ``score`` / ``threshold`` whenever a
failing session's ``SessionScore.llm_feedback`` is non-empty. The
snippet:
- Collapses internal whitespace runs (newlines, tabs, doubled
spaces) to a single space, so multi-line judge justifications
stay on one CI log line.
- Truncates at 120 characters with a U+2026 ellipsis when the
collapsed string is longer.
- Is omitted entirely for code-based metrics (their
``llm_feedback`` is ``None``), so post #2's deterministic FAIL
output stays visually identical.
Pulled the formatter into a small public-internal helper
(``_format_feedback_snippet``) so the CLI tests can exercise it
directly without round-tripping through Click.
The safety-net "no per-metric detail available" branch also
carries the snippet now — failing sessions with no metric details
still surface the judge's reason rather than a generic placeholder.
## F4 — ``--strict`` help/docs match shipped behavior
The previous ``--strict`` help text ("Fail sessions with
unparseable judge output") promised a broader effect than what
ships. In reality:
- API-fallback parse errors already coerce to ``score=0.0`` and
fail any non-zero threshold without ``--strict``.
- ``--strict`` only changes AI.GENERATE rows whose typed output is
empty/NULL: without it they're silently skipped (the default
``passed=True`` for empty-scores SessionScores); with it they're
explicitly failed and counted under ``report.details``.
Updated the CLI help string to name AI.GENERATE specifically and
to call out that API-fallback parse errors don't need the flag.
Rewrote ``SDK.md §4 Strict Mode`` to lead with the AI.GENERATE
path scope before getting into operational counters.
## Tests (10 new)
- LLM-judge FAIL line carries the feedback snippet, with the
judge's actual prose visible in CI output.
- Long feedback truncates at the 120-char cap with U+2026.
- Multi-line feedback collapses to a single CI log line.
- Code-based metric failures emit no ``feedback`` field
(regression guard against bleed from the LLM path).
- Direct unit tests on ``_format_feedback_snippet``: ``None`` /
empty / whitespace-only inputs return ``None``; short inputs
pass through; whitespace runs collapse; default and custom
``max_chars`` truncation respected.
CHANGELOG entry added under ``[Unreleased]`` (Added + Changed
sections).
Required publish blocker for blog post #3 (#82). Pairs with #42 as
PR 2 of 2 in that series.
Ref: #82, #51.
* docs(strict): describe parse-error visibility, not pass/fail flipping
Reviewer flagged that the prior --strict rewrite still mischaracterized
the AI.GENERATE behavior: it claimed strict=False silently skips
empty/NULL typed-output rows and leaves them passing, and that
strict=True flips them to failing. That's wrong on both ends.
Actual code path:
- ``_ai_generate_judge`` (and ``_bqml_judge``) compute
``passed = bool(scores) and all(score >= threshold ...)``.
Empty-scores rows already have passed=False — locked in by
TestFalsePassFix.test_empty_score_fails on the runtime side.
- ``_apply_strict_mode`` walks the report and adds
``details['parse_error']=True`` to every empty-scores session
plus a report-level ``parse_errors`` / ``parse_error_rate``
counter. It does not change any session's pass/fail status.
- The API-fallback path coerces malformed output to ``score=0.0``,
so its parse failures fail as low-score failures and don't
surface through ``--strict``.
Net: ``--strict`` is a visibility knob, not a gate-affecting flag.
For pass/fail-only CI consumers it's a no-op; for dashboards or
investigations it lets you tell ``low score`` failures apart from
``no parseable score`` failures.
Updated:
- ``cli.py:strict`` help text — replaces the misleading
"fail/silently-skipped" framing with the parse-error metadata
framing the code actually implements.
- ``SDK.md §4 Strict Mode`` — leads with "adds parse-error
visibility, does not flip pass/fail," then enumerates exactly
what ``_apply_strict_mode`` does.
- ``CHANGELOG.md`` ``[Unreleased]`` entry — same correction.
No code or test changes — only docs/help. Existing pytest run
still 2080 passed, 4 skipped.
Ref: PR #44 review feedback.
---------
Co-authored-by: Haiyuan Cao <haiyuan@google.com>

1 parent 7baa2b0
3 files changed
Lines changed: 321 additions & 10 deletions