
Commit 5db449c

feat(evaluate): LLM-judge FAIL output with bounded feedback + --strict docs (#86)
* fix(llm-judge): full prompt parity + execution_mode/fallback_reason (#42)

* fix(llm-judge): full prompt parity + execution_mode/fallback_reason

Two coupled fixes for LLM-as-Judge: the BQ-native paths now see the full Python prompt template, and the resulting report stamps which path actually ran (and why earlier tiers fell back when applicable).

## Prompt parity (F1)

`_ai_generate_judge` and `_bqml_judge` previously sent only `prompt_template.split("{trace_text}")[0]` to BigQuery — i.e., the prefix up to the first placeholder. Everything after `{trace_text}` in the Python template (including the per-criterion JSON output spec the judge model needs to score consistently) was silently dropped on the SQL paths. That made AI.GENERATE / ML.GENERATE_TEXT score against a different prompt than the API-fallback path, which uses the whole template via `str.format(...)`.

Fix:

- New helper `evaluators.split_judge_prompt_template(template)` that format()s the template with `\x00`-bracketed sentinels for both placeholders, then partitions the result into `(prefix, middle, suffix)`. Sentinels avoid clashing with literal template content; running the format pass ensures `{{...}}` escapes are correctly un-escaped before partitioning, so the SQL CONCAT sees the same string the API path produces.
- `AI_GENERATE_JUDGE_BATCH_QUERY` and `_LEGACY_LLM_JUDGE_BATCH_QUERY` now `CONCAT(@judge_prompt_prefix, trace_text, @judge_prompt_middle, COALESCE(final_response, 'N/A'), @judge_prompt_suffix)` — three parameters instead of one.
- Both judge methods in `client.py` swap `judge_prompt` for the three new parameters.

## execution_mode + fallback_reason (F2/F3)

`_evaluate_llm_judge` now stamps `report.details["execution_mode"]` with one of `ai_generate`, `ml_generate_text`, `api_fallback`, or `no_op` — matching the value-space the categorical evaluator already uses. When an earlier tier raises before a later tier succeeds, `report.details["fallback_reason"]` carries the chained exception messages in attempt order so CI gates and dashboards can audit which path actually ran. Categorical-style underscore naming is intentional — so readers of both LLM-judge and categorical reports see the same vocabulary.

## Tests

- Prompt parity: assert AI.GENERATE and ML.GENERATE_TEXT receive three `judge_prompt_{prefix,middle,suffix}` params instead of the single `judge_prompt`, and that concatenation reproduces the full Python template (including the JSON output spec).
- Execution mode: assert each of ai_generate / ml_generate_text / api_fallback fires under the right cascade conditions, and that `fallback_reason` names the prior tiers in attempt order.
- `split_judge_prompt_template`: round-trip, missing-placeholder fallback paths, full-template-as-prefix when neither placeholder is present.

CHANGELOG entry added under `[Unreleased]`. Required publish blocker for blog post #3 (#82). PR #2 in this series will tighten the LLM-judge `evaluate --exit-code` FAIL output to surface criterion + threshold + bounded `llm_feedback` snippet.

Ref: #82, #51.

* fix(llm-judge): keep synthesized labels next to their values

Reviewer flagged that ``split_judge_prompt_template`` mishandles custom templates with one placeholder. The SQL CONCAT runs prefix ++ trace_text ++ middle ++ final_response ++ suffix, so a synthesized label for an absent placeholder must end up *immediately before* the value it labels.

Earlier fallback branches placed the labels on the wrong side:

- ``{trace_text}`` only — ``Response:`` label landed AFTER the injected response value.
- ``{final_response}`` only — ``Trace:`` label landed AFTER the injected trace value, and the user's prompt prose ended up AFTER the trace instead of before it.
- No placeholders — labels appended after their values for both trace and response.

Built-in correctness/hallucination/sentiment templates were unaffected because they declare both placeholders explicitly, so the dual-placeholder branch (which was correct) handled them.

Fix: rewrite the three fallback branches so the SQL CONCAT yields ``...<user prompt>...\nTrace:\n<TRACE>\nResponse:\n<RESPONSE>...`` in every case. Updated the docstring to document the rebuild contract precisely.

Tests: replaced the two-segment "label is somewhere in suffix" assertions with full-rebuild assertions that verify ordering of user prose, ``Trace:`` label, trace value, ``Response:`` label, and response value. Three new regression tests cover all three fallback branches.

Ref: PR #42 review feedback.

* feat(evaluate): LLM-judge FAIL output with bounded feedback + --strict docs (#44)

* feat(evaluate): LLM-judge FAIL output with bounded feedback + --strict docs

Builds on the runtime-behavior fixes from #42 (now merged on main): the report shape this PR depends on — ``SessionScore.llm_feedback`` populated by both BQ-native and API-fallback judge paths, and ``EvaluationReport.details["execution_mode"]`` distinguishing the three tiers — is stable as of 0.2.2 + #42.

## F5 — feedback snippet on LLM-judge FAIL lines

``_emit_evaluate_failures`` now appends a bounded ``feedback="…"`` field after ``score`` / ``threshold`` whenever a failing session's ``SessionScore.llm_feedback`` is non-empty. The snippet:

- Collapses internal whitespace runs (newlines, tabs, doubled spaces) to a single space, so multi-line judge justifications stay on one CI log line.
- Truncates at 120 characters with a U+2026 ellipsis when the collapsed string is longer.
- Is omitted entirely for code-based metrics (their ``llm_feedback`` is ``None``), so post #2's deterministic FAIL output stays visually identical.

Pulled the formatter into a small public-internal helper (``_format_feedback_snippet``) so the CLI tests can exercise it directly without round-tripping through Click. The safety-net "no per-metric detail available" branch also carries the snippet now — failing sessions with no metric details still surface the judge's reason rather than a generic placeholder.

## F4 — ``--strict`` help/docs match shipped behavior

The previous ``--strict`` help text ("Fail sessions with unparseable judge output") promised a broader effect than what ships. In reality:

- API-fallback parse errors already coerce to ``score=0.0`` and fail any non-zero threshold without ``--strict``.
- ``--strict`` only changes AI.GENERATE rows whose typed output is empty/NULL: without it they're silently skipped (the default ``passed=True`` for empty-scores SessionScores); with it they're explicitly failed and counted under ``report.details``.

Updated the CLI help string to name AI.GENERATE specifically and to call out that API-fallback parse errors don't need the flag. Rewrote ``SDK.md §4 Strict Mode`` to lead with the AI.GENERATE path scope before getting into operational counters.

## Tests (10 new)

- LLM-judge FAIL line carries the feedback snippet, with the judge's actual prose visible in CI output.
- Long feedback truncates at the 120-char cap with U+2026.
- Multi-line feedback collapses to a single CI log line.
- Code-based metric failures emit no ``feedback`` field (regression guard against bleed from the LLM path).
- Direct unit tests on ``_format_feedback_snippet``: ``None`` / empty / whitespace-only inputs return ``None``; short inputs pass through; whitespace runs collapse; default and custom ``max_chars`` truncation respected.

CHANGELOG entry added under ``[Unreleased]`` (Added + Changed sections). Required publish blocker for blog post #3 (#82). Pairs with #42 as PR 2 of 2 in that series.

Ref: #82, #51.

* docs(strict): describe parse-error visibility, not pass/fail flipping

Reviewer flagged that the prior --strict rewrite still mischaracterized the AI.GENERATE behavior: it claimed strict=False silently skips empty/NULL typed-output rows and leaves them passing, then claimed strict=True flips them to failing. That's wrong on both ends. Actual code path:

- ``_ai_generate_judge`` (and ``_bqml_judge``) compute ``passed = bool(scores) and all(score >= threshold ...)``. Empty-scores rows already have passed=False — locked in by TestFalsePassFix.test_empty_score_fails on the runtime side.
- ``_apply_strict_mode`` walks the report and adds ``details['parse_error']=True`` to every empty-scores session plus a report-level ``parse_errors`` / ``parse_error_rate`` counter. It does not change any session's pass/fail status.
- The API-fallback path coerces malformed output to ``score=0.0``, so its parse failures fail as low-score failures and don't surface through ``--strict``.

Net: ``--strict`` is a visibility knob, not a gate-affecting flag. For pass/fail-only CI consumers it's a no-op; for dashboards or investigations it lets you tell ``low score`` failures apart from ``no parseable score`` failures.

Updated:

- ``cli.py:strict`` help text — replaces the misleading "fail/silently-skipped" framing with the parse-error metadata framing the code actually implements.
- ``SDK.md §4 Strict Mode`` — leads with "adds parse-error visibility, does not flip pass/fail," then enumerates exactly what ``_apply_strict_mode`` does.
- ``CHANGELOG.md`` ``[Unreleased]`` entry — same correction.

No code or test changes — only docs/help. Existing pytest run still 2080 passed, 4 skipped.

Ref: PR #44 review feedback.

---------

Co-authored-by: Haiyuan Cao <haiyuan@google.com>
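For illustration, a minimal sketch of the split-and-rebuild contract described under "Prompt parity (F1)" above. It covers only the dual-placeholder case; the sentinel constants and the function body are simplified stand-ins, not the shipped `evaluators` implementation:

```python
# Simplified sketch of the split/rebuild idea (dual-placeholder templates only).
# Sentinel values and body are illustrative, not the shipped code.
TRACE_SENTINEL = "\x00TRACE\x00"
RESPONSE_SENTINEL = "\x00RESPONSE\x00"


def split_judge_prompt_template(template: str) -> tuple[str, str, str]:
    """Split a judge prompt into (prefix, middle, suffix) around its placeholders."""
    # format() first, so literal {{...}} escapes are un-escaped exactly as the
    # API-fallback path would see them.
    rendered = template.format(
        trace_text=TRACE_SENTINEL, final_response=RESPONSE_SENTINEL
    )
    prefix, _, rest = rendered.partition(TRACE_SENTINEL)
    middle, _, suffix = rest.partition(RESPONSE_SENTINEL)
    return prefix, middle, suffix


# BigQuery then rebuilds the full prompt as
#   CONCAT(@judge_prompt_prefix, trace_text, @judge_prompt_middle,
#          COALESCE(final_response, 'N/A'), @judge_prompt_suffix)
# which round-trips the original template:
prefix, middle, suffix = split_judge_prompt_template(
    "Rate the agent.\nTrace:\n{trace_text}\nResponse:\n{final_response}\nReturn JSON."
)
assert prefix + "<TRACE>" + middle + "<RESPONSE>" + suffix == (
    "Rate the agent.\nTrace:\n<TRACE>\nResponse:\n<RESPONSE>\nReturn JSON."
)
```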
1 parent 7baa2b0 commit 5db449c

3 files changed

Lines changed: 321 additions & 10 deletions


SDK.md

Lines changed: 30 additions & 1 deletion
@@ -275,7 +275,36 @@ print(report.summary())
 
 ### Strict Mode
 
-When `strict=True`, sessions where the LLM judge returns empty or unparseable output are marked as **failed** instead of silently passing. Operational counters are placed in `report.details` (not `aggregate_scores`) so downstream consumers can treat scores as purely normalized metrics:
+`strict=True` adds **parse-error visibility** — it does not flip
+any session's pass/fail outcome. Both BQ-native judge methods set
+`passed = bool(scores) and all(score >= threshold for score in
+scores.values())`, so a row whose `scores` dict is empty (the
+judge model returned no parseable output) already fails. Without
+`strict=True` you can't tell from the report whether a failed
+session failed because the judge gave a low score or because the
+judge gave nothing parseable at all.
+
+`strict=True` walks the merged report and:
+
+- Stamps `SessionScore.details["parse_error"] = True` on every
+  session whose `scores` dict is empty.
+- Adds a report-level `details["parse_errors"]` count plus
+  `details["parse_error_rate"]` (fraction of `total_sessions`).
+
+The API-fallback path coerces malformed model output to
+`score=0.0` and always populates `scores`, so its failures look
+like low-score failures rather than parse errors. `strict=True`
+won't surface them as parse errors today; it's an AI.GENERATE /
+ML.GENERATE_TEXT visibility knob in practice.
+
+For pass/fail-only consumers (CI gates with `--exit-code`),
+`strict=True` is a no-op. Reach for it when a dashboard or
+investigation needs to distinguish "no parseable score" from
+"low score" failures.
+
+Operational counters are placed in `report.details` (not
+`aggregate_scores`) so downstream consumers can treat scores as
+purely normalized metrics:
 
 ```python
 report = client.evaluate(
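To make the consumer side concrete, here is a hedged sketch of how a dashboard or investigation script might read those counters. The helper name and call shape are hypothetical; only the field names (`details["parse_errors"]`, `details["parse_error_rate"]`, and the per-session `details["parse_error"]` stamp) come from the docs above:

```python
# Hypothetical consumer-side helper (not part of the SDK); it relies only on
# the report fields documented above.
def summarize_parse_errors(report) -> None:
    parse_errors = report.details.get("parse_errors", 0)
    if not parse_errors:
        return
    rate = report.details.get("parse_error_rate", 0.0)
    print(f"{parse_errors} session(s) returned no parseable judge output ({rate:.0%})")
    for s in report.session_scores:
        if s.details.get("parse_error"):
            # These sessions failed because nothing parseable came back,
            # not because the judge scored them below the threshold.
            print(f"  parse error: session={s.session_id}")
```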

src/bigquery_agent_analytics/cli.py

Lines changed: 54 additions & 9 deletions
@@ -302,7 +302,14 @@ def evaluate(
     ),
     strict: bool = typer.Option(
         False,
-        help="Fail sessions with unparseable judge output.",
+        help=(
+            "Stamp parse-error metadata on AI.GENERATE judge rows with"
+            " empty or NULL typed output. Those rows already fail"
+            " (empty score < threshold); --strict adds"
+            " details['parse_error']=True and a report-level"
+            " parse_errors counter so dashboards can tell 'no"
+            " parseable score' apart from 'low score' failures."
+        ),
     ),
     endpoint: Optional[str] = typer.Option(
         None,
@@ -368,6 +375,31 @@ def evaluate(
         raise typer.Exit(code=2)
 
 
+_FEEDBACK_SNIPPET_MAX = 120
+
+
+def _format_feedback_snippet(
+    feedback: Optional[str], max_chars: int = _FEEDBACK_SNIPPET_MAX
+) -> Optional[str]:
+    """Return a single-line, bounded snippet of an LLM-judge justification.
+
+    Collapses internal whitespace runs (including newlines) to a single
+    space so the snippet fits on one CI log line, then truncates to
+    ``max_chars`` with a trailing ``…`` when the original was longer.
+    Returns ``None`` for empty / whitespace-only input so callers can
+    cleanly skip the field.
+    """
+    if not feedback:
+        return None
+    collapsed = " ".join(feedback.split())
+    if not collapsed:
+        return None
+    if len(collapsed) <= max_chars:
+        return collapsed
+    # Reserve one char for the ellipsis to keep the visual width capped.
+    return collapsed[: max_chars - 1].rstrip() + "\u2026"
+
+
 def _emit_evaluate_failures(
     report: EvaluationReport, max_sessions: int = 10
 ) -> None:
@@ -377,10 +409,14 @@ def _emit_evaluate_failures(
     Prefers the raw observed + budget pair (``CodeEvaluator`` prebuilts);
     falls back to score + threshold when the metric didn't declare
     observed/budget (custom ``add_metric`` users, ``LLMAsJudge``
-    criteria). A failing session is guaranteed to produce at least one
-    FAIL line — never just the summary header.
-
-    Capped at ``max_sessions`` most-recent failures so CI logs stay scannable.
+    criteria). For LLM-judge failures the line also carries a bounded
+    ``feedback="…"`` snippet drawn from ``SessionScore.llm_feedback``
+    so CI logs explain *why* the judge said the session failed without
+    forcing the reader to chase the JSON output.
+
+    A failing session is guaranteed to produce at least one FAIL line —
+    never just the summary header. Capped at ``max_sessions`` most-recent
+    failures so CI logs stay scannable.
     """
     failed = [s for s in report.session_scores if not s.passed]
     if not failed:
@@ -393,6 +429,7 @@ def _emit_evaluate_failures(
     )
     shown = failed[:max_sessions]
     for s in shown:
+        feedback_snippet = _format_feedback_snippet(s.llm_feedback)
         emitted_for_session = False
         for metric_name, score in s.scores.items():
             detail = s.details.get(f"metric_{metric_name}") or {}
@@ -433,6 +470,12 @@ def _emit_evaluate_failures(
             parts.append(f"score={score:.4g}")
             if threshold is not None and isinstance(threshold, (int, float)):
                 parts.append(f"threshold={threshold:.4g}")
+            # LLM judges populate ``SessionScore.llm_feedback`` with the
+            # judge's justification. Surface a bounded snippet on the FAIL
+            # line so CI logs explain *why* without dumping the full JSON.
+            # Code-based metrics leave ``llm_feedback`` empty and skip this.
+            if feedback_snippet is not None:
+                parts.append(f'feedback="{feedback_snippet}"')
             typer.echo(" " + " ".join(parts), err=True)
             emitted_for_session = True
 
@@ -441,10 +484,12 @@ def _emit_evaluate_failures(
         # while the session itself is flagged failed (a bug upstream) — we
         # still point the reader at the session id.
         if not emitted_for_session:
-            typer.echo(
-                f" FAIL session={s.session_id} (no per-metric detail available)",
-                err=True,
-            )
+            fallback = f" FAIL session={s.session_id}"
+            if feedback_snippet is not None:
+                fallback += f' feedback="{feedback_snippet}"'
+            else:
+                fallback += " (no per-metric detail available)"
+            typer.echo(fallback, err=True)
     if len(failed) > len(shown):
         typer.echo(
             f" ... {len(failed) - len(shown)} more failing session(s) "

tests/test_cli.py

Lines changed: 237 additions & 0 deletions
@@ -420,6 +420,202 @@ def test_evaluate_exit_code_emits_fallback_with_no_details(self, mock_build):
         assert "metric=legacy_metric" in combined
         assert "score=0.3" in combined
 
+    @patch("bigquery_agent_analytics.cli._build_client")
+    def test_evaluate_exit_code_llm_judge_emits_feedback_snippet(
+        self, mock_build
+    ):
+        """LLM-judge failures expose ``SessionScore.llm_feedback`` in the
+        FAIL line as a bounded ``feedback="..."`` snippet.
+
+        Without this, post #2's deterministic FAIL output story carries
+        over to LLM-judge, but the differentiator vs. a hand-rolled judge
+        ("the score is *explained*") has nothing visible in CI logs.
+        """
+        report = EvaluationReport(
+            dataset="test",
+            evaluator_name="correctness_judge",
+            total_sessions=1,
+            passed_sessions=0,
+            failed_sessions=1,
+            created_at=_NOW,
+            session_scores=[
+                SessionScore(
+                    session_id="bad",
+                    scores={"correctness": 0.3},
+                    passed=False,
+                    details={},
+                    llm_feedback=(
+                        "The agent confirmed a booking but the booking"
+                        " tool never ran for that session."
+                    ),
+                ),
+            ],
+        )
+        client = MagicMock()
+        client.evaluate.return_value = report
+        mock_build.return_value = client
+
+        result = runner.invoke(
+            app,
+            [
+                "evaluate",
+                "--project-id=proj",
+                "--dataset-id=ds",
+                "--evaluator=llm-judge",
+                "--criterion=correctness",
+                "--exit-code",
+            ],
+        )
+        assert result.exit_code == 1
+        combined = (result.stderr or "") + (result.output or "")
+        # Existing fields still present.
+        assert "FAIL session=bad" in combined
+        assert "metric=correctness" in combined
+        assert "score=0.3" in combined
+        # Feedback snippet appears, quoted, with the actual justification.
+        assert 'feedback="' in combined
+        assert "booking tool never ran" in combined
+
+    @patch("bigquery_agent_analytics.cli._build_client")
+    def test_evaluate_exit_code_llm_judge_truncates_long_feedback(
+        self, mock_build
+    ):
+        """Justifications longer than the snippet bound are truncated with U+2026."""
+        long_feedback = "word " * 200  # ~1000 chars
+        report = EvaluationReport(
+            dataset="test",
+            evaluator_name="correctness_judge",
+            total_sessions=1,
+            passed_sessions=0,
+            failed_sessions=1,
+            created_at=_NOW,
+            session_scores=[
+                SessionScore(
+                    session_id="bad",
+                    scores={"correctness": 0.0},
+                    passed=False,
+                    details={},
+                    llm_feedback=long_feedback,
+                ),
+            ],
+        )
+        client = MagicMock()
+        client.evaluate.return_value = report
+        mock_build.return_value = client
+
+        result = runner.invoke(
+            app,
+            [
+                "evaluate",
+                "--project-id=proj",
+                "--dataset-id=ds",
+                "--evaluator=llm-judge",
+                "--exit-code",
+            ],
+        )
+        assert result.exit_code == 1
+        combined = (result.stderr or "") + (result.output or "")
+        # Look at the FAIL line itself: feedback="...". The snippet stays
+        # under the configured cap (120 chars between the quotes).
+        fail_line = next(
+            line for line in combined.splitlines() if line.startswith(" FAIL")
+        )
+        assert 'feedback="' in fail_line
+        quoted = fail_line.split('feedback="', 1)[1].rsplit('"', 1)[0]
+        assert len(quoted) <= 120
+        assert quoted.endswith("\u2026")
+
+    @patch("bigquery_agent_analytics.cli._build_client")
+    def test_evaluate_exit_code_collapses_newlines_in_feedback(self, mock_build):
+        """Multi-line judge feedback collapses to a single CI log line."""
+        report = EvaluationReport(
+            dataset="test",
+            evaluator_name="correctness_judge",
+            total_sessions=1,
+            passed_sessions=0,
+            failed_sessions=1,
+            created_at=_NOW,
+            session_scores=[
+                SessionScore(
+                    session_id="bad",
+                    scores={"correctness": 0.2},
+                    passed=False,
+                    details={},
+                    llm_feedback="Line one.\nLine two.\n\nLine three.",
+                ),
+            ],
+        )
+        client = MagicMock()
+        client.evaluate.return_value = report
+        mock_build.return_value = client
+
+        result = runner.invoke(
+            app,
+            [
+                "evaluate",
+                "--project-id=proj",
+                "--dataset-id=ds",
+                "--evaluator=llm-judge",
+                "--exit-code",
+            ],
+        )
+        assert result.exit_code == 1
+        combined = (result.stderr or "") + (result.output or "")
+        fail_line = next(
+            line for line in combined.splitlines() if line.startswith(" FAIL")
+        )
+        quoted = fail_line.split('feedback="', 1)[1].rsplit('"', 1)[0]
+        assert "Line one. Line two. Line three." == quoted
+
+    @patch("bigquery_agent_analytics.cli._build_client")
+    def test_evaluate_exit_code_code_metric_omits_feedback(self, mock_build):
+        """Code-based metrics leave llm_feedback empty -> no feedback field."""
+        report = EvaluationReport(
+            dataset="test",
+            evaluator_name="latency_evaluator",
+            total_sessions=1,
+            passed_sessions=0,
+            failed_sessions=1,
+            created_at=_NOW,
+            session_scores=[
+                SessionScore(
+                    session_id="bad",
+                    scores={"latency": 0.0},
+                    passed=False,
+                    details={
+                        "metric_latency": {
+                            "observed": 7000,
+                            "budget": 5000,
+                            "threshold": 1.0,
+                            "score": 0.0,
+                            "passed": False,
+                        }
+                    },
+                    llm_feedback=None,
+                ),
+            ],
+        )
+        client = MagicMock()
+        client.evaluate.return_value = report
+        mock_build.return_value = client
+
+        result = runner.invoke(
+            app,
+            [
+                "evaluate",
+                "--project-id=proj",
+                "--dataset-id=ds",
+                "--evaluator=latency",
+                "--exit-code",
+            ],
+        )
+        assert result.exit_code == 1
+        combined = (result.stderr or "") + (result.output or "")
+        assert "observed=7000" in combined
+        assert "budget=5000" in combined
+        # No feedback field should be emitted for code-based metrics.
+        assert "feedback=" not in combined
+
     @patch("bigquery_agent_analytics.cli._build_client")
     def test_evaluate_exit_code_on_pass(self, mock_build):
         client = MagicMock()
@@ -567,6 +763,47 @@ def test_evaluate_infra_error_exit_2(self, mock_build):
         assert result.exit_code == 2
 
 
+class TestFormatFeedbackSnippet:
+    """Direct unit tests for _format_feedback_snippet."""
+
+    def test_none_input_returns_none(self):
+        from bigquery_agent_analytics.cli import _format_feedback_snippet
+
+        assert _format_feedback_snippet(None) is None
+
+    def test_empty_input_returns_none(self):
+        from bigquery_agent_analytics.cli import _format_feedback_snippet
+
+        assert _format_feedback_snippet("") is None
+        assert _format_feedback_snippet(" \n\t ") is None
+
+    def test_short_input_passes_through_unchanged(self):
+        from bigquery_agent_analytics.cli import _format_feedback_snippet
+
+        assert _format_feedback_snippet("Short and useful.") == "Short and useful."
+
+    def test_collapses_internal_whitespace_runs(self):
+        from bigquery_agent_analytics.cli import _format_feedback_snippet
+
+        out = _format_feedback_snippet("First.\n\n Second.\tThird.")
+        assert out == "First. Second. Third."
+
+    def test_truncates_with_ellipsis_at_max_chars(self):
+        from bigquery_agent_analytics.cli import _format_feedback_snippet
+
+        text = "x" * 500
+        out = _format_feedback_snippet(text, max_chars=120)
+        assert len(out) == 120
+        assert out.endswith("\u2026")
+
+    def test_max_chars_param_respected(self):
+        from bigquery_agent_analytics.cli import _format_feedback_snippet
+
+        out = _format_feedback_snippet("y" * 200, max_chars=50)
+        assert len(out) == 50
+        assert out.endswith("\u2026")
+
+
 # ------------------------------------------------------------------ #
 # env var fallback #
 # ------------------------------------------------------------------ #