
Commit 5db449c

feat(evaluate): LLM-judge FAIL output with bounded feedback + --strict docs (#86)
* fix(llm-judge): full prompt parity + execution_mode/fallback_reason (#42)

* fix(llm-judge): full prompt parity + execution_mode/fallback_reason

Two coupled fixes for LLM-as-Judge: the BQ-native paths now see the full Python prompt template, and the resulting report stamps which path actually ran (and why earlier tiers fell back when applicable).

## Prompt parity (F1)

`_ai_generate_judge` and `_bqml_judge` previously sent only `prompt_template.split("{trace_text}")[0]` to BigQuery — i.e., the prefix up to the first placeholder. Everything after `{trace_text}` in the Python template (including the per-criterion JSON output spec the judge model needs to score consistently) was silently dropped on the SQL paths. That made AI.GENERATE / ML.GENERATE_TEXT score against a different prompt than the API-fallback path, which uses the whole template via `str.format(...)`.

Fix:

- New helper `evaluators.split_judge_prompt_template(template)` that format()s the template with `\x00`-bracketed sentinels for both placeholders, then partitions the result into `(prefix, middle, suffix)`. Sentinels avoid clashing with literal template content; running the format pass ensures `{{...}}` escapes are correctly un-escaped before partitioning, so the SQL CONCAT sees the same string the API path produces.
- `AI_GENERATE_JUDGE_BATCH_QUERY` and `_LEGACY_LLM_JUDGE_BATCH_QUERY` now `CONCAT(@judge_prompt_prefix, trace_text, @judge_prompt_middle, COALESCE(final_response, 'N/A'), @judge_prompt_suffix)` — three parameters instead of one.
- Both judge methods in `client.py` swap `judge_prompt` for the three new parameters.

## execution_mode + fallback_reason (F2/F3)

`_evaluate_llm_judge` now stamps `report.details["execution_mode"]` with one of `ai_generate`, `ml_generate_text`, `api_fallback`, or `no_op` — matching the value-space the categorical evaluator already uses. When an earlier tier raises before a later tier succeeds, `report.details["fallback_reason"]` carries the chained exception messages in attempt order so CI gates and dashboards can audit which path actually ran. Categorical-style underscore naming is intentional — so readers of both LLM-judge and categorical reports see the same vocabulary.

## Tests

- Prompt parity: assert AI.GENERATE and ML.GENERATE_TEXT receive three `judge_prompt_{prefix,middle,suffix}` params instead of the single `judge_prompt`, and that concatenation reproduces the full Python template (including the JSON output spec).
- Execution mode: assert each of ai_generate / ml_generate_text / api_fallback fires under the right cascade conditions, and that `fallback_reason` names the prior tiers in attempt order.
- `split_judge_prompt_template`: round-trip, missing-placeholder fallback paths, full-template-as-prefix when neither placeholder is present.

CHANGELOG entry added under `[Unreleased]`. Required publish blocker for blog post #3 (#82). PR #2 in this series will tighten the LLM-judge `evaluate --exit-code` FAIL output to surface criterion + threshold + bounded `llm_feedback` snippet.

Ref: #82, #51.

* fix(llm-judge): keep synthesized labels next to their values

Reviewer flagged that ``split_judge_prompt_template`` mishandles custom templates with one placeholder. The SQL CONCAT runs prefix ++ trace_text ++ middle ++ final_response ++ suffix, so a synthesized label for an absent placeholder must end up *immediately before* the value it labels.

Earlier fallback branches placed the labels on the wrong side:

- ``{trace_text}`` only — ``Response:`` label landed AFTER the injected response value.
- ``{final_response}`` only — ``Trace:`` label landed AFTER the injected trace value, and the user's prompt prose ended up AFTER the trace instead of before it.
- No placeholders — labels appended after their values for both trace and response.

Built-in correctness/hallucination/sentiment templates were unaffected because they declare both placeholders explicitly, so the dual-placeholder branch (which was correct) handled them.

Fix: rewrite the three fallback branches so the SQL CONCAT yields ``...<user prompt>...\nTrace:\n<TRACE>\nResponse:\n<RESPONSE>...`` in every case. Updated the docstring to document the rebuild contract precisely.

Tests: replaced the two-segment "label is somewhere in suffix" assertions with full-rebuild assertions that verify ordering of user prose, ``Trace:`` label, trace value, ``Response:`` label, and response value. Three new regression tests cover all three fallback branches.

Ref: PR #42 review feedback.

* feat(evaluate): LLM-judge FAIL output with bounded feedback + --strict docs (#44)

* feat(evaluate): LLM-judge FAIL output with bounded feedback + --strict docs

Builds on the runtime-behavior fixes from #42 (now merged on main): the report shape this PR depends on — ``SessionScore.llm_feedback`` populated by both BQ-native and API-fallback judge paths, and ``EvaluationReport.details["execution_mode"]`` distinguishing the three tiers — is stable as of 0.2.2 + #42.

## F5 — feedback snippet on LLM-judge FAIL lines

``_emit_evaluate_failures`` now appends a bounded ``feedback="…"`` field after ``score`` / ``threshold`` whenever a failing session's ``SessionScore.llm_feedback`` is non-empty. The snippet:

- Collapses internal whitespace runs (newlines, tabs, doubled spaces) to a single space, so multi-line judge justifications stay on one CI log line.
- Truncates at 120 characters with a U+2026 ellipsis when the collapsed string is longer.
- Is omitted entirely for code-based metrics (their ``llm_feedback`` is ``None``), so post #2's deterministic FAIL output stays visually identical.

Pulled the formatter into a small public-internal helper (``_format_feedback_snippet``) so the CLI tests can exercise it directly without round-tripping through Click. The safety-net "no per-metric detail available" branch also carries the snippet now — failing sessions with no metric details still surface the judge's reason rather than a generic placeholder.

## F4 — ``--strict`` help/docs match shipped behavior

The previous ``--strict`` help text ("Fail sessions with unparseable judge output") promised a broader effect than what ships. In reality:

- API-fallback parse errors already coerce to ``score=0.0`` and fail any non-zero threshold without ``--strict``.
- ``--strict`` only changes AI.GENERATE rows whose typed output is empty/NULL: without it they're silently skipped (the default ``passed=True`` for empty-scores SessionScores); with it they're explicitly failed and counted under ``report.details``.

Updated the CLI help string to name AI.GENERATE specifically and to call out that API-fallback parse errors don't need the flag. Rewrote ``SDK.md §4 Strict Mode`` to lead with the AI.GENERATE path scope before getting into operational counters.

## Tests (10 new)

- LLM-judge FAIL line carries the feedback snippet, with the judge's actual prose visible in CI output.
- Long feedback truncates at the 120-char cap with U+2026.
- Multi-line feedback collapses to a single CI log line.
- Code-based metric failures emit no ``feedback`` field (regression guard against bleed from the LLM path).
- Direct unit tests on ``_format_feedback_snippet``: ``None`` / empty / whitespace-only inputs return ``None``; short inputs pass through; whitespace runs collapse; default and custom ``max_chars`` truncation respected.

CHANGELOG entry added under ``[Unreleased]`` (Added + Changed sections). Required publish blocker for blog post #3 (#82). Pairs with #42 as PR 2 of 2 in that series.

Ref: #82, #51.

* docs(strict): describe parse-error visibility, not pass/fail flipping

Reviewer flagged that the prior --strict rewrite still mischaracterized the AI.GENERATE behavior: it claimed strict=False silently skips empty/NULL typed-output rows and leaves them passing, then claimed strict=True flips them to failing. That's wrong on both ends. Actual code path:

- ``_ai_generate_judge`` (and ``_bqml_judge``) compute ``passed = bool(scores) and all(score >= threshold ...)``. Empty-scores rows already have passed=False — locked in by TestFalsePassFix.test_empty_score_fails on the runtime side.
- ``_apply_strict_mode`` walks the report and adds ``details['parse_error']=True`` to every empty-scores session plus a report-level ``parse_errors`` / ``parse_error_rate`` counter. It does not change any session's pass/fail status.
- The API-fallback path coerces malformed output to ``score=0.0``, so its parse failures fail as low-score failures and don't surface through ``--strict``.

Net: ``--strict`` is a visibility knob, not a gate-affecting flag. For pass/fail-only CI consumers it's a no-op; for dashboards or investigations it lets you tell ``low score`` failures apart from ``no parseable score`` failures.

Updated:

- ``cli.py:strict`` help text — replaces the misleading "fail/silently-skipped" framing with the parse-error metadata framing the code actually implements.
- ``SDK.md §4 Strict Mode`` — leads with "adds parse-error visibility, does not flip pass/fail," then enumerates exactly what ``_apply_strict_mode`` does.
- ``CHANGELOG.md`` ``[Unreleased]`` entry — same correction.

No code or test changes — only docs/help. Existing pytest run still 2080 passed, 4 skipped.

Ref: PR #44 review feedback.

---------

Co-authored-by: Haiyuan Cao <haiyuan@google.com>
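For illustration, a minimal sketch of the split-and-rebuild contract described under "Prompt parity (F1)" above. It covers only the dual-placeholder case; the sentinel constants and the function body are simplified stand-ins, not the shipped `evaluators` implementation:

```python
# Simplified sketch of the split/rebuild idea (dual-placeholder templates only).
# Sentinel values and body are illustrative, not the shipped code.
TRACE_SENTINEL = "\x00TRACE\x00"
RESPONSE_SENTINEL = "\x00RESPONSE\x00"


def split_judge_prompt_template(template: str) -> tuple[str, str, str]:
    """Split a judge prompt into (prefix, middle, suffix) around its placeholders."""
    # format() first, so literal {{...}} escapes are un-escaped exactly as the
    # API-fallback path would see them.
    rendered = template.format(
        trace_text=TRACE_SENTINEL, final_response=RESPONSE_SENTINEL
    )
    prefix, _, rest = rendered.partition(TRACE_SENTINEL)
    middle, _, suffix = rest.partition(RESPONSE_SENTINEL)
    return prefix, middle, suffix


# BigQuery then rebuilds the full prompt as
#   CONCAT(@judge_prompt_prefix, trace_text, @judge_prompt_middle,
#          COALESCE(final_response, 'N/A'), @judge_prompt_suffix)
# which round-trips the original template:
prefix, middle, suffix = split_judge_prompt_template(
    "Rate the agent.\nTrace:\n{trace_text}\nResponse:\n{final_response}\nReturn JSON."
)
assert prefix + "<TRACE>" + middle + "<RESPONSE>" + suffix == (
    "Rate the agent.\nTrace:\n<TRACE>\nResponse:\n<RESPONSE>\nReturn JSON."
)
```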
1 parent 7baa2b0 commit 5db449c

3 files changed

Lines changed: 321 additions & 10 deletions


SDK.md

Lines changed: 30 additions & 1 deletion
@@ -275,7 +275,36 @@ print(report.summary())
 
 ### Strict Mode
 
-When `strict=True`, sessions where the LLM judge returns empty or unparseable output are marked as **failed** instead of silently passing. Operational counters are placed in `report.details` (not `aggregate_scores`) so downstream consumers can treat scores as purely normalized metrics:
+`strict=True` adds **parse-error visibility** — it does not flip
+any session's pass/fail outcome. Both BQ-native judge methods set
+`passed = bool(scores) and all(score >= threshold for score in
+scores.values())`, so a row whose `scores` dict is empty (the
+judge model returned no parseable output) already fails. Without
+`strict=True` you can't tell from the report whether a failed
+session failed because the judge gave a low score or because the
+judge gave nothing parseable at all.
+
+`strict=True` walks the merged report and:
+
+- Stamps `SessionScore.details["parse_error"] = True` on every
+  session whose `scores` dict is empty.
+- Adds a report-level `details["parse_errors"]` count plus
+  `details["parse_error_rate"]` (fraction of `total_sessions`).
+
+The API-fallback path coerces malformed model output to
+`score=0.0` and always populates `scores`, so its failures look
+like low-score failures rather than parse errors. `strict=True`
+won't surface them as parse errors today; it's an AI.GENERATE /
+ML.GENERATE_TEXT visibility knob in practice.
+
+For pass/fail-only consumers (CI gates with `--exit-code`),
+`strict=True` is a no-op. Reach for it when a dashboard or
+investigation needs to distinguish "no parseable score" from
+"low score" failures.
+
+Operational counters are placed in `report.details` (not
+`aggregate_scores`) so downstream consumers can treat scores as
+purely normalized metrics:
 
 ```python
 report = client.evaluate(
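To make the consumer side concrete, here is a hedged sketch of how a dashboard or investigation script might read those counters. The helper name and call shape are hypothetical; only the field names (`details["parse_errors"]`, `details["parse_error_rate"]`, and the per-session `details["parse_error"]` stamp) come from the docs above:

```python
# Hypothetical consumer-side helper (not part of the SDK); it relies only on
# the report fields documented above.
def summarize_parse_errors(report) -> None:
    parse_errors = report.details.get("parse_errors", 0)
    if not parse_errors:
        return
    rate = report.details.get("parse_error_rate", 0.0)
    print(f"{parse_errors} session(s) returned no parseable judge output ({rate:.0%})")
    for s in report.session_scores:
        if s.details.get("parse_error"):
            # These sessions failed because nothing parseable came back,
            # not because the judge scored them below the threshold.
            print(f"  parse error: session={s.session_id}")
```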

src/bigquery_agent_analytics/cli.py

Lines changed: 54 additions & 9 deletions
@@ -302,7 +302,14 @@ def evaluate(
     ),
     strict: bool = typer.Option(
         False,
-        help="Fail sessions with unparseable judge output.",
+        help=(
+            "Stamp parse-error metadata on AI.GENERATE judge rows with"
+            " empty or NULL typed output. Those rows already fail"
+            " (empty score < threshold); --strict adds"
+            " details['parse_error']=True and a report-level"
+            " parse_errors counter so dashboards can tell 'no"
+            " parseable score' apart from 'low score' failures."
+        ),
     ),
     endpoint: Optional[str] = typer.Option(
         None,
@@ -368,6 +375,31 @@ def evaluate(
         raise typer.Exit(code=2)
 
 
+_FEEDBACK_SNIPPET_MAX = 120
+
+
+def _format_feedback_snippet(
+    feedback: Optional[str], max_chars: int = _FEEDBACK_SNIPPET_MAX
+) -> Optional[str]:
+    """Return a single-line, bounded snippet of an LLM-judge justification.
+
+    Collapses internal whitespace runs (including newlines) to a single
+    space so the snippet fits on one CI log line, then truncates to
+    ``max_chars`` with a trailing ``…`` when the original was longer.
+    Returns ``None`` for empty / whitespace-only input so callers can
+    cleanly skip the field.
+    """
+    if not feedback:
+        return None
+    collapsed = " ".join(feedback.split())
+    if not collapsed:
+        return None
+    if len(collapsed) <= max_chars:
+        return collapsed
+    # Reserve one char for the ellipsis to keep the visual width capped.
+    return collapsed[: max_chars - 1].rstrip() + "\u2026"
+
+
 def _emit_evaluate_failures(
     report: EvaluationReport, max_sessions: int = 10
 ) -> None:
@@ -377,10 +409,14 @@ def _emit_evaluate_failures(
     Prefers the raw observed + budget pair (``CodeEvaluator`` prebuilts);
     falls back to score + threshold when the metric didn't declare
     observed/budget (custom ``add_metric`` users, ``LLMAsJudge``
-    criteria). A failing session is guaranteed to produce at least one
-    FAIL line — never just the summary header.
-
-    Capped at ``max_sessions`` most-recent failures so CI logs stay scannable.
+    criteria). For LLM-judge failures the line also carries a bounded
+    ``feedback="…"`` snippet drawn from ``SessionScore.llm_feedback``
+    so CI logs explain *why* the judge said the session failed without
+    forcing the reader to chase the JSON output.
+
+    A failing session is guaranteed to produce at least one FAIL line —
+    never just the summary header. Capped at ``max_sessions`` most-recent
+    failures so CI logs stay scannable.
     """
     failed = [s for s in report.session_scores if not s.passed]
     if not failed:
@@ -393,6 +429,7 @@ def _emit_evaluate_failures(
     )
     shown = failed[:max_sessions]
     for s in shown:
+        feedback_snippet = _format_feedback_snippet(s.llm_feedback)
         emitted_for_session = False
         for metric_name, score in s.scores.items():
             detail = s.details.get(f"metric_{metric_name}") or {}
@@ -433,6 +470,12 @@ def _emit_evaluate_failures(
             parts.append(f"score={score:.4g}")
             if threshold is not None and isinstance(threshold, (int, float)):
                 parts.append(f"threshold={threshold:.4g}")
+            # LLM judges populate ``SessionScore.llm_feedback`` with the
+            # judge's justification. Surface a bounded snippet on the FAIL
+            # line so CI logs explain *why* without dumping the full JSON.
+            # Code-based metrics leave ``llm_feedback`` empty and skip this.
+            if feedback_snippet is not None:
+                parts.append(f'feedback="{feedback_snippet}"')
             typer.echo(" " + " ".join(parts), err=True)
             emitted_for_session = True
 
@@ -441,10 +484,12 @@ def _emit_evaluate_failures(
         # while the session itself is flagged failed (a bug upstream) — we
         # still point the reader at the session id.
         if not emitted_for_session:
-            typer.echo(
-                f" FAIL session={s.session_id} (no per-metric detail available)",
-                err=True,
-            )
+            fallback = f" FAIL session={s.session_id}"
+            if feedback_snippet is not None:
+                fallback += f' feedback="{feedback_snippet}"'
+            else:
+                fallback += " (no per-metric detail available)"
+            typer.echo(fallback, err=True)
     if len(failed) > len(shown):
         typer.echo(
             f" ... {len(failed) - len(shown)} more failing session(s) "

tests/test_cli.py

Lines changed: 237 additions & 0 deletions
@@ -420,6 +420,202 @@ def test_evaluate_exit_code_emits_fallback_with_no_details(self, mock_build):
         assert "metric=legacy_metric" in combined
         assert "score=0.3" in combined
 
+    @patch("bigquery_agent_analytics.cli._build_client")
+    def test_evaluate_exit_code_llm_judge_emits_feedback_snippet(
+        self, mock_build
+    ):
+        """LLM-judge failures expose ``SessionScore.llm_feedback`` in the
+        FAIL line as a bounded ``feedback="..."`` snippet.
+
+        Without this, post #2's deterministic FAIL output story carries
+        over to LLM-judge, but the differentiator vs. a hand-rolled judge
+        ("the score is *explained*") has nothing visible in CI logs.
+        """
+        report = EvaluationReport(
+            dataset="test",
+            evaluator_name="correctness_judge",
+            total_sessions=1,
+            passed_sessions=0,
+            failed_sessions=1,
+            created_at=_NOW,
+            session_scores=[
+                SessionScore(
+                    session_id="bad",
+                    scores={"correctness": 0.3},
+                    passed=False,
+                    details={},
+                    llm_feedback=(
+                        "The agent confirmed a booking but the booking"
+                        " tool never ran for that session."
+                    ),
+                ),
+            ],
+        )
+        client = MagicMock()
+        client.evaluate.return_value = report
+        mock_build.return_value = client
+
+        result = runner.invoke(
+            app,
+            [
+                "evaluate",
+                "--project-id=proj",
+                "--dataset-id=ds",
+                "--evaluator=llm-judge",
+                "--criterion=correctness",
+                "--exit-code",
+            ],
+        )
+        assert result.exit_code == 1
+        combined = (result.stderr or "") + (result.output or "")
+        # Existing fields still present.
+        assert "FAIL session=bad" in combined
+        assert "metric=correctness" in combined
+        assert "score=0.3" in combined
+        # Feedback snippet appears, quoted, with the actual justification.
+        assert 'feedback="' in combined
+        assert "booking tool never ran" in combined
+
+    @patch("bigquery_agent_analytics.cli._build_client")
+    def test_evaluate_exit_code_llm_judge_truncates_long_feedback(
+        self, mock_build
+    ):
+        """Justifications longer than the snippet bound are truncated with U+2026."""
+        long_feedback = "word " * 200  # ~1000 chars
+        report = EvaluationReport(
+            dataset="test",
+            evaluator_name="correctness_judge",
+            total_sessions=1,
+            passed_sessions=0,
+            failed_sessions=1,
+            created_at=_NOW,
+            session_scores=[
+                SessionScore(
+                    session_id="bad",
+                    scores={"correctness": 0.0},
+                    passed=False,
+                    details={},
+                    llm_feedback=long_feedback,
+                ),
+            ],
+        )
+        client = MagicMock()
+        client.evaluate.return_value = report
+        mock_build.return_value = client
+
+        result = runner.invoke(
+            app,
+            [
+                "evaluate",
+                "--project-id=proj",
+                "--dataset-id=ds",
+                "--evaluator=llm-judge",
+                "--exit-code",
+            ],
+        )
+        assert result.exit_code == 1
+        combined = (result.stderr or "") + (result.output or "")
+        # Look at the FAIL line itself: feedback="...". The snippet stays
+        # under the configured cap (120 chars between the quotes).
+        fail_line = next(
+            line for line in combined.splitlines() if line.startswith(" FAIL")
+        )
+        assert 'feedback="' in fail_line
+        quoted = fail_line.split('feedback="', 1)[1].rsplit('"', 1)[0]
+        assert len(quoted) <= 120
+        assert quoted.endswith("\u2026")
+
+    @patch("bigquery_agent_analytics.cli._build_client")
+    def test_evaluate_exit_code_collapses_newlines_in_feedback(self, mock_build):
+        """Multi-line judge feedback collapses to a single CI log line."""
+        report = EvaluationReport(
+            dataset="test",
+            evaluator_name="correctness_judge",
+            total_sessions=1,
+            passed_sessions=0,
+            failed_sessions=1,
+            created_at=_NOW,
+            session_scores=[
+                SessionScore(
+                    session_id="bad",
+                    scores={"correctness": 0.2},
+                    passed=False,
+                    details={},
+                    llm_feedback="Line one.\nLine two.\n\nLine three.",
+                ),
+            ],
+        )
+        client = MagicMock()
+        client.evaluate.return_value = report
+        mock_build.return_value = client
+
+        result = runner.invoke(
+            app,
+            [
+                "evaluate",
+                "--project-id=proj",
+                "--dataset-id=ds",
+                "--evaluator=llm-judge",
+                "--exit-code",
+            ],
+        )
+        assert result.exit_code == 1
+        combined = (result.stderr or "") + (result.output or "")
+        fail_line = next(
+            line for line in combined.splitlines() if line.startswith(" FAIL")
+        )
+        quoted = fail_line.split('feedback="', 1)[1].rsplit('"', 1)[0]
+        assert "Line one. Line two. Line three." == quoted
+
+    @patch("bigquery_agent_analytics.cli._build_client")
+    def test_evaluate_exit_code_code_metric_omits_feedback(self, mock_build):
+        """Code-based metrics leave llm_feedback empty -> no feedback field."""
+        report = EvaluationReport(
+            dataset="test",
+            evaluator_name="latency_evaluator",
+            total_sessions=1,
+            passed_sessions=0,
+            failed_sessions=1,
+            created_at=_NOW,
+            session_scores=[
+                SessionScore(
+                    session_id="bad",
+                    scores={"latency": 0.0},
+                    passed=False,
+                    details={
+                        "metric_latency": {
+                            "observed": 7000,
+                            "budget": 5000,
+                            "threshold": 1.0,
+                            "score": 0.0,
+                            "passed": False,
+                        }
+                    },
+                    llm_feedback=None,
+                ),
+            ],
+        )
+        client = MagicMock()
+        client.evaluate.return_value = report
+        mock_build.return_value = client
+
+        result = runner.invoke(
+            app,
+            [
+                "evaluate",
+                "--project-id=proj",
+                "--dataset-id=ds",
+                "--evaluator=latency",
+                "--exit-code",
+            ],
+        )
+        assert result.exit_code == 1
+        combined = (result.stderr or "") + (result.output or "")
+        assert "observed=7000" in combined
+        assert "budget=5000" in combined
+        # No feedback field should be emitted for code-based metrics.
+        assert "feedback=" not in combined
+
     @patch("bigquery_agent_analytics.cli._build_client")
     def test_evaluate_exit_code_on_pass(self, mock_build):
         client = MagicMock()
@@ -567,6 +763,47 @@ def test_evaluate_infra_error_exit_2(self, mock_build):
         assert result.exit_code == 2
 
 
+class TestFormatFeedbackSnippet:
+    """Direct unit tests for _format_feedback_snippet."""
+
+    def test_none_input_returns_none(self):
+        from bigquery_agent_analytics.cli import _format_feedback_snippet
+
+        assert _format_feedback_snippet(None) is None
+
+    def test_empty_input_returns_none(self):
+        from bigquery_agent_analytics.cli import _format_feedback_snippet
+
+        assert _format_feedback_snippet("") is None
+        assert _format_feedback_snippet(" \n\t ") is None
+
+    def test_short_input_passes_through_unchanged(self):
+        from bigquery_agent_analytics.cli import _format_feedback_snippet
+
+        assert _format_feedback_snippet("Short and useful.") == "Short and useful."
+
+    def test_collapses_internal_whitespace_runs(self):
+        from bigquery_agent_analytics.cli import _format_feedback_snippet
+
+        out = _format_feedback_snippet("First.\n\n Second.\tThird.")
+        assert out == "First. Second. Third."
+
+    def test_truncates_with_ellipsis_at_max_chars(self):
+        from bigquery_agent_analytics.cli import _format_feedback_snippet
+
+        text = "x" * 500
+        out = _format_feedback_snippet(text, max_chars=120)
+        assert len(out) == 120
+        assert out.endswith("\u2026")
+
+    def test_max_chars_param_respected(self):
+        from bigquery_agent_analytics.cli import _format_feedback_snippet
+
+        out = _format_feedback_snippet("y" * 200, max_chars=50)
+        assert len(out) == 50
+        assert out.endswith("\u2026")
+
+
 # ------------------------------------------------------------------ #
 # env var fallback #
 # ------------------------------------------------------------------ #