
Commit 7baa2b0

fix(llm-judge): full prompt parity + execution_mode/fallback_reason (#42) (#85)
* fix(llm-judge): full prompt parity + execution_mode/fallback_reason

Two coupled fixes for LLM-as-Judge: the BQ-native paths now see the full Python prompt template, and the resulting report stamps which path actually ran (and why earlier tiers fell back, when applicable).

## Prompt parity (F1)

`_ai_generate_judge` and `_bqml_judge` previously sent only `prompt_template.split("{trace_text}")[0]` to BigQuery — i.e., the prefix up to the first placeholder. Everything after `{trace_text}` in the Python template (including the per-criterion JSON output spec the judge model needs to score consistently) was silently dropped on the SQL paths. That made AI.GENERATE / ML.GENERATE_TEXT score against a different prompt than the API-fallback path, which uses the whole template via `str.format(...)`.

Fix:

- New helper `evaluators.split_judge_prompt_template(template)` format()s the template with `\x00`-bracketed sentinels for both placeholders, then partitions the result into `(prefix, middle, suffix)`. Sentinels avoid clashing with literal template content; running the format pass ensures `{{...}}` escapes are correctly un-escaped before partitioning, so the SQL CONCAT sees the same string the API path produces.
- `AI_GENERATE_JUDGE_BATCH_QUERY` and `_LEGACY_LLM_JUDGE_BATCH_QUERY` now `CONCAT(@judge_prompt_prefix, trace_text, @judge_prompt_middle, COALESCE(final_response, 'N/A'), @judge_prompt_suffix)` — three parameters instead of one.
- Both judge methods in `client.py` swap `judge_prompt` for the three new parameters.

## execution_mode + fallback_reason (F2/F3)

`_evaluate_llm_judge` now stamps `report.details["execution_mode"]` with one of `ai_generate`, `ml_generate_text`, `api_fallback`, or `no_op` — matching the value space the categorical evaluator already uses. When an earlier tier raises before a later tier succeeds, `report.details["fallback_reason"]` carries the chained exception messages in attempt order so CI gates and dashboards can audit which path actually ran. The categorical-style underscore naming is intentional — readers of both LLM-judge and categorical reports see the same vocabulary.

## Tests

- Prompt parity: assert AI.GENERATE and ML.GENERATE_TEXT receive three `judge_prompt_{prefix,middle,suffix}` params instead of the single `judge_prompt`, and that concatenation reproduces the full Python template (including the JSON output spec).
- Execution mode: assert each of ai_generate / ml_generate_text / api_fallback fires under the right cascade conditions, and that `fallback_reason` names the prior tiers in attempt order.
- `split_judge_prompt_template`: round-trip, missing-placeholder fallback paths, full-template-as-prefix when neither placeholder is present.

CHANGELOG entry added under `[Unreleased]`. Required publish blocker for blog post #3 (#82). PR #2 in this series will tighten the LLM-judge `evaluate --exit-code` FAIL output to surface criterion + threshold + a bounded `llm_feedback` snippet.

Ref: #82, #51.

* fix(llm-judge): keep synthesized labels next to their values

Reviewer feedback flagged that ``split_judge_prompt_template`` mishandled custom templates with only one placeholder. The SQL CONCAT runs prefix ++ trace_text ++ middle ++ final_response ++ suffix, so a synthesized label for an absent placeholder must end up *immediately before* the value it labels. The earlier fallback branches placed the labels on the wrong side:

- ``{trace_text}`` only — the ``Response:`` label landed AFTER the injected response value.
- ``{final_response}`` only — the ``Trace:`` label landed AFTER the injected trace value, and the user's prompt prose ended up AFTER the trace instead of before it.
- No placeholders — labels were appended after their values for both trace and response.

Built-in correctness/hallucination/sentiment templates were unaffected because they declare both placeholders explicitly, so the dual-placeholder branch (which was correct) handled them.

Fix: rewrite the three fallback branches so the SQL CONCAT yields ``...<user prompt>...\nTrace:\n<TRACE>\nResponse:\n<RESPONSE>...`` in every case. Updated the docstring to document the rebuild contract precisely.

Tests: replaced the two-segment "label is somewhere in suffix" assertions with full-rebuild assertions that verify the ordering of user prose, ``Trace:`` label, trace value, ``Response:`` label, and response value. Three new regression tests cover all three fallback branches.

Ref: PR #42 review feedback.
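
To make the rebuild contract concrete, a minimal sketch of the single-placeholder case the review fix addresses (the template text is invented for illustration, and the import path is assumed from the `src/bigquery_agent_analytics/` layout):

```python
# Illustrative sketch, not a repo test. Assumes the package import path
# matches the src/bigquery_agent_analytics/ layout shown in this commit.
from bigquery_agent_analytics.evaluators import split_judge_prompt_template

# Custom template with only {trace_text}: the helper must synthesize a
# "Response:" label that lands immediately BEFORE the injected response.
template = "Rate the agent run below.\nTrace:\n{trace_text}\nScore 1-5."
prefix, middle, suffix = split_judge_prompt_template(template)

# The SQL paths effectively compute prefix ++ trace ++ middle ++ response ++ suffix.
rebuilt = prefix + "<TRACE>" + middle + "<RESPONSE>" + suffix

assert rebuilt.index("Rate the agent run") < rebuilt.index("<TRACE>")
assert rebuilt.index("<TRACE>") < rebuilt.index("Response:") < rebuilt.index("<RESPONSE>")
```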
1 parent e948580 commit 7baa2b0

4 files changed

Lines changed: 466 additions & 19 deletions


CHANGELOG.md

Lines changed: 24 additions & 0 deletions
@@ -7,6 +7,30 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+### Fixed
+
+- **LLM-as-Judge AI.GENERATE / ML.GENERATE_TEXT now uses the full Python
+  prompt template.** Previously both BQ-native paths sent only
+  ``prompt_template.split('{trace_text}')[0]`` to BigQuery, silently
+  dropping every instruction that followed the placeholders — including
+  the per-criterion output-format spec the judge model needs to score
+  consistently with the API-fallback path. The two BQ paths and the
+  Python API path now produce comparable scores against the same prompt.
+
+### Added
+
+- ``EvaluationReport.details["execution_mode"]`` is now populated for
+  LLM-as-Judge runs with one of ``ai_generate``, ``ml_generate_text``,
+  ``api_fallback``, or ``no_op`` — matching the value space the
+  categorical evaluator already exposes. When an earlier tier raised
+  before a later tier succeeded, ``details["fallback_reason"]`` carries
+  the chained exception messages in attempt order, so CI and dashboards
+  can audit which path actually ran.
+- ``evaluators.split_judge_prompt_template(prompt_template)`` is the
+  helper the SQL paths use to safely substitute the template into
+  ``CONCAT()``; exposed publicly for downstream code that needs the
+  same shape.
 
 ## [0.2.2] - 2026-04-24
 
 ### Changed (breaking)
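
For downstream consumers, a sketch of how a CI gate might read the new fields (illustrative only; `check_judge_execution_path` is a hypothetical helper, and `report` is assumed to be the `EvaluationReport` returned by an LLM-as-Judge run):

```python
# Hypothetical CI-gate helper, not library code. Assumes `report` is the
# EvaluationReport returned by an LLM-as-Judge evaluation run.
def check_judge_execution_path(report, allow_api_fallback: bool = True) -> None:
    mode = report.details.get("execution_mode")     # ai_generate / ml_generate_text /
                                                    # api_fallback / no_op
    reason = report.details.get("fallback_reason")  # chained tier errors, attempt order
    if mode == "no_op":
        raise SystemExit("LLM judge had no criteria to evaluate")
    if mode == "api_fallback" and not allow_api_fallback:
        raise SystemExit(f"judge degraded to the API path: {reason}")
    print(f"judge ran via {mode}" + (f" (after: {reason})" if reason else ""))
```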

src/bigquery_agent_analytics/client.py

Lines changed: 42 additions & 14 deletions
@@ -80,6 +80,7 @@
 from .evaluators import LLMAsJudge
 from .evaluators import SESSION_SUMMARY_QUERY
 from .evaluators import SessionScore
+from .evaluators import split_judge_prompt_template
 from .feedback import AnalysisConfig
 from .feedback import compute_drift
 from .feedback import compute_question_distribution
@@ -975,14 +976,27 @@ def _evaluate_llm_judge(
         then falls back to the Gemini API. Each path evaluates
         every criterion in the evaluator and merges the per-session
         scores into a single report.
+
+        Stamps ``report.details["execution_mode"]`` with one of
+        ``ai_generate``, ``ml_generate_text``, ``api_fallback`` so the
+        caller (and CI gates) can audit which path actually ran.
+        When an earlier tier raised before a later tier succeeded,
+        ``report.details["fallback_reason"]`` carries the chained
+        exception messages in attempt order. (The naming mirrors the
+        categorical evaluator's ``execution_mode`` value space for
+        consistency.)
         """
         criteria = evaluator._criteria
         if not criteria:
-            return _build_report(
+            report = _build_report(
                 evaluator_name=evaluator.name,
                 dataset=f"{self._table_ref} WHERE {where}",
                 session_scores=[],
             )
+            report.details["execution_mode"] = "no_op"
+            return report
+
+        fallback_reasons: list[str] = []
 
         # Try AI.GENERATE (new path) when endpoint is not a legacy ref
         if not self._is_legacy_model_ref(self.endpoint):
@@ -997,17 +1011,20 @@ def _evaluate_llm_judge(
                         params,
                     )
                     criterion_reports.append((criterion, report))
-                return _merge_criterion_reports(
+                merged = _merge_criterion_reports(
                     evaluator.name,
                     f"{self._table_ref} WHERE {where}",
                     criteria,
                     criterion_reports,
                 )
+                merged.details["execution_mode"] = "ai_generate"
+                return merged
             except Exception as e:
                 logger.debug(
                     "AI.GENERATE judge failed, trying legacy: %s",
                     e,
                 )
+                fallback_reasons.append(f"ai_generate: {e}")
 
         # Try legacy BQML batch evaluation
         text_model = (
@@ -1028,20 +1045,29 @@ def _evaluate_llm_judge(
                         text_model,
                     )
                     criterion_reports.append((criterion, report))
-                return _merge_criterion_reports(
+                merged = _merge_criterion_reports(
                     evaluator.name,
                     f"{self._table_ref} WHERE {where}",
                     criteria,
                     criterion_reports,
                 )
+                merged.details["execution_mode"] = "ml_generate_text"
+                if fallback_reasons:
+                    merged.details["fallback_reason"] = "; ".join(fallback_reasons)
+                return merged
             except Exception as e:
                 logger.debug(
                     "BQML judge failed, falling back to API: %s",
                     e,
                 )
+                fallback_reasons.append(f"ml_generate_text: {e}")
 
         # Fallback: fetch traces using same table/filter, evaluate via API
-        return self._api_judge(evaluator, table, where, params)
+        api_report = self._api_judge(evaluator, table, where, params)
+        api_report.details["execution_mode"] = "api_fallback"
+        if fallback_reasons:
+            api_report.details["fallback_reason"] = "; ".join(fallback_reasons)
+        return api_report
 
     def _ai_generate_judge(
         self,
@@ -1054,12 +1080,13 @@ def _ai_generate_judge(
         """Evaluates using BigQuery AI.GENERATE with typed output."""
         from google.cloud import bigquery as bq
 
+        prefix, middle, suffix = split_judge_prompt_template(
+            criterion.prompt_template
+        )
         judge_params = list(params) + [
-            bq.ScalarQueryParameter(
-                "judge_prompt",
-                "STRING",
-                criterion.prompt_template.split("{trace_text}")[0],
-            ),
+            bq.ScalarQueryParameter("judge_prompt_prefix", "STRING", prefix),
+            bq.ScalarQueryParameter("judge_prompt_middle", "STRING", middle),
+            bq.ScalarQueryParameter("judge_prompt_suffix", "STRING", suffix),
         ]
 
         query = AI_GENERATE_JUDGE_BATCH_QUERY.format(
@@ -1121,12 +1148,13 @@ def _bqml_judge(
         """Evaluates using BigQuery ML.GENERATE_TEXT."""
        from google.cloud import bigquery as bq
 
+        prefix, middle, suffix = split_judge_prompt_template(
+            criterion.prompt_template
+        )
        judge_params = list(params) + [
-            bq.ScalarQueryParameter(
-                "judge_prompt",
-                "STRING",
-                criterion.prompt_template.split("{trace_text}")[0],
-            ),
+            bq.ScalarQueryParameter("judge_prompt_prefix", "STRING", prefix),
+            bq.ScalarQueryParameter("judge_prompt_middle", "STRING", middle),
+            bq.ScalarQueryParameter("judge_prompt_suffix", "STRING", suffix),
         ]
 
         query = LLM_JUDGE_BATCH_QUERY.format(
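
The cascade above boils down to a small bookkeeping pattern. A distilled, standalone sketch with generic names (not the repository's actual method):

```python
# Distilled sketch of the tiered-fallback bookkeeping in _evaluate_llm_judge;
# the tier names and runners here are generic stand-ins, not repo code.
from typing import Callable

def run_with_fallback(tiers: list[tuple[str, Callable[[], dict]]]) -> dict:
    """Try each (mode, runner) in order; stamp which tier ran and why
    earlier tiers fell back."""
    fallback_reasons: list[str] = []
    for mode, runner in tiers:
        try:
            details = runner()
        except Exception as e:  # record why this tier fell back, try the next
            fallback_reasons.append(f"{mode}: {e}")
            continue
        details["execution_mode"] = mode
        if fallback_reasons:
            details["fallback_reason"] = "; ".join(fallback_reasons)
        return details
    raise RuntimeError("all tiers failed: " + "; ".join(fallback_reasons))

# e.g. run_with_fallback([("ai_generate", run_ai), ("ml_generate_text", run_bqml),
#                         ("api_fallback", run_api)])
```

Unlike this sketch, the actual method does not wrap the final API tier in a try/except: if `_api_judge` raises, the exception propagates to the caller.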

src/bigquery_agent_analytics/evaluators.py

Lines changed: 108 additions & 5 deletions
@@ -894,9 +894,15 @@ def sentiment(
     result.*
 FROM session_traces,
     AI.GENERATE(
+        -- Substitute the full Python prompt_template at SQL time:
+        -- prefix ++ trace_text ++ middle ++ final_response ++ suffix.
+        -- Each segment is a separate query parameter so we preserve the
+        -- exact Python template (including the per-criterion output-format
+        -- spec) that the API-fallback path uses.
         prompt => CONCAT(
-            @judge_prompt, '\\nTrace:\\n', trace_text,
-            '\\nResponse:\\n', COALESCE(final_response, 'N/A')
+            @judge_prompt_prefix, trace_text,
+            @judge_prompt_middle, COALESCE(final_response, 'N/A'),
+            @judge_prompt_suffix
         ),
         endpoint => '{endpoint}',
         model_params => JSON '{{"temperature": 0.1, "max_output_tokens": 500}}',
@@ -938,9 +944,13 @@ def sentiment(
     ML.GENERATE_TEXT(
         MODEL `{model}`,
         STRUCT(
-            CONCAT(@judge_prompt, '\\nTrace:\\n', trace_text,
-                   '\\nResponse:\\n', COALESCE(final_response, 'N/A'))
-                AS prompt
+            -- Same prefix/middle/suffix substitution as the AI.GENERATE
+            -- path; preserves the full Python prompt_template.
+            CONCAT(
+                @judge_prompt_prefix, trace_text,
+                @judge_prompt_middle, COALESCE(final_response, 'N/A'),
+                @judge_prompt_suffix
+            ) AS prompt
         ),
         STRUCT(0.1 AS temperature, 500 AS max_output_tokens)
     ).ml_generate_text_result AS evaluation
@@ -951,6 +961,99 @@ def sentiment(
 LLM_JUDGE_BATCH_QUERY = _LEGACY_LLM_JUDGE_BATCH_QUERY
 
 
+_TRACE_SENTINEL = "\x00__BQAA_JUDGE_TRACE__\x00"
+_RESPONSE_SENTINEL = "\x00__BQAA_JUDGE_RESPONSE__\x00"
+
+
+def split_judge_prompt_template(prompt_template: str) -> tuple[str, str, str]:
+    """Split a Python judge prompt into ``(prefix, middle, suffix)``.
+
+    The Python ``LLMAsJudge`` prompt template uses ``{trace_text}`` and
+    ``{final_response}`` placeholders (in that order) to interpolate
+    per-session inputs. The BigQuery-native ``AI.GENERATE`` and
+    ``ML.GENERATE_TEXT`` paths can't use Python ``str.format`` — they
+    build the prompt at SQL time. This helper returns the three
+    literal segments those SQL paths need to ``CONCAT`` together with
+    the SQL-side ``trace_text`` and ``final_response`` columns,
+    preserving the exact full template (including the per-criterion
+    output-format spec that follows the placeholders).
+
+    Internally the helper format()s the template once with sentinel
+    values, so any literal ``{{...}}`` braces in the source template
+    (e.g. the JSON output spec ``{{"correctness": <score>, ...}}``)
+    are correctly un-escaped before splitting. The SQL paths see the
+    same string the API-fallback path's ``str.format(...)`` would
+    produce.
+
+    Args:
+        prompt_template: The Python prompt template, expected to
+            contain both ``{trace_text}`` and ``{final_response}``
+            placeholders in that order.
+
+    Returns:
+        ``(prefix, middle, suffix)`` such that
+        ``prefix + trace_text + middle + final_response + suffix``
+        reproduces ``prompt_template.format(trace_text=..., final_response=...)``
+        for any inputs. When a placeholder is missing, the helper
+        synthesizes a labeled section for the missing input and
+        places the label *immediately before* the injected value
+        (label first, then value), so the model reads
+        ``...Trace:\n<TRACE>\nResponse:\n<RESPONSE>...`` rather than
+        the value followed by an orphan label.
+    """
+    has_trace = "{trace_text}" in prompt_template
+    has_response = "{final_response}" in prompt_template
+
+    # Reminder for the fallback branches below: the SQL CONCAT runs
+    #   prefix ++ trace_text ++ middle ++ final_response ++ suffix
+    # so any label we synthesize for an absent placeholder must end
+    # up *next to* the value it labels (label first, then value),
+    # not on the far side of it. Earlier versions appended labels
+    # *after* the values, which produced ``<TRACE>\nTrace:\n...``.
+
+    if not has_trace and not has_response:
+        # No placeholders at all. Append a labeled trace + response
+        # block after the user's instructions. The labels precede the
+        # values so the model reads them in order.
+        return (
+            prompt_template + "\nTrace:\n",
+            "\nResponse:\n",
+            "",
+        )
+
+    if not has_trace:
+        # final_response placeholder only. Honor the user's structure
+        # and inject a labeled trace block right before the response,
+        # so the trace label sits next to the trace.
+        formatted = prompt_template.format(final_response=_RESPONSE_SENTINEL)
+        before_response, _, after_response = formatted.partition(_RESPONSE_SENTINEL)
+        return (
+            before_response + "\nTrace:\n",
+            "\n",
+            after_response,
+        )
+
+    if not has_response:
+        # trace_text placeholder only. Append a labeled response block
+        # after the original template's tail, so the response label
+        # sits next to the response value (not after it).
+        formatted = prompt_template.format(trace_text=_TRACE_SENTINEL)
+        prefix, _, after_trace = formatted.partition(_TRACE_SENTINEL)
+        return (
+            prefix,
+            after_trace + "\nResponse:\n",
+            "",
+        )
+
+    formatted = prompt_template.format(
+        trace_text=_TRACE_SENTINEL,
+        final_response=_RESPONSE_SENTINEL,
+    )
+    prefix, _, rest = formatted.partition(_TRACE_SENTINEL)
+    middle, _, suffix = rest.partition(_RESPONSE_SENTINEL)
+    return prefix, middle, suffix
+
+
 # ------------------------------------------------------------------ #
 # Helpers                                                            #
 # ------------------------------------------------------------------ #
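
A quick round-trip check of the dual-placeholder contract documented above (the template is invented for illustration; the import path is assumed from the `src/` layout):

```python
# Round-trip sketch for split_judge_prompt_template; illustrative only.
from bigquery_agent_analytics.evaluators import split_judge_prompt_template

template = (
    "Score the agent for correctness.\n"
    "Trace:\n{trace_text}\n"
    "Response:\n{final_response}\n"
    'Return JSON: {{"correctness": <score 1-5>}}'
)
prefix, middle, suffix = split_judge_prompt_template(template)

trace, response = "step 1 ... step 2 ...", "final answer"
rebuilt = prefix + trace + middle + response + suffix

# The three segments reproduce exactly what the API-fallback path would send,
# including the un-escaped JSON output spec after the placeholders.
assert rebuilt == template.format(trace_text=trace, final_response=response)
```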
