
Commit 7baa2b0

fix(llm-judge): full prompt parity + execution_mode/fallback_reason (#42) (#85)
* fix(llm-judge): full prompt parity + execution_mode/fallback_reason

Two coupled fixes for LLM-as-Judge: the BQ-native paths now see the full Python prompt template, and the resulting report stamps which path actually ran (and why earlier tiers fell back, when applicable).

## Prompt parity (F1)

`_ai_generate_judge` and `_bqml_judge` previously sent only `prompt_template.split("{trace_text}")[0]` to BigQuery — i.e., the prefix up to the first placeholder. Everything after `{trace_text}` in the Python template (including the per-criterion JSON output spec the judge model needs to score consistently) was silently dropped on the SQL paths. That made AI.GENERATE / ML.GENERATE_TEXT score against a different prompt than the API-fallback path, which uses the whole template via `str.format(...)`.

Fix:

- New helper `evaluators.split_judge_prompt_template(template)` format()s the template with `\x00`-bracketed sentinels for both placeholders, then partitions the result into `(prefix, middle, suffix)`. Sentinels avoid clashing with literal template content; running the format pass ensures `{{...}}` escapes are correctly un-escaped before partitioning, so the SQL CONCAT sees the same string the API path produces.
- `AI_GENERATE_JUDGE_BATCH_QUERY` and `_LEGACY_LLM_JUDGE_BATCH_QUERY` now `CONCAT(@judge_prompt_prefix, trace_text, @judge_prompt_middle, COALESCE(final_response, 'N/A'), @judge_prompt_suffix)` — three parameters instead of one.
- Both judge methods in `client.py` swap `judge_prompt` for the three new parameters.

## execution_mode + fallback_reason (F2/F3)

`_evaluate_llm_judge` now stamps `report.details["execution_mode"]` with one of `ai_generate`, `ml_generate_text`, `api_fallback`, or `no_op` — matching the value space the categorical evaluator already uses. When an earlier tier raises before a later tier succeeds, `report.details["fallback_reason"]` carries the chained exception messages in attempt order so CI gates and dashboards can audit which path actually ran. The categorical-style underscore naming is intentional — readers of both LLM-judge and categorical reports see the same vocabulary.

## Tests

- Prompt parity: assert AI.GENERATE and ML.GENERATE_TEXT receive three `judge_prompt_{prefix,middle,suffix}` params instead of the single `judge_prompt`, and that concatenation reproduces the full Python template (including the JSON output spec).
- Execution mode: assert each of ai_generate / ml_generate_text / api_fallback fires under the right cascade conditions, and that `fallback_reason` names the prior tiers in attempt order.
- `split_judge_prompt_template`: round-trip, missing-placeholder fallback paths, full-template-as-prefix when neither placeholder is present.

CHANGELOG entry added under `[Unreleased]`. Required publish blocker for blog post #3 (#82). PR #2 in this series will tighten the LLM-judge `evaluate --exit-code` FAIL output to surface criterion + threshold + a bounded `llm_feedback` snippet.

Ref: #82, #51.

* fix(llm-judge): keep synthesized labels next to their values

Reviewer feedback flagged that ``split_judge_prompt_template`` mishandled custom templates with only one placeholder. The SQL CONCAT runs prefix ++ trace_text ++ middle ++ final_response ++ suffix, so a synthesized label for an absent placeholder must end up *immediately before* the value it labels. The earlier fallback branches placed the labels on the wrong side:

- ``{trace_text}`` only — the ``Response:`` label landed AFTER the injected response value.
- ``{final_response}`` only — the ``Trace:`` label landed AFTER the injected trace value, and the user's prompt prose ended up AFTER the trace instead of before it.
- No placeholders — labels were appended after their values for both trace and response.

Built-in correctness/hallucination/sentiment templates were unaffected because they declare both placeholders explicitly, so the dual-placeholder branch (which was correct) handled them.

Fix: rewrite the three fallback branches so the SQL CONCAT yields ``...<user prompt>...\nTrace:\n<TRACE>\nResponse:\n<RESPONSE>...`` in every case. Updated the docstring to document the rebuild contract precisely.

Tests: replaced the two-segment "label is somewhere in suffix" assertions with full-rebuild assertions that verify the ordering of user prose, ``Trace:`` label, trace value, ``Response:`` label, and response value. Three new regression tests cover all three fallback branches.

Ref: PR #42 review feedback.
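
To make the rebuild contract concrete, a minimal sketch of the single-placeholder case the review fix addresses (the template text is invented for illustration, and the import path is assumed from the `src/bigquery_agent_analytics/` layout):

```python
# Illustrative sketch, not a repo test. Assumes the package import path
# matches the src/bigquery_agent_analytics/ layout shown in this commit.
from bigquery_agent_analytics.evaluators import split_judge_prompt_template

# Custom template with only {trace_text}: the helper must synthesize a
# "Response:" label that lands immediately BEFORE the injected response.
template = "Rate the agent run below.\nTrace:\n{trace_text}\nScore 1-5."
prefix, middle, suffix = split_judge_prompt_template(template)

# The SQL paths effectively compute prefix ++ trace ++ middle ++ response ++ suffix.
rebuilt = prefix + "<TRACE>" + middle + "<RESPONSE>" + suffix

assert rebuilt.index("Rate the agent run") < rebuilt.index("<TRACE>")
assert rebuilt.index("<TRACE>") < rebuilt.index("Response:") < rebuilt.index("<RESPONSE>")
```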
1 parent e948580 commit 7baa2b0

4 files changed

Lines changed: 466 additions & 19 deletions


CHANGELOG.md

Lines changed: 24 additions & 0 deletions
@@ -7,6 +7,30 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+### Fixed
+
+- **LLM-as-Judge AI.GENERATE / ML.GENERATE_TEXT now uses the full Python
+  prompt template.** Previously both BQ-native paths sent only
+  ``prompt_template.split('{trace_text}')[0]`` to BigQuery, silently
+  dropping every instruction that followed the placeholders — including
+  the per-criterion output-format spec the judge model needs to score
+  consistently with the API-fallback path. The two BQ paths and the
+  Python API path now produce comparable scores against the same prompt.
+
+### Added
+
+- ``EvaluationReport.details["execution_mode"]`` is now populated for
+  LLM-as-Judge runs with one of ``ai_generate``, ``ml_generate_text``,
+  ``api_fallback``, or ``no_op`` — matching the value space the
+  categorical evaluator already exposes. When an earlier tier raised
+  before a later tier succeeded, ``details["fallback_reason"]`` carries
+  the chained exception messages in attempt order, so CI and dashboards
+  can audit which path actually ran.
+- ``evaluators.split_judge_prompt_template(prompt_template)`` is the
+  helper the SQL paths use to safely substitute the template into
+  ``CONCAT()``; exposed publicly for downstream code that needs the
+  same shape.
 
 ## [0.2.2] - 2026-04-24
 
 ### Changed (breaking)
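
For downstream consumers, a sketch of how a CI gate might read the new fields (illustrative only; `check_judge_execution_path` is a hypothetical helper, and `report` is assumed to be the `EvaluationReport` returned by an LLM-as-Judge run):

```python
# Hypothetical CI-gate helper, not library code. Assumes `report` is the
# EvaluationReport returned by an LLM-as-Judge evaluation run.
def check_judge_execution_path(report, allow_api_fallback: bool = True) -> None:
    mode = report.details.get("execution_mode")     # ai_generate / ml_generate_text /
                                                    # api_fallback / no_op
    reason = report.details.get("fallback_reason")  # chained tier errors, attempt order
    if mode == "no_op":
        raise SystemExit("LLM judge had no criteria to evaluate")
    if mode == "api_fallback" and not allow_api_fallback:
        raise SystemExit(f"judge degraded to the API path: {reason}")
    print(f"judge ran via {mode}" + (f" (after: {reason})" if reason else ""))
```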

src/bigquery_agent_analytics/client.py

Lines changed: 42 additions & 14 deletions
@@ -80,6 +80,7 @@
 from .evaluators import LLMAsJudge
 from .evaluators import SESSION_SUMMARY_QUERY
 from .evaluators import SessionScore
+from .evaluators import split_judge_prompt_template
 from .feedback import AnalysisConfig
 from .feedback import compute_drift
 from .feedback import compute_question_distribution
@@ -975,14 +976,27 @@ def _evaluate_llm_judge(
         then falls back to the Gemini API. Each path evaluates
         every criterion in the evaluator and merges the per-session
         scores into a single report.
+
+        Stamps ``report.details["execution_mode"]`` with one of
+        ``ai_generate``, ``ml_generate_text``, ``api_fallback`` so the
+        caller (and CI gates) can audit which path actually ran.
+        When an earlier tier raised before a later tier succeeded,
+        ``report.details["fallback_reason"]`` carries the chained
+        exception messages in attempt order. (The naming mirrors the
+        categorical evaluator's ``execution_mode`` value space for
+        consistency.)
         """
         criteria = evaluator._criteria
         if not criteria:
-            return _build_report(
+            report = _build_report(
                 evaluator_name=evaluator.name,
                 dataset=f"{self._table_ref} WHERE {where}",
                 session_scores=[],
             )
+            report.details["execution_mode"] = "no_op"
+            return report
+
+        fallback_reasons: list[str] = []
 
         # Try AI.GENERATE (new path) when endpoint is not a legacy ref
         if not self._is_legacy_model_ref(self.endpoint):
@@ -997,17 +1011,20 @@ def _evaluate_llm_judge(
                         params,
                     )
                     criterion_reports.append((criterion, report))
-                return _merge_criterion_reports(
+                merged = _merge_criterion_reports(
                     evaluator.name,
                     f"{self._table_ref} WHERE {where}",
                     criteria,
                     criterion_reports,
                 )
+                merged.details["execution_mode"] = "ai_generate"
+                return merged
             except Exception as e:
                 logger.debug(
                     "AI.GENERATE judge failed, trying legacy: %s",
                     e,
                 )
+                fallback_reasons.append(f"ai_generate: {e}")
 
         # Try legacy BQML batch evaluation
         text_model = (
@@ -1028,20 +1045,29 @@ def _evaluate_llm_judge(
                         text_model,
                     )
                     criterion_reports.append((criterion, report))
-                return _merge_criterion_reports(
+                merged = _merge_criterion_reports(
                     evaluator.name,
                     f"{self._table_ref} WHERE {where}",
                     criteria,
                     criterion_reports,
                 )
+                merged.details["execution_mode"] = "ml_generate_text"
+                if fallback_reasons:
+                    merged.details["fallback_reason"] = "; ".join(fallback_reasons)
+                return merged
             except Exception as e:
                 logger.debug(
                     "BQML judge failed, falling back to API: %s",
                     e,
                 )
+                fallback_reasons.append(f"ml_generate_text: {e}")
 
         # Fallback: fetch traces using same table/filter, evaluate via API
-        return self._api_judge(evaluator, table, where, params)
+        api_report = self._api_judge(evaluator, table, where, params)
+        api_report.details["execution_mode"] = "api_fallback"
+        if fallback_reasons:
+            api_report.details["fallback_reason"] = "; ".join(fallback_reasons)
+        return api_report
 
     def _ai_generate_judge(
         self,
@@ -1054,12 +1080,13 @@ def _ai_generate_judge(
         """Evaluates using BigQuery AI.GENERATE with typed output."""
         from google.cloud import bigquery as bq
 
+        prefix, middle, suffix = split_judge_prompt_template(
+            criterion.prompt_template
+        )
         judge_params = list(params) + [
-            bq.ScalarQueryParameter(
-                "judge_prompt",
-                "STRING",
-                criterion.prompt_template.split("{trace_text}")[0],
-            ),
+            bq.ScalarQueryParameter("judge_prompt_prefix", "STRING", prefix),
+            bq.ScalarQueryParameter("judge_prompt_middle", "STRING", middle),
+            bq.ScalarQueryParameter("judge_prompt_suffix", "STRING", suffix),
         ]
 
         query = AI_GENERATE_JUDGE_BATCH_QUERY.format(
@@ -1121,12 +1148,13 @@ def _bqml_judge(
         """Evaluates using BigQuery ML.GENERATE_TEXT."""
        from google.cloud import bigquery as bq
 
+        prefix, middle, suffix = split_judge_prompt_template(
+            criterion.prompt_template
+        )
        judge_params = list(params) + [
-            bq.ScalarQueryParameter(
-                "judge_prompt",
-                "STRING",
-                criterion.prompt_template.split("{trace_text}")[0],
-            ),
+            bq.ScalarQueryParameter("judge_prompt_prefix", "STRING", prefix),
+            bq.ScalarQueryParameter("judge_prompt_middle", "STRING", middle),
+            bq.ScalarQueryParameter("judge_prompt_suffix", "STRING", suffix),
         ]
 
         query = LLM_JUDGE_BATCH_QUERY.format(
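
The cascade above boils down to a small bookkeeping pattern. A distilled, standalone sketch with generic names (not the repository's actual method):

```python
# Distilled sketch of the tiered-fallback bookkeeping in _evaluate_llm_judge;
# the tier names and runners here are generic stand-ins, not repo code.
from typing import Callable

def run_with_fallback(tiers: list[tuple[str, Callable[[], dict]]]) -> dict:
    """Try each (mode, runner) in order; stamp which tier ran and why
    earlier tiers fell back."""
    fallback_reasons: list[str] = []
    for mode, runner in tiers:
        try:
            details = runner()
        except Exception as e:  # record why this tier fell back, try the next
            fallback_reasons.append(f"{mode}: {e}")
            continue
        details["execution_mode"] = mode
        if fallback_reasons:
            details["fallback_reason"] = "; ".join(fallback_reasons)
        return details
    raise RuntimeError("all tiers failed: " + "; ".join(fallback_reasons))

# e.g. run_with_fallback([("ai_generate", run_ai), ("ml_generate_text", run_bqml),
#                         ("api_fallback", run_api)])
```

Unlike this sketch, the actual method does not wrap the final API tier in a try/except: if `_api_judge` raises, the exception propagates to the caller.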

src/bigquery_agent_analytics/evaluators.py

Lines changed: 108 additions & 5 deletions
@@ -894,9 +894,15 @@ def sentiment(
     result.*
 FROM session_traces,
     AI.GENERATE(
+        -- Substitute the full Python prompt_template at SQL time:
+        -- prefix ++ trace_text ++ middle ++ final_response ++ suffix.
+        -- Each segment is a separate query parameter so we preserve the
+        -- exact Python template (including the per-criterion output-format
+        -- spec) that the API-fallback path uses.
         prompt => CONCAT(
-            @judge_prompt, '\\nTrace:\\n', trace_text,
-            '\\nResponse:\\n', COALESCE(final_response, 'N/A')
+            @judge_prompt_prefix, trace_text,
+            @judge_prompt_middle, COALESCE(final_response, 'N/A'),
+            @judge_prompt_suffix
         ),
         endpoint => '{endpoint}',
         model_params => JSON '{{"temperature": 0.1, "max_output_tokens": 500}}',
@@ -938,9 +944,13 @@ def sentiment(
     ML.GENERATE_TEXT(
         MODEL `{model}`,
         STRUCT(
-            CONCAT(@judge_prompt, '\\nTrace:\\n', trace_text,
-                   '\\nResponse:\\n', COALESCE(final_response, 'N/A'))
-                AS prompt
+            -- Same prefix/middle/suffix substitution as the AI.GENERATE
+            -- path; preserves the full Python prompt_template.
+            CONCAT(
+                @judge_prompt_prefix, trace_text,
+                @judge_prompt_middle, COALESCE(final_response, 'N/A'),
+                @judge_prompt_suffix
+            ) AS prompt
         ),
         STRUCT(0.1 AS temperature, 500 AS max_output_tokens)
     ).ml_generate_text_result AS evaluation
@@ -951,6 +961,99 @@ def sentiment(
 LLM_JUDGE_BATCH_QUERY = _LEGACY_LLM_JUDGE_BATCH_QUERY
 
 
+_TRACE_SENTINEL = "\x00__BQAA_JUDGE_TRACE__\x00"
+_RESPONSE_SENTINEL = "\x00__BQAA_JUDGE_RESPONSE__\x00"
+
+
+def split_judge_prompt_template(prompt_template: str) -> tuple[str, str, str]:
+    """Split a Python judge prompt into ``(prefix, middle, suffix)``.
+
+    The Python ``LLMAsJudge`` prompt template uses ``{trace_text}`` and
+    ``{final_response}`` placeholders (in that order) to interpolate
+    per-session inputs. The BigQuery-native ``AI.GENERATE`` and
+    ``ML.GENERATE_TEXT`` paths can't use Python ``str.format`` — they
+    build the prompt at SQL time. This helper returns the three
+    literal segments those SQL paths need to ``CONCAT`` together with
+    the SQL-side ``trace_text`` and ``final_response`` columns,
+    preserving the exact full template (including the per-criterion
+    output-format spec that follows the placeholders).
+
+    Internally the helper format()s the template once with sentinel
+    values, so any literal ``{{...}}`` braces in the source template
+    (e.g. the JSON output spec ``{{"correctness": <score>, ...}}``)
+    are correctly un-escaped before splitting. The SQL paths see the
+    same string the API-fallback path's ``str.format(...)`` would
+    produce.
+
+    Args:
+        prompt_template: The Python prompt template, expected to
+            contain both ``{trace_text}`` and ``{final_response}``
+            placeholders in that order.
+
+    Returns:
+        ``(prefix, middle, suffix)`` such that
+        ``prefix + trace_text + middle + final_response + suffix``
+        reproduces ``prompt_template.format(trace_text=..., final_response=...)``
+        for any inputs. When a placeholder is missing, the helper
+        synthesizes a labeled section for the missing input and
+        places the label *immediately before* the injected value
+        (label first, then value), so the model reads
+        ``...Trace:\n<TRACE>\nResponse:\n<RESPONSE>...`` rather than
+        the value followed by an orphan label.
+    """
+    has_trace = "{trace_text}" in prompt_template
+    has_response = "{final_response}" in prompt_template
+
+    # Reminder for the fallback branches below: the SQL CONCAT runs
+    #   prefix ++ trace_text ++ middle ++ final_response ++ suffix
+    # so any label we synthesize for an absent placeholder must end
+    # up *next to* the value it labels (label first, then value),
+    # not on the far side of it. Earlier versions appended labels
+    # *after* the values, which produced ``<TRACE>\nTrace:\n...``.
+
+    if not has_trace and not has_response:
+        # No placeholders at all. Append a labeled trace + response
+        # block after the user's instructions. The labels precede the
+        # values so the model reads them in order.
+        return (
+            prompt_template + "\nTrace:\n",
+            "\nResponse:\n",
+            "",
+        )
+
+    if not has_trace:
+        # final_response placeholder only. Honor the user's structure
+        # and inject a labeled trace block right before the response,
+        # so the trace label sits next to the trace.
+        formatted = prompt_template.format(final_response=_RESPONSE_SENTINEL)
+        before_response, _, after_response = formatted.partition(_RESPONSE_SENTINEL)
+        return (
+            before_response + "\nTrace:\n",
+            "\n",
+            after_response,
+        )
+
+    if not has_response:
+        # trace_text placeholder only. Append a labeled response block
+        # after the original template's tail, so the response label
+        # sits next to the response value (not after it).
+        formatted = prompt_template.format(trace_text=_TRACE_SENTINEL)
+        prefix, _, after_trace = formatted.partition(_TRACE_SENTINEL)
+        return (
+            prefix,
+            after_trace + "\nResponse:\n",
+            "",
+        )
+
+    formatted = prompt_template.format(
+        trace_text=_TRACE_SENTINEL,
+        final_response=_RESPONSE_SENTINEL,
+    )
+    prefix, _, rest = formatted.partition(_TRACE_SENTINEL)
+    middle, _, suffix = rest.partition(_RESPONSE_SENTINEL)
+    return prefix, middle, suffix
+
+
 # ------------------------------------------------------------------ #
 # Helpers                                                            #
 # ------------------------------------------------------------------ #
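
A quick round-trip check of the dual-placeholder contract documented above (the template is invented for illustration; the import path is assumed from the `src/` layout):

```python
# Round-trip sketch for split_judge_prompt_template; illustrative only.
from bigquery_agent_analytics.evaluators import split_judge_prompt_template

template = (
    "Score the agent for correctness.\n"
    "Trace:\n{trace_text}\n"
    "Response:\n{final_response}\n"
    'Return JSON: {{"correctness": <score 1-5>}}'
)
prefix, middle, suffix = split_judge_prompt_template(template)

trace, response = "step 1 ... step 2 ...", "final answer"
rebuilt = prefix + trace + middle + response + suffix

# The three segments reproduce exactly what the API-fallback path would send,
# including the un-escaped JSON output spec after the placeholders.
assert rebuilt == template.format(trace_text=trace, final_response=response)
```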
