Commit a0b21fb

fix(llm-judge): make AI.GENERATE template execute against current BigQuery (#87)
* fix(llm-judge): full prompt parity + execution_mode/fallback_reason (#42)

* fix(llm-judge): full prompt parity + execution_mode/fallback_reason

Two coupled fixes for LLM-as-Judge: the BQ-native paths now see the full Python prompt template, and the resulting report stamps which path actually ran (and why earlier tiers fell back, when applicable).

## Prompt parity (F1)

`_ai_generate_judge` and `_bqml_judge` previously sent only `prompt_template.split("{trace_text}")[0]` to BigQuery — i.e., the prefix up to the first placeholder. Everything after `{trace_text}` in the Python template (including the per-criterion JSON output spec the judge model needs to score consistently) was silently dropped on the SQL paths. That made AI.GENERATE / ML.GENERATE_TEXT score against a different prompt than the API-fallback path, which uses the whole template via `str.format(...)`.

Fix:

- New helper `evaluators.split_judge_prompt_template(template)` that format()s the template with `\x00`-bracketed sentinels for both placeholders, then partitions the result into `(prefix, middle, suffix)`. Sentinels avoid clashing with literal template content; running the format pass ensures `{{...}}` escapes are correctly un-escaped before partitioning, so the SQL CONCAT sees the same string the API path produces.
- `AI_GENERATE_JUDGE_BATCH_QUERY` and `_LEGACY_LLM_JUDGE_BATCH_QUERY` now build the prompt with `CONCAT(@judge_prompt_prefix, trace_text, @judge_prompt_middle, COALESCE(final_response, 'N/A'), @judge_prompt_suffix)` — three parameters instead of one.
- Both judge methods in `client.py` swap `judge_prompt` for the three new parameters.

## execution_mode + fallback_reason (F2/F3)

`_evaluate_llm_judge` now stamps `report.details["execution_mode"]` with one of `ai_generate`, `ml_generate_text`, `api_fallback`, or `no_op` — matching the value space the categorical evaluator already uses. When an earlier tier raises before a later tier succeeds, `report.details["fallback_reason"]` carries the chained exception messages in attempt order, so CI gates and dashboards can audit which path actually ran. The categorical-style underscore naming is intentional: readers of both LLM-judge and categorical reports see the same vocabulary.

## Tests

- Prompt parity: assert AI.GENERATE and ML.GENERATE_TEXT receive three `judge_prompt_{prefix,middle,suffix}` params instead of the single `judge_prompt`, and that concatenation reproduces the full Python template (including the JSON output spec).
- Execution mode: assert each of ai_generate / ml_generate_text / api_fallback fires under the right cascade conditions, and that `fallback_reason` names the prior tiers in attempt order.
- `split_judge_prompt_template`: round-trip, missing-placeholder fallback paths, full-template-as-prefix when neither placeholder is present.

CHANGELOG entry added under `[Unreleased]`. Required publish blocker for blog post #3 (#82). PR #2 in this series will tighten the LLM-judge `evaluate --exit-code` FAIL output to surface criterion + threshold + a bounded `llm_feedback` snippet.

Ref: #82, #51.

* fix(llm-judge): keep synthesized labels next to their values

Reviewer flagged that ``split_judge_prompt_template`` mishandles custom templates with one placeholder. The SQL CONCAT runs prefix ++ trace_text ++ middle ++ final_response ++ suffix, so a synthesized label for an absent placeholder must end up *immediately before* the value it labels. Earlier fallback branches placed the labels on the wrong side:

- ``{trace_text}`` only — the ``Response:`` label landed AFTER the injected response value.
- ``{final_response}`` only — the ``Trace:`` label landed AFTER the injected trace value, and the user's prompt prose ended up AFTER the trace instead of before it.
- No placeholders — labels appended after their values for both trace and response.

Built-in correctness/hallucination/sentiment templates were unaffected because they declare both placeholders explicitly, so the dual-placeholder branch (which was correct) handled them.

Fix: rewrite the three fallback branches so the SQL CONCAT yields ``...<user prompt>...\nTrace:\n<TRACE>\nResponse:\n<RESPONSE>...`` in every case. Updated the docstring to document the rebuild contract precisely.

Tests: replaced the two-segment "label is somewhere in suffix" assertions with full-rebuild assertions that verify the ordering of user prose, ``Trace:`` label, trace value, ``Response:`` label, and response value. Three new regression tests cover all three fallback branches.

Ref: PR #42 review feedback.
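For orientation, here is a minimal sketch of the dual-placeholder split described above. It is not the shipped code: the real ``evaluators.split_judge_prompt_template`` also implements the three fallback branches for missing placeholders, which are omitted here, and its internals may differ.

```python
def split_judge_prompt_template(template: str) -> tuple[str, str, str]:
    """Split a judge prompt template into (prefix, middle, suffix).

    The SQL paths rebuild the prompt as
    CONCAT(prefix, trace_text, middle, final_response, suffix).
    """
    trace_sentinel = "\x00TRACE\x00"
    response_sentinel = "\x00RESPONSE\x00"
    # Run the same format() pass the API-fallback path uses, but substitute
    # NUL-bracketed sentinels for the two placeholders. The format pass also
    # un-escapes literal {{...}} braces, so the SQL CONCAT ends up seeing the
    # same string that str.format(...) would produce on the Python path.
    rendered = template.format(
        trace_text=trace_sentinel,
        final_response=response_sentinel,
    )
    prefix, _, rest = rendered.partition(trace_sentinel)
    middle, _, suffix = rest.partition(response_sentinel)
    return prefix, middle, suffix
```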
* feat(evaluate): LLM-judge FAIL output with bounded feedback + --strict docs (#44)

* feat(evaluate): LLM-judge FAIL output with bounded feedback + --strict docs

Builds on the runtime-behavior fixes from #42 (now merged on main): the report shape this PR depends on — ``SessionScore.llm_feedback`` populated by both BQ-native and API-fallback judge paths, and ``EvaluationReport.details["execution_mode"]`` distinguishing the three tiers — is stable as of 0.2.2 + #42.

## F5 — feedback snippet on LLM-judge FAIL lines

``_emit_evaluate_failures`` now appends a bounded ``feedback="…"`` field after ``score`` / ``threshold`` whenever a failing session's ``SessionScore.llm_feedback`` is non-empty. The snippet:

- Collapses internal whitespace runs (newlines, tabs, doubled spaces) to a single space, so multi-line judge justifications stay on one CI log line.
- Truncates at 120 characters with a U+2026 ellipsis when the collapsed string is longer.
- Is omitted entirely for code-based metrics (their ``llm_feedback`` is ``None``), so post #2's deterministic FAIL output stays visually identical.

Pulled the formatter into a small public-internal helper (``_format_feedback_snippet``) so the CLI tests can exercise it directly without round-tripping through Click; a rough sketch follows below. The safety-net "no per-metric detail available" branch also carries the snippet now — failing sessions with no metric details still surface the judge's reason rather than a generic placeholder.
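For intuition, a minimal sketch of the collapse-and-truncate rule described above. The shipped ``_format_feedback_snippet`` may differ in signature and edge-case handling; in particular, whether the ellipsis counts toward the 120-character budget is an assumption here.

```python
import re
from typing import Optional


def _format_feedback_snippet(
    feedback: Optional[str], max_chars: int = 120
) -> Optional[str]:
    """Collapse whitespace in a judge justification and bound its length."""
    if feedback is None or not feedback.strip():
        return None  # None / empty / whitespace-only -> omit the feedback field
    # Collapse newlines, tabs, and doubled spaces so the justification stays
    # on a single CI log line.
    collapsed = re.sub(r"\s+", " ", feedback).strip()
    if len(collapsed) <= max_chars:
        return collapsed
    # Assumption: the U+2026 ellipsis fits inside the max_chars budget.
    return collapsed[: max_chars - 1] + "\u2026"
```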
## F4 — ``--strict`` help/docs match shipped behavior

The previous ``--strict`` help text ("Fail sessions with unparseable judge output") promised a broader effect than what ships. In reality:

- API-fallback parse errors already coerce to ``score=0.0`` and fail any non-zero threshold without ``--strict``.
- ``--strict`` only changes AI.GENERATE rows whose typed output is empty/NULL: without it they're silently skipped (the default ``passed=True`` for empty-scores SessionScores); with it they're explicitly failed and counted under ``report.details``.

Updated the CLI help string to name AI.GENERATE specifically and to call out that API-fallback parse errors don't need the flag. Rewrote ``SDK.md §4 Strict Mode`` to lead with the AI.GENERATE path scope before getting into operational counters.

## Tests (10 new)

- LLM-judge FAIL line carries the feedback snippet, with the judge's actual prose visible in CI output.
- Long feedback truncates at the 120-char cap with U+2026.
- Multi-line feedback collapses to a single CI log line.
- Code-based metric failures emit no ``feedback`` field (regression guard against bleed from the LLM path).
- Direct unit tests on ``_format_feedback_snippet``: ``None`` / empty / whitespace-only inputs return ``None``; short inputs pass through; whitespace runs collapse; default and custom ``max_chars`` truncation respected.

CHANGELOG entry added under ``[Unreleased]`` (Added + Changed sections). Required publish blocker for blog post #3 (#82). Pairs with #42 as PR 2 of 2 in that series.

Ref: #82, #51.

* docs(strict): describe parse-error visibility, not pass/fail flipping

Reviewer flagged that the prior --strict rewrite still mischaracterized the AI.GENERATE behavior: it claimed strict=False silently skips empty/NULL typed-output rows and leaves them passing, then says strict=True flips them to failing. That's wrong on both ends. Actual code path:

- ``_ai_generate_judge`` (and ``_bqml_judge``) compute ``passed = bool(scores) and all(score >= threshold ...)``. Empty-scores rows already have passed=False — locked in by TestFalsePassFix.test_empty_score_fails on the runtime side.
- ``_apply_strict_mode`` walks the report and adds ``details['parse_error']=True`` to every empty-scores session, plus a report-level ``parse_errors`` / ``parse_error_rate`` counter. It does not change any session's pass/fail status.
- The API-fallback path coerces malformed output to ``score=0.0``, so its parse failures fail as low-score failures and don't surface through ``--strict``.

Net: ``--strict`` is a visibility knob, not a gate-affecting flag. For pass/fail-only CI consumers it's a no-op; for dashboards or investigations it lets you tell "low score" failures apart from "no parseable score" failures.

Updated:

- ``cli.py:strict`` help text — replaces the misleading "fail / silently skipped" framing with the parse-error metadata framing the code actually implements.
- ``SDK.md §4 Strict Mode`` — leads with "adds parse-error visibility, does not flip pass/fail," then enumerates exactly what ``_apply_strict_mode`` does.
- ``CHANGELOG.md`` ``[Unreleased]`` entry — same correction.

No code or test changes — docs/help only. Existing pytest run still 2080 passed, 4 skipped.

Ref: PR #44 review feedback.

* fix(llm-judge): make AI.GENERATE template execute against current BigQuery (#45)

* fix(llm-judge): make AI.GENERATE template execute against current BigQuery

Earlier SDK versions emitted

```sql
FROM session_traces,
AI.GENERATE(
  prompt => ...,
  endpoint => ...,
  model_params => JSON '{"temperature": 0.1, "max_output_tokens": 500}',
  output_schema => 'score INT64, justification STRING'
) AS result
```

That shape no longer parses against current BigQuery: AI.GENERATE is a scalar function (it returns a ``STRUCT<result, full_response, status, ...>`` shaped by ``output_schema``), not a table-valued function. The flat ``model_params`` dict is also rejected with ``does not conform to the GenerateContent request body`` — the new shape is ``{"generationConfig": {"temperature": ..., "maxOutputTokens": ...}}``. Mocked unit tests bypassed real query execution and never noticed.

## What this PR does

1. Replaces the table-valued ``FROM ..., AI.GENERATE(...)`` with a scalar ``SELECT AI.GENERATE(...).score, ...`` form, wrapped in an outer SELECT that flattens the struct to columns the existing ``_ai_generate_judge`` row-iteration code already reads (``score``, ``justification``, ``gen_status``).
2. Switches ``model_params`` to the ``{"generationConfig": {...}}`` shape required by current ``AI.GENERATE``.
3. Bumps ``maxOutputTokens`` from 500 to 1024 so the judge has enough room to return both a score and a justification on verbose criteria; the previous 500-token cap clipped some responses at MAX_TOKENS.
4. Adds ``evaluators.render_ai_generate_judge_query(...)`` so the optional ``connection_id`` argument can be inlined into the SQL only when supplied (usage sketched after this list). ``connection_id`` is now an *optional escape hatch*: omit it and AI.GENERATE runs on the submitter's end-user credentials; supply ``Client(..., connection_id="us.foo")`` and the call routes through that connection's service account.
5. Plumbs ``self.connection_id`` from ``Client`` through to ``_ai_generate_judge`` so a connection set at client construction propagates to the judge SQL automatically.
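A hedged usage sketch of the new entry point. The import path and every argument value below (project, dataset, table, filter, endpoint, connection name) are illustrative assumptions, not values taken from the SDK's documentation:

```python
from bigquery_agent_analytics import evaluators  # assumed import path

# Default form: no connection_id, so AI.GENERATE runs on the credentials of
# whichever account submits the job.
sql = evaluators.render_ai_generate_judge_query(
    project="my-project",            # illustrative values only
    dataset="agent_analytics",
    table="agent_events",
    where="session_id IS NOT NULL",  # shape of this filter is an assumption
    endpoint="gemini-2.0-flash",
)

# Escape hatch: route the call through a BigQuery connection's service
# account by inlining connection_id => '...' into the rendered SQL.
sql_via_connection = evaluators.render_ai_generate_judge_query(
    project="my-project",
    dataset="agent_analytics",
    table="agent_events",
    where="session_id IS NOT NULL",
    endpoint="gemini-2.0-flash",
    connection_id="us.my_ai_connection",
)
```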
## Live verification

Adds ``tests/test_ai_generate_judge_live.py`` with three gated tests that submit the exact rendered SQL to a real BigQuery project. Skipped by default; set ``BQAA_RUN_LIVE_TESTS=1`` plus ``PROJECT_ID`` / ``DATASET_ID`` to opt in. Optionally set ``BQAA_AI_GENERATE_CONNECTION_ID`` to also exercise the connection-supplied form. Smoke run against ``test-project-0728-467323.agent_analytics_demo``: all 3 pass.

End-to-end ``client.evaluate(evaluator=LLMAsJudge.correctness())`` through the public API now returns ``execution_mode='ai_generate'``, real correctness scores, and real LLM justifications against live BigQuery; verified against three sessions in the sandbox project.

## Test plan

- [x] ``pytest tests/`` (mocked path) -> 2080 passed, 7 skipped (was 4 before; the 3 new live tests skip without the env gate).
- [x] ``BQAA_RUN_LIVE_TESTS=1 ... pytest tests/test_ai_generate_judge_live.py`` -> 3 passed against live BQ.
- [x] End-to-end ``Client.evaluate`` smoke against live BQ with ``LLMAsJudge.correctness()``: real scores + justifications returned, ``execution_mode == 'ai_generate'``.

## Why this matters

This is a quiet publish blocker for blog post #3 (#82). The post's central claim — "AI.GENERATE keeps evaluation in BigQuery" — was false against the released SDK, because the SDK couldn't actually execute its own AI.GENERATE template. The mock-only test coverage let the bug ship in 0.2.x. The new gated live tests catch this class of mock-divergence bug going forward.

Ref: #82, #51.

* docs(changelog): merge duplicate Unreleased Added sections

Reviewer flagged that PR #45's CHANGELOG edits left the [Unreleased] section with two ### Added blocks and the prior "AI.GENERATE / ML.GENERATE_TEXT prompt template" bug fix stranded between them under the wrong heading. Restructured to one ### Fixed (both LLM-judge bug fixes), one ### Added (new entry points + live tests + execution_mode + split helper + feedback snippet), and one ### Changed (the strict-mode docs) — Keep-a-Changelog conformant. No content changes — just consolidating duplicate headers and moving the orphan bullet under the right section.

Ref: PR #45 review feedback.

---------

Co-authored-by: Haiyuan Cao <haiyuan@google.com>
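The live tests described under ``## Live verification`` above are opt-in via environment variables. One common way to wire such a gate in pytest looks roughly like this; it is a sketch under those assumptions, not the repository's actual test file:

```python
import os

import pytest

# Opt-in gate: the module is skipped unless BQAA_RUN_LIVE_TESTS=1 is set,
# alongside PROJECT_ID / DATASET_ID naming the target BigQuery sandbox.
pytestmark = pytest.mark.skipif(
    os.environ.get("BQAA_RUN_LIVE_TESTS") != "1",
    reason="live BigQuery tests are opt-in; set BQAA_RUN_LIVE_TESTS=1",
)


def test_rendered_judge_sql_is_accepted_by_bigquery():
    # Sketch only: the real tests render the judge SQL and submit it to the
    # project/dataset named by PROJECT_ID / DATASET_ID, optionally exercising
    # BQAA_AI_GENERATE_CONNECTION_ID for the connection-supplied form.
    ...
```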
1 parent 5db449c commit a0b21fb

4 files changed

Lines changed: 352 additions & 37 deletions

CHANGELOG.md

Lines changed: 75 additions & 18 deletions
```diff
@@ -9,27 +9,84 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Fixed
 
-- **LLM-as-Judge AI.GENERATE / ML.GENERATE_TEXT now uses the full Python
-  prompt template.** Previously both BQ-native paths sent only
-  ``prompt_template.split('{trace_text}')[0]`` to BigQuery, silently
-  dropping every instruction that followed the placeholders — including
-  the per-criterion output-format spec the judge model needs to score
-  consistently with the API-fallback path. The two BQ paths and the
-  Python API path now produce comparable scores against the same prompt.
+- **LLM-as-Judge AI.GENERATE path now executes against current
+  BigQuery.** Earlier versions emitted a table-valued
+  ``FROM session_traces, AI.GENERATE(...) AS result`` shape with
+  ``output_schema`` and a flat ``model_params`` dict. Current
+  ``AI.GENERATE`` is a scalar function that returns a STRUCT;
+  the table-valued form raises ``Table-valued function not found``
+  and the flat ``model_params`` raises ``does not conform to the
+  GenerateContent request body``. Mocked unit tests passed because
+  they bypassed real query execution. The SDK now renders a
+  ``SELECT AI.GENERATE(...).score, ...`` query with a
+  ``generationConfig``-wrapped ``model_params`` and ``output_schema``
+  on the scalar form, runs against live BigQuery, and unwraps the
+  returned struct's ``score`` / ``justification`` / ``status``
+  fields.
+- **LLM-as-Judge AI.GENERATE / ML.GENERATE_TEXT now uses the full
+  Python prompt template.** Previously both BQ-native paths sent
+  only ``prompt_template.split('{trace_text}')[0]`` to BigQuery,
+  silently dropping every instruction that followed the
+  placeholders — including the per-criterion output-format spec
+  the judge model needs to score consistently with the
+  API-fallback path. The two BQ paths and the Python API path now
+  produce comparable scores against the same prompt.
 
 ### Added
 
-- ``EvaluationReport.details["execution_mode"]`` is now populated for
-  LLM-as-Judge runs with one of ``ai_generate``, ``ml_generate_text``,
-  ``api_fallback``, or ``no_op`` — matching the value space the
-  categorical evaluator already exposes. When an earlier tier raised
-  before a later tier succeeded, ``details["fallback_reason"]`` carries
-  the chained exception messages in attempt order, so CI and dashboards
-  can audit which path actually ran.
-- ``evaluators.split_judge_prompt_template(prompt_template)`` is the
-  helper the SQL paths use to safely substitute the template into
-  ``CONCAT()``; exposed publicly for downstream code that needs the
-  same shape.
+- ``evaluators.render_ai_generate_judge_query(...)`` is the new
+  entry point that builds the AI.GENERATE batch SQL.
+  ``connection_id`` is optional — when omitted the call uses
+  end-user credentials; when supplied it inlines the
+  ``connection_id =>`` argument so callers can route through a
+  service-account-owned connection when their environment
+  requires it.
+- ``Client.connection_id`` already existed; it is now plumbed
+  through to ``_ai_generate_judge`` so a connection set at client
+  construction propagates to the judge SQL automatically.
+- Live BigQuery integration tests for the LLM-judge AI.GENERATE
+  path (``tests/test_ai_generate_judge_live.py``). Skipped by
+  default; opt in with ``BQAA_RUN_LIVE_TESTS=1`` plus
+  ``PROJECT_ID`` / ``DATASET_ID``. Three tests cover SQL parse
+  acceptance, expected result-schema column names, and the
+  ``connection_id`` escape hatch when
+  ``BQAA_AI_GENERATE_CONNECTION_ID`` is set. Catches the class of
+  mock-divergence bug that let the prior broken template ship.
+- ``EvaluationReport.details["execution_mode"]`` is now populated
+  for LLM-as-Judge runs with one of ``ai_generate``,
+  ``ml_generate_text``, ``api_fallback``, or ``no_op`` — matching
+  the value space the categorical evaluator already exposes. When
+  an earlier tier raised before a later tier succeeded,
+  ``details["fallback_reason"]`` carries the chained exception
+  messages in attempt order, so CI and dashboards can audit which
+  path actually ran.
+- ``evaluators.split_judge_prompt_template(prompt_template)`` is
+  the helper the SQL paths use to safely substitute the template
+  into ``CONCAT()``; exposed publicly for downstream code that
+  needs the same shape.
+- ``bq-agent-sdk evaluate --exit-code`` FAIL lines now carry a
+  bounded ``feedback="…"`` snippet drawn from
+  ``SessionScore.llm_feedback`` for LLM-judge failures. The
+  snippet collapses internal whitespace to a single space,
+  truncates to 120 characters with an ellipsis, and is omitted
+  entirely for code-based metrics (which leave ``llm_feedback``
+  empty). CI logs now explain *why* the judge said the session
+  failed without forcing the reader to chase the JSON output.
+
+### Changed
+
+- ``--strict`` help text and ``SDK.md §4`` clarified to match shipped
+  behavior. ``--strict`` is a *visibility* knob — it stamps
+  ``details['parse_error']=True`` on AI.GENERATE/ML.GENERATE_TEXT
+  judge rows whose ``scores`` dict is empty, and adds a report-level
+  ``parse_errors`` counter. It does **not** flip any session's
+  pass/fail outcome: both BQ-native judge methods compute ``passed``
+  as ``bool(scores) and all(...)``, so empty-scores rows already
+  fail without the flag. API-fallback parse errors coerce to
+  ``score=0.0``, so they fail as low-score failures rather than
+  parse errors. For pass/fail-only CI consumers ``--strict`` is a
+  no-op; reach for it when a dashboard needs to tell "no parseable
+  score" apart from "low score."
 
 ## [0.2.2] - 2026-04-24
```

src/bigquery_agent_analytics/client.py

Lines changed: 3 additions & 1 deletion
```diff
@@ -78,6 +78,7 @@
 from .evaluators import EvaluationReport
 from .evaluators import LLM_JUDGE_BATCH_QUERY
 from .evaluators import LLMAsJudge
+from .evaluators import render_ai_generate_judge_query
 from .evaluators import SESSION_SUMMARY_QUERY
 from .evaluators import SessionScore
 from .evaluators import split_judge_prompt_template
@@ -1089,12 +1090,13 @@ def _ai_generate_judge(
             bq.ScalarQueryParameter("judge_prompt_suffix", "STRING", suffix),
         ]
 
-        query = AI_GENERATE_JUDGE_BATCH_QUERY.format(
+        query = render_ai_generate_judge_query(
            project=self.project_id,
            dataset=self.dataset_id,
            table=table,
            where=where,
            endpoint=self.endpoint,
+           connection_id=self.connection_id,
        )
        job_config = bq.QueryJobConfig(
            query_parameters=judge_params,
```

src/bigquery_agent_analytics/evaluators.py

Lines changed: 71 additions & 18 deletions
```diff
@@ -862,7 +862,7 @@ def sentiment(
     LIMIT @trace_limit
 """
 
-AI_GENERATE_JUDGE_BATCH_QUERY = """\
+_AI_GENERATE_JUDGE_BATCH_QUERY_TEMPLATE = """\
 WITH session_traces AS (
   SELECT
     session_id,
@@ -891,25 +891,78 @@ def sentiment(
     session_id,
     trace_text,
     final_response,
-    result.*
-  FROM session_traces,
-  AI.GENERATE(
-    -- Substitute the full Python prompt_template at SQL time:
-    -- prefix ++ trace_text ++ middle ++ final_response ++ suffix.
-    -- Each segment is a separate query parameter so we preserve the
-    -- exact Python template (including the per-criterion output-format
-    -- spec) that the API-fallback path uses.
-    prompt => CONCAT(
-      @judge_prompt_prefix, trace_text,
-      @judge_prompt_middle, COALESCE(final_response, 'N/A'),
-      @judge_prompt_suffix
-    ),
-    endpoint => '{endpoint}',
-    model_params => JSON '{{"temperature": 0.1, "max_output_tokens": 500}}',
-    output_schema => 'score INT64, justification STRING'
-  ) AS result
+    gen.score AS score,
+    gen.justification AS justification,
+    gen.status AS gen_status
+  FROM (
+    SELECT
+      session_id,
+      trace_text,
+      final_response,
+      AI.GENERATE(
+        -- The Python prompt template is rebuilt at SQL time:
+        -- prefix ++ trace_text ++ middle ++ final_response ++ suffix
+        -- Each segment is a separate query parameter so AI.GENERATE
+        -- sees the exact full Python template (including the
+        -- per-criterion output-format spec) the API-fallback path uses.
+        prompt => CONCAT(
+          @judge_prompt_prefix, trace_text,
+          @judge_prompt_middle, COALESCE(final_response, 'N/A'),
+          @judge_prompt_suffix
+        ),
+        endpoint => '{endpoint}',{connection_arg}
+        model_params => JSON '{{"generationConfig": {{"temperature": 0.1, "maxOutputTokens": 1024}}}}',
+        output_schema => 'score INT64, justification STRING'
+      ) AS gen
+    FROM session_traces
+  )
 """
 
+
+def render_ai_generate_judge_query(
+    *,
+    project: str,
+    dataset: str,
+    table: str,
+    where: str,
+    endpoint: str,
+    connection_id: Optional[str] = None,
+) -> str:
+    """Render the AI.GENERATE judge batch query for a given config.
+
+    ``AI.GENERATE`` is BigQuery's scalar generative function (it returns a
+    ``STRUCT<score, justification, full_response, status, ...>`` shaped
+    by ``output_schema``). The function call lives inside a regular
+    ``SELECT`` — it is *not* a table-valued function, so the surrounding
+    ``FROM session_traces, AI.GENERATE(...)`` lateral-join syntax used
+    by older SDK versions does not parse against current BigQuery.
+
+    ``connection_id`` is optional. When supplied (e.g.
+    ``"us.bqaa_ai_generate"``) the call uses that connection's service
+    account; when omitted, AI.GENERATE runs against the end-user
+    credentials of whichever account submits the job. Both shapes are
+    documented forms of the same function.
+    """
+    if connection_id:
+        connection_arg = f"\n connection_id => '{connection_id}',"
+    else:
+        connection_arg = ""
+    return _AI_GENERATE_JUDGE_BATCH_QUERY_TEMPLATE.format(
+        project=project,
+        dataset=dataset,
+        table=table,
+        where=where,
+        endpoint=endpoint,
+        connection_arg=connection_arg,
+    )
+
+
+# Public alias kept for downstream code that imports the raw template
+# string (e.g. for inspection / docs). Callers building queries should
+# use ``render_ai_generate_judge_query`` instead so the optional
+# ``connection_id`` arg is wired correctly.
+AI_GENERATE_JUDGE_BATCH_QUERY = _AI_GENERATE_JUDGE_BATCH_QUERY_TEMPLATE
+
 # Legacy template kept for backward compatibility with pre-created
 # BQ ML models.
 _LEGACY_LLM_JUDGE_BATCH_QUERY = """\
```

0 commit comments