Commit a0b21fb

fix(llm-judge): make AI.GENERATE template execute against current BigQuery (#87)
* fix(llm-judge): full prompt parity + execution_mode/fallback_reason (#42)

* fix(llm-judge): full prompt parity + execution_mode/fallback_reason

Two coupled fixes for LLM-as-Judge: the BQ-native paths now see the full Python prompt template, and the resulting report stamps which path actually ran (and why earlier tiers fell back, when applicable).

## Prompt parity (F1)

`_ai_generate_judge` and `_bqml_judge` previously sent only `prompt_template.split("{trace_text}")[0]` to BigQuery — i.e., the prefix up to the first placeholder. Everything after `{trace_text}` in the Python template (including the per-criterion JSON output spec the judge model needs to score consistently) was silently dropped on the SQL paths. That made AI.GENERATE / ML.GENERATE_TEXT score against a different prompt than the API-fallback path, which uses the whole template via `str.format(...)`.

Fix:

- New helper `evaluators.split_judge_prompt_template(template)` that format()s the template with `\x00`-bracketed sentinels for both placeholders, then partitions the result into `(prefix, middle, suffix)`. Sentinels avoid clashing with literal template content; running the format pass ensures `{{...}}` escapes are correctly un-escaped before partitioning, so the SQL CONCAT sees the same string the API path produces.
- `AI_GENERATE_JUDGE_BATCH_QUERY` and `_LEGACY_LLM_JUDGE_BATCH_QUERY` now build the prompt with `CONCAT(@judge_prompt_prefix, trace_text, @judge_prompt_middle, COALESCE(final_response, 'N/A'), @judge_prompt_suffix)` — three parameters instead of one.
- Both judge methods in `client.py` swap `judge_prompt` for the three new parameters.

## execution_mode + fallback_reason (F2/F3)

`_evaluate_llm_judge` now stamps `report.details["execution_mode"]` with one of `ai_generate`, `ml_generate_text`, `api_fallback`, or `no_op` — matching the value space the categorical evaluator already uses. When an earlier tier raises before a later tier succeeds, `report.details["fallback_reason"]` carries the chained exception messages in attempt order, so CI gates and dashboards can audit which path actually ran. The categorical-style underscore naming is intentional: readers of both LLM-judge and categorical reports see the same vocabulary.

## Tests

- Prompt parity: assert AI.GENERATE and ML.GENERATE_TEXT receive three `judge_prompt_{prefix,middle,suffix}` params instead of the single `judge_prompt`, and that concatenation reproduces the full Python template (including the JSON output spec).
- Execution mode: assert each of ai_generate / ml_generate_text / api_fallback fires under the right cascade conditions, and that `fallback_reason` names the prior tiers in attempt order.
- `split_judge_prompt_template`: round-trip, missing-placeholder fallback paths, full-template-as-prefix when neither placeholder is present.

CHANGELOG entry added under `[Unreleased]`. Required publish blocker for blog post #3 (#82). PR #2 in this series will tighten the LLM-judge `evaluate --exit-code` FAIL output to surface criterion + threshold + a bounded `llm_feedback` snippet.

Ref: #82, #51.

* fix(llm-judge): keep synthesized labels next to their values

Reviewer flagged that ``split_judge_prompt_template`` mishandles custom templates with one placeholder. The SQL CONCAT runs prefix ++ trace_text ++ middle ++ final_response ++ suffix, so a synthesized label for an absent placeholder must end up *immediately before* the value it labels. Earlier fallback branches placed the labels on the wrong side:

- ``{trace_text}`` only — the ``Response:`` label landed AFTER the injected response value.
- ``{final_response}`` only — the ``Trace:`` label landed AFTER the injected trace value, and the user's prompt prose ended up AFTER the trace instead of before it.
- No placeholders — labels appended after their values for both trace and response.

Built-in correctness/hallucination/sentiment templates were unaffected because they declare both placeholders explicitly, so the dual-placeholder branch (which was correct) handled them.

Fix: rewrite the three fallback branches so the SQL CONCAT yields ``...<user prompt>...\nTrace:\n<TRACE>\nResponse:\n<RESPONSE>...`` in every case. Updated the docstring to document the rebuild contract precisely.

Tests: replaced the two-segment "label is somewhere in suffix" assertions with full-rebuild assertions that verify the ordering of user prose, ``Trace:`` label, trace value, ``Response:`` label, and response value. Three new regression tests cover all three fallback branches.

Ref: PR #42 review feedback.
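For orientation, here is a minimal sketch of the dual-placeholder split described above. It is not the shipped code: the real ``evaluators.split_judge_prompt_template`` also implements the three fallback branches for missing placeholders, which are omitted here, and its internals may differ.

```python
def split_judge_prompt_template(template: str) -> tuple[str, str, str]:
    """Split a judge prompt template into (prefix, middle, suffix).

    The SQL paths rebuild the prompt as
    CONCAT(prefix, trace_text, middle, final_response, suffix).
    """
    trace_sentinel = "\x00TRACE\x00"
    response_sentinel = "\x00RESPONSE\x00"
    # Run the same format() pass the API-fallback path uses, but substitute
    # NUL-bracketed sentinels for the two placeholders. The format pass also
    # un-escapes literal {{...}} braces, so the SQL CONCAT ends up seeing the
    # same string that str.format(...) would produce on the Python path.
    rendered = template.format(
        trace_text=trace_sentinel,
        final_response=response_sentinel,
    )
    prefix, _, rest = rendered.partition(trace_sentinel)
    middle, _, suffix = rest.partition(response_sentinel)
    return prefix, middle, suffix
```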
* feat(evaluate): LLM-judge FAIL output with bounded feedback + --strict docs (#44)

* feat(evaluate): LLM-judge FAIL output with bounded feedback + --strict docs

Builds on the runtime-behavior fixes from #42 (now merged on main): the report shape this PR depends on — ``SessionScore.llm_feedback`` populated by both BQ-native and API-fallback judge paths, and ``EvaluationReport.details["execution_mode"]`` distinguishing the three tiers — is stable as of 0.2.2 + #42.

## F5 — feedback snippet on LLM-judge FAIL lines

``_emit_evaluate_failures`` now appends a bounded ``feedback="…"`` field after ``score`` / ``threshold`` whenever a failing session's ``SessionScore.llm_feedback`` is non-empty. The snippet:

- Collapses internal whitespace runs (newlines, tabs, doubled spaces) to a single space, so multi-line judge justifications stay on one CI log line.
- Truncates at 120 characters with a U+2026 ellipsis when the collapsed string is longer.
- Is omitted entirely for code-based metrics (their ``llm_feedback`` is ``None``), so post #2's deterministic FAIL output stays visually identical.

Pulled the formatter into a small public-internal helper (``_format_feedback_snippet``) so the CLI tests can exercise it directly without round-tripping through Click; a rough sketch follows below. The safety-net "no per-metric detail available" branch also carries the snippet now — failing sessions with no metric details still surface the judge's reason rather than a generic placeholder.
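For intuition, a minimal sketch of the collapse-and-truncate rule described above. The shipped ``_format_feedback_snippet`` may differ in signature and edge-case handling; in particular, whether the ellipsis counts toward the 120-character budget is an assumption here.

```python
import re
from typing import Optional


def _format_feedback_snippet(
    feedback: Optional[str], max_chars: int = 120
) -> Optional[str]:
    """Collapse whitespace in a judge justification and bound its length."""
    if feedback is None or not feedback.strip():
        return None  # None / empty / whitespace-only -> omit the feedback field
    # Collapse newlines, tabs, and doubled spaces so the justification stays
    # on a single CI log line.
    collapsed = re.sub(r"\s+", " ", feedback).strip()
    if len(collapsed) <= max_chars:
        return collapsed
    # Assumption: the U+2026 ellipsis fits inside the max_chars budget.
    return collapsed[: max_chars - 1] + "\u2026"
```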
## F4 — ``--strict`` help/docs match shipped behavior

The previous ``--strict`` help text ("Fail sessions with unparseable judge output") promised a broader effect than what ships. In reality:

- API-fallback parse errors already coerce to ``score=0.0`` and fail any non-zero threshold without ``--strict``.
- ``--strict`` only changes AI.GENERATE rows whose typed output is empty/NULL: without it they're silently skipped (the default ``passed=True`` for empty-scores SessionScores); with it they're explicitly failed and counted under ``report.details``.

Updated the CLI help string to name AI.GENERATE specifically and to call out that API-fallback parse errors don't need the flag. Rewrote ``SDK.md §4 Strict Mode`` to lead with the AI.GENERATE path scope before getting into operational counters.

## Tests (10 new)

- LLM-judge FAIL line carries the feedback snippet, with the judge's actual prose visible in CI output.
- Long feedback truncates at the 120-char cap with U+2026.
- Multi-line feedback collapses to a single CI log line.
- Code-based metric failures emit no ``feedback`` field (regression guard against bleed from the LLM path).
- Direct unit tests on ``_format_feedback_snippet``: ``None`` / empty / whitespace-only inputs return ``None``; short inputs pass through; whitespace runs collapse; default and custom ``max_chars`` truncation respected.

CHANGELOG entry added under ``[Unreleased]`` (Added + Changed sections). Required publish blocker for blog post #3 (#82). Pairs with #42 as PR 2 of 2 in that series.

Ref: #82, #51.

* docs(strict): describe parse-error visibility, not pass/fail flipping

Reviewer flagged that the prior --strict rewrite still mischaracterized the AI.GENERATE behavior: it claimed strict=False silently skips empty/NULL typed-output rows and leaves them passing, then says strict=True flips them to failing. That's wrong on both ends. Actual code path:

- ``_ai_generate_judge`` (and ``_bqml_judge``) compute ``passed = bool(scores) and all(score >= threshold ...)``. Empty-scores rows already have passed=False — locked in by TestFalsePassFix.test_empty_score_fails on the runtime side.
- ``_apply_strict_mode`` walks the report and adds ``details['parse_error']=True`` to every empty-scores session, plus a report-level ``parse_errors`` / ``parse_error_rate`` counter. It does not change any session's pass/fail status.
- The API-fallback path coerces malformed output to ``score=0.0``, so its parse failures fail as low-score failures and don't surface through ``--strict``.

Net: ``--strict`` is a visibility knob, not a gate-affecting flag. For pass/fail-only CI consumers it's a no-op; for dashboards or investigations it lets you tell "low score" failures apart from "no parseable score" failures.

Updated:

- ``cli.py:strict`` help text — replaces the misleading "fail / silently skipped" framing with the parse-error metadata framing the code actually implements.
- ``SDK.md §4 Strict Mode`` — leads with "adds parse-error visibility, does not flip pass/fail," then enumerates exactly what ``_apply_strict_mode`` does.
- ``CHANGELOG.md`` ``[Unreleased]`` entry — same correction.

No code or test changes — docs/help only. Existing pytest run still 2080 passed, 4 skipped.

Ref: PR #44 review feedback.

* fix(llm-judge): make AI.GENERATE template execute against current BigQuery (#45)

* fix(llm-judge): make AI.GENERATE template execute against current BigQuery

Earlier SDK versions emitted

```sql
FROM session_traces,
AI.GENERATE(
  prompt => ...,
  endpoint => ...,
  model_params => JSON '{"temperature": 0.1, "max_output_tokens": 500}',
  output_schema => 'score INT64, justification STRING'
) AS result
```

That shape no longer parses against current BigQuery: AI.GENERATE is a scalar function (it returns a ``STRUCT<result, full_response, status, ...>`` shaped by ``output_schema``), not a table-valued function. The flat ``model_params`` dict is also rejected with ``does not conform to the GenerateContent request body`` — the new shape is ``{"generationConfig": {"temperature": ..., "maxOutputTokens": ...}}``. Mocked unit tests bypassed real query execution and never noticed.

## What this PR does

1. Replaces the table-valued ``FROM ..., AI.GENERATE(...)`` with a scalar ``SELECT AI.GENERATE(...).score, ...`` form, wrapped in an outer SELECT that flattens the struct to columns the existing ``_ai_generate_judge`` row-iteration code already reads (``score``, ``justification``, ``gen_status``).
2. Switches ``model_params`` to the ``{"generationConfig": {...}}`` shape required by current ``AI.GENERATE``.
3. Bumps ``maxOutputTokens`` from 500 to 1024 so the judge has enough room to return both a score and a justification on verbose criteria; the previous 500-token cap clipped some responses at MAX_TOKENS.
4. Adds ``evaluators.render_ai_generate_judge_query(...)`` so the optional ``connection_id`` argument can be inlined into the SQL only when supplied (usage sketched after this list). ``connection_id`` is now an *optional escape hatch*: omit it and AI.GENERATE runs on the submitter's end-user credentials; supply ``Client(..., connection_id="us.foo")`` and the call routes through that connection's service account.
5. Plumbs ``self.connection_id`` from ``Client`` through to ``_ai_generate_judge`` so a connection set at client construction propagates to the judge SQL automatically.
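A hedged usage sketch of the new entry point. The import path and every argument value below (project, dataset, table, filter, endpoint, connection name) are illustrative assumptions, not values taken from the SDK's documentation:

```python
from bigquery_agent_analytics import evaluators  # assumed import path

# Default form: no connection_id, so AI.GENERATE runs on the credentials of
# whichever account submits the job.
sql = evaluators.render_ai_generate_judge_query(
    project="my-project",            # illustrative values only
    dataset="agent_analytics",
    table="agent_events",
    where="session_id IS NOT NULL",  # shape of this filter is an assumption
    endpoint="gemini-2.0-flash",
)

# Escape hatch: route the call through a BigQuery connection's service
# account by inlining connection_id => '...' into the rendered SQL.
sql_via_connection = evaluators.render_ai_generate_judge_query(
    project="my-project",
    dataset="agent_analytics",
    table="agent_events",
    where="session_id IS NOT NULL",
    endpoint="gemini-2.0-flash",
    connection_id="us.my_ai_connection",
)
```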
## Live verification

Adds ``tests/test_ai_generate_judge_live.py`` with three gated tests that submit the exact rendered SQL to a real BigQuery project. Skipped by default; set ``BQAA_RUN_LIVE_TESTS=1`` plus ``PROJECT_ID`` / ``DATASET_ID`` to opt in. Optionally set ``BQAA_AI_GENERATE_CONNECTION_ID`` to also exercise the connection-supplied form. Smoke run against ``test-project-0728-467323.agent_analytics_demo``: all 3 pass.

End-to-end ``client.evaluate(evaluator=LLMAsJudge.correctness())`` through the public API now returns ``execution_mode='ai_generate'``, real correctness scores, and real LLM justifications against live BigQuery; verified against three sessions in the sandbox project.

## Test plan

- [x] ``pytest tests/`` (mocked path) -> 2080 passed, 7 skipped (was 4 before; the 3 new live tests skip without the env gate).
- [x] ``BQAA_RUN_LIVE_TESTS=1 ... pytest tests/test_ai_generate_judge_live.py`` -> 3 passed against live BQ.
- [x] End-to-end ``Client.evaluate`` smoke against live BQ with ``LLMAsJudge.correctness()``: real scores + justifications returned, ``execution_mode == 'ai_generate'``.

## Why this matters

This is a quiet publish blocker for blog post #3 (#82). The post's central claim — "AI.GENERATE keeps evaluation in BigQuery" — was false against the released SDK, because the SDK couldn't actually execute its own AI.GENERATE template. The mock-only test coverage let the bug ship in 0.2.x. The new gated live tests catch this class of mock-divergence bug going forward.

Ref: #82, #51.

* docs(changelog): merge duplicate Unreleased Added sections

Reviewer flagged that PR #45's CHANGELOG edits left the [Unreleased] section with two ### Added blocks and the prior "AI.GENERATE / ML.GENERATE_TEXT prompt template" bug fix stranded between them under the wrong heading. Restructured to one ### Fixed (both LLM-judge bug fixes), one ### Added (new entry points + live tests + execution_mode + split helper + feedback snippet), and one ### Changed (the strict-mode docs) — Keep-a-Changelog conformant. No content changes — just consolidating duplicate headers and moving the orphan bullet under the right section.

Ref: PR #45 review feedback.

---------

Co-authored-by: Haiyuan Cao <haiyuan@google.com>
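The live tests described under ``## Live verification`` above are opt-in via environment variables. One common way to wire such a gate in pytest looks roughly like this; it is a sketch under those assumptions, not the repository's actual test file:

```python
import os

import pytest

# Opt-in gate: the module is skipped unless BQAA_RUN_LIVE_TESTS=1 is set,
# alongside PROJECT_ID / DATASET_ID naming the target BigQuery sandbox.
pytestmark = pytest.mark.skipif(
    os.environ.get("BQAA_RUN_LIVE_TESTS") != "1",
    reason="live BigQuery tests are opt-in; set BQAA_RUN_LIVE_TESTS=1",
)


def test_rendered_judge_sql_is_accepted_by_bigquery():
    # Sketch only: the real tests render the judge SQL and submit it to the
    # project/dataset named by PROJECT_ID / DATASET_ID, optionally exercising
    # BQAA_AI_GENERATE_CONNECTION_ID for the connection-supplied form.
    ...
```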
1 parent 5db449c commit a0b21fb

4 files changed

Lines changed: 352 additions & 37 deletions

CHANGELOG.md

Lines changed: 75 additions & 18 deletions
```diff
@@ -9,27 +9,84 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Fixed
 
-- **LLM-as-Judge AI.GENERATE / ML.GENERATE_TEXT now uses the full Python
-  prompt template.** Previously both BQ-native paths sent only
-  ``prompt_template.split('{trace_text}')[0]`` to BigQuery, silently
-  dropping every instruction that followed the placeholders — including
-  the per-criterion output-format spec the judge model needs to score
-  consistently with the API-fallback path. The two BQ paths and the
-  Python API path now produce comparable scores against the same prompt.
+- **LLM-as-Judge AI.GENERATE path now executes against current
+  BigQuery.** Earlier versions emitted a table-valued
+  ``FROM session_traces, AI.GENERATE(...) AS result`` shape with
+  ``output_schema`` and a flat ``model_params`` dict. Current
+  ``AI.GENERATE`` is a scalar function that returns a STRUCT;
+  the table-valued form raises ``Table-valued function not found``
+  and the flat ``model_params`` raises ``does not conform to the
+  GenerateContent request body``. Mocked unit tests passed because
+  they bypassed real query execution. The SDK now renders a
+  ``SELECT AI.GENERATE(...).score, ...`` query with a
+  ``generationConfig``-wrapped ``model_params`` and ``output_schema``
+  on the scalar form, runs against live BigQuery, and unwraps the
+  returned struct's ``score`` / ``justification`` / ``status``
+  fields.
+- **LLM-as-Judge AI.GENERATE / ML.GENERATE_TEXT now uses the full
+  Python prompt template.** Previously both BQ-native paths sent
+  only ``prompt_template.split('{trace_text}')[0]`` to BigQuery,
+  silently dropping every instruction that followed the
+  placeholders — including the per-criterion output-format spec
+  the judge model needs to score consistently with the
+  API-fallback path. The two BQ paths and the Python API path now
+  produce comparable scores against the same prompt.
 
 ### Added
 
-- ``EvaluationReport.details["execution_mode"]`` is now populated for
-  LLM-as-Judge runs with one of ``ai_generate``, ``ml_generate_text``,
-  ``api_fallback``, or ``no_op`` — matching the value space the
-  categorical evaluator already exposes. When an earlier tier raised
-  before a later tier succeeded, ``details["fallback_reason"]`` carries
-  the chained exception messages in attempt order, so CI and dashboards
-  can audit which path actually ran.
-- ``evaluators.split_judge_prompt_template(prompt_template)`` is the
-  helper the SQL paths use to safely substitute the template into
-  ``CONCAT()``; exposed publicly for downstream code that needs the
-  same shape.
+- ``evaluators.render_ai_generate_judge_query(...)`` is the new
+  entry point that builds the AI.GENERATE batch SQL.
+  ``connection_id`` is optional — when omitted the call uses
+  end-user credentials; when supplied it inlines the
+  ``connection_id =>`` argument so callers can route through a
+  service-account-owned connection when their environment
+  requires it.
+- ``Client.connection_id`` already existed; it is now plumbed
+  through to ``_ai_generate_judge`` so a connection set at client
+  construction propagates to the judge SQL automatically.
+- Live BigQuery integration tests for the LLM-judge AI.GENERATE
+  path (``tests/test_ai_generate_judge_live.py``). Skipped by
+  default; opt in with ``BQAA_RUN_LIVE_TESTS=1`` plus
+  ``PROJECT_ID`` / ``DATASET_ID``. Three tests cover SQL parse
+  acceptance, expected result-schema column names, and the
+  ``connection_id`` escape hatch when
+  ``BQAA_AI_GENERATE_CONNECTION_ID`` is set. Catches the class of
+  mock-divergence bug that let the prior broken template ship.
+- ``EvaluationReport.details["execution_mode"]`` is now populated
+  for LLM-as-Judge runs with one of ``ai_generate``,
+  ``ml_generate_text``, ``api_fallback``, or ``no_op`` — matching
+  the value space the categorical evaluator already exposes. When
+  an earlier tier raised before a later tier succeeded,
+  ``details["fallback_reason"]`` carries the chained exception
+  messages in attempt order, so CI and dashboards can audit which
+  path actually ran.
+- ``evaluators.split_judge_prompt_template(prompt_template)`` is
+  the helper the SQL paths use to safely substitute the template
+  into ``CONCAT()``; exposed publicly for downstream code that
+  needs the same shape.
+- ``bq-agent-sdk evaluate --exit-code`` FAIL lines now carry a
+  bounded ``feedback="…"`` snippet drawn from
+  ``SessionScore.llm_feedback`` for LLM-judge failures. The
+  snippet collapses internal whitespace to a single space,
+  truncates to 120 characters with an ellipsis, and is omitted
+  entirely for code-based metrics (which leave ``llm_feedback``
+  empty). CI logs now explain *why* the judge said the session
+  failed without forcing the reader to chase the JSON output.
+
+### Changed
+
+- ``--strict`` help text and ``SDK.md §4`` clarified to match shipped
+  behavior. ``--strict`` is a *visibility* knob — it stamps
+  ``details['parse_error']=True`` on AI.GENERATE/ML.GENERATE_TEXT
+  judge rows whose ``scores`` dict is empty, and adds a report-level
+  ``parse_errors`` counter. It does **not** flip any session's
+  pass/fail outcome: both BQ-native judge methods compute ``passed``
+  as ``bool(scores) and all(...)``, so empty-scores rows already
+  fail without the flag. API-fallback parse errors coerce to
+  ``score=0.0``, so they fail as low-score failures rather than
+  parse errors. For pass/fail-only CI consumers ``--strict`` is a
+  no-op; reach for it when a dashboard needs to tell "no parseable
+  score" apart from "low score."
 
 ## [0.2.2] - 2026-04-24
```

src/bigquery_agent_analytics/client.py

Lines changed: 3 additions & 1 deletion
```diff
@@ -78,6 +78,7 @@
 from .evaluators import EvaluationReport
 from .evaluators import LLM_JUDGE_BATCH_QUERY
 from .evaluators import LLMAsJudge
+from .evaluators import render_ai_generate_judge_query
 from .evaluators import SESSION_SUMMARY_QUERY
 from .evaluators import SessionScore
 from .evaluators import split_judge_prompt_template
@@ -1089,12 +1090,13 @@ def _ai_generate_judge(
             bq.ScalarQueryParameter("judge_prompt_suffix", "STRING", suffix),
         ]
 
-        query = AI_GENERATE_JUDGE_BATCH_QUERY.format(
+        query = render_ai_generate_judge_query(
            project=self.project_id,
            dataset=self.dataset_id,
            table=table,
            where=where,
            endpoint=self.endpoint,
+           connection_id=self.connection_id,
        )
        job_config = bq.QueryJobConfig(
            query_parameters=judge_params,
```

src/bigquery_agent_analytics/evaluators.py

Lines changed: 71 additions & 18 deletions
```diff
@@ -862,7 +862,7 @@ def sentiment(
     LIMIT @trace_limit
 """
 
-AI_GENERATE_JUDGE_BATCH_QUERY = """\
+_AI_GENERATE_JUDGE_BATCH_QUERY_TEMPLATE = """\
 WITH session_traces AS (
   SELECT
     session_id,
@@ -891,25 +891,78 @@ def sentiment(
     session_id,
     trace_text,
     final_response,
-    result.*
-  FROM session_traces,
-  AI.GENERATE(
-    -- Substitute the full Python prompt_template at SQL time:
-    -- prefix ++ trace_text ++ middle ++ final_response ++ suffix.
-    -- Each segment is a separate query parameter so we preserve the
-    -- exact Python template (including the per-criterion output-format
-    -- spec) that the API-fallback path uses.
-    prompt => CONCAT(
-      @judge_prompt_prefix, trace_text,
-      @judge_prompt_middle, COALESCE(final_response, 'N/A'),
-      @judge_prompt_suffix
-    ),
-    endpoint => '{endpoint}',
-    model_params => JSON '{{"temperature": 0.1, "max_output_tokens": 500}}',
-    output_schema => 'score INT64, justification STRING'
-  ) AS result
+    gen.score AS score,
+    gen.justification AS justification,
+    gen.status AS gen_status
+  FROM (
+    SELECT
+      session_id,
+      trace_text,
+      final_response,
+      AI.GENERATE(
+        -- The Python prompt template is rebuilt at SQL time:
+        -- prefix ++ trace_text ++ middle ++ final_response ++ suffix
+        -- Each segment is a separate query parameter so AI.GENERATE
+        -- sees the exact full Python template (including the
+        -- per-criterion output-format spec) the API-fallback path uses.
+        prompt => CONCAT(
+          @judge_prompt_prefix, trace_text,
+          @judge_prompt_middle, COALESCE(final_response, 'N/A'),
+          @judge_prompt_suffix
+        ),
+        endpoint => '{endpoint}',{connection_arg}
+        model_params => JSON '{{"generationConfig": {{"temperature": 0.1, "maxOutputTokens": 1024}}}}',
+        output_schema => 'score INT64, justification STRING'
+      ) AS gen
+    FROM session_traces
+  )
 """
 
+
+def render_ai_generate_judge_query(
+    *,
+    project: str,
+    dataset: str,
+    table: str,
+    where: str,
+    endpoint: str,
+    connection_id: Optional[str] = None,
+) -> str:
+    """Render the AI.GENERATE judge batch query for a given config.
+
+    ``AI.GENERATE`` is BigQuery's scalar generative function (it returns a
+    ``STRUCT<score, justification, full_response, status, ...>`` shaped
+    by ``output_schema``). The function call lives inside a regular
+    ``SELECT`` — it is *not* a table-valued function, so the surrounding
+    ``FROM session_traces, AI.GENERATE(...)`` lateral-join syntax used
+    by older SDK versions does not parse against current BigQuery.
+
+    ``connection_id`` is optional. When supplied (e.g.
+    ``"us.bqaa_ai_generate"``) the call uses that connection's service
+    account; when omitted, AI.GENERATE runs against the end-user
+    credentials of whichever account submits the job. Both shapes are
+    documented forms of the same function.
+    """
+    if connection_id:
+        connection_arg = f"\n connection_id => '{connection_id}',"
+    else:
+        connection_arg = ""
+    return _AI_GENERATE_JUDGE_BATCH_QUERY_TEMPLATE.format(
+        project=project,
+        dataset=dataset,
+        table=table,
+        where=where,
+        endpoint=endpoint,
+        connection_arg=connection_arg,
+    )
+
+
+# Public alias kept for downstream code that imports the raw template
+# string (e.g. for inspection / docs). Callers building queries should
+# use ``render_ai_generate_judge_query`` instead so the optional
+# ``connection_id`` arg is wired correctly.
+AI_GENERATE_JUDGE_BATCH_QUERY = _AI_GENERATE_JUDGE_BATCH_QUERY_TEMPLATE
+
 # Legacy template kept for backward compatibility with pre-created
 # BQ ML models.
 _LEGACY_LLM_JUDGE_BATCH_QUERY = """\
```

0 commit comments