Commit a0b21fb
fix(llm-judge): make AI.GENERATE template execute against current BigQuery (#87)
* fix(llm-judge): full prompt parity + execution_mode/fallback_reason (#42)
* fix(llm-judge): full prompt parity + execution_mode/fallback_reason
Two coupled fixes for LLM-as-Judge: the BQ-native paths now see the
full Python prompt template, and the resulting report stamps which
path actually ran (and why earlier tiers fell back when applicable).
## Prompt parity (F1)
`_ai_generate_judge` and `_bqml_judge` previously sent only
`prompt_template.split("{trace_text}")[0]` to BigQuery — i.e., the
prefix up to the first placeholder. Everything after `{trace_text}`
in the Python template (including the per-criterion JSON output
spec the judge model needs to score consistently) was silently
dropped on the SQL paths. That made AI.GENERATE / ML.GENERATE_TEXT
score against a different prompt than the API-fallback path, which
uses the whole template via `str.format(...)`.
Fix:
- New helper `evaluators.split_judge_prompt_template(template)` that
format()s the template with `\x00`-bracketed sentinels for both
placeholders, then partitions the result into
`(prefix, middle, suffix)`. Sentinels avoid clashing with literal
template content; running the format pass ensures `{{...}}`
escapes are correctly un-escaped before partitioning, so the SQL
CONCAT sees the same string the API path produces (a sketch follows this list).
- `AI_GENERATE_JUDGE_BATCH_QUERY` and `_LEGACY_LLM_JUDGE_BATCH_QUERY`
now `CONCAT(@judge_prompt_prefix, trace_text, @judge_prompt_middle,
COALESCE(final_response, 'N/A'), @judge_prompt_suffix)` — three
parameters instead of one.
- Both judge methods in `client.py` swap `judge_prompt` for the
three new parameters.
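
For illustration, a minimal sketch of the split helper. The real `evaluators.split_judge_prompt_template` also handles the single-placeholder fallback branches covered in the follow-up commit below; this sketch assumes only the two documented placeholders appear.

```python
def split_judge_prompt_template(template: str) -> tuple[str, str, str]:
    """Partition a judge prompt template into (prefix, middle, suffix)."""
    trace_token = "\x00TRACE\x00"
    response_token = "\x00RESPONSE\x00"
    # One format() pass resolves {{...}} escapes exactly as the API-fallback
    # path does, so the SQL CONCAT later reproduces the same string.
    rendered = template.format(
        trace_text=trace_token, final_response=response_token
    )
    prefix, _, rest = rendered.partition(trace_token)
    middle, _, suffix = rest.partition(response_token)
    return prefix, middle, suffix
```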
## execution_mode + fallback_reason (F2/F3)
`_evaluate_llm_judge` now stamps `report.details["execution_mode"]`
with one of `ai_generate`, `ml_generate_text`, `api_fallback`, or
`no_op` — matching the value-space the categorical evaluator
already uses. When an earlier tier raises before a later tier
succeeds, `report.details["fallback_reason"]` carries the chained
exception messages in attempt order so CI gates and dashboards can
audit which path actually ran. Categorical-style underscore naming
is intentional so that readers of both LLM-judge and categorical
reports see the same vocabulary.
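
A hedged sketch of the cascade stamping (method names other than `_ai_generate_judge` / `_bqml_judge`, and the exact signature, are assumptions; the `no_op` path is omitted):

```python
def _evaluate_llm_judge(self, evaluator, sessions):
    attempts = []
    tiers = (
        ("ai_generate", self._ai_generate_judge),
        ("ml_generate_text", self._bqml_judge),
        ("api_fallback", self._api_fallback_judge),  # assumed name
    )
    for mode, runner in tiers:
        try:
            report = runner(evaluator, sessions)
        except Exception as exc:
            # Remember why this tier fell over before trying the next one.
            attempts.append(f"{mode}: {exc}")
            continue
        report.details["execution_mode"] = mode
        if attempts:
            report.details["fallback_reason"] = "; ".join(attempts)
        return report
    raise RuntimeError("all judge tiers failed: " + "; ".join(attempts))
```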
## Tests
- Prompt parity: assert AI.GENERATE and ML.GENERATE_TEXT receive
three `judge_prompt_{prefix,middle,suffix}` params instead of the
single `judge_prompt`, and that concatenation reproduces the
full Python template (including the JSON output spec).
- Execution mode: assert each of ai_generate / ml_generate_text /
api_fallback fires under the right cascade conditions, and that
`fallback_reason` names the prior tiers in attempt order.
- `split_judge_prompt_template`: round-trip, missing-placeholder
fallback paths, full-template-as-prefix when neither placeholder
is present.
CHANGELOG entry added under `[Unreleased]`.
Required publish blocker for blog post #3 (#82). PR #2 in this
series will tighten the LLM-judge `evaluate --exit-code` FAIL
output to surface criterion + threshold + bounded `llm_feedback`
snippet.
Ref: #82, #51.
* fix(llm-judge): keep synthesized labels next to their values
Reviewer flagged that ``split_judge_prompt_template`` mishandles
custom templates with one placeholder. The SQL CONCAT runs
prefix ++ trace_text ++ middle ++ final_response ++ suffix
so a synthesized label for an absent placeholder must end up
*immediately before* the value it labels. Earlier fallback
branches placed the labels on the wrong side:
- ``{trace_text}`` only — ``Response:`` label landed AFTER the
injected response value.
- ``{final_response}`` only — ``Trace:`` label landed AFTER the
injected trace value, and the user's prompt prose ended up
AFTER the trace instead of before it.
- No placeholders — labels appended after their values for both
trace and response.
Built-in correctness/hallucination/sentiment templates were
unaffected because they declare both placeholders explicitly, so
the dual-placeholder branch (which was correct) handled them.
Fix: rewrite the three fallback branches so the SQL CONCAT yields
``...<user prompt>...\nTrace:\n<TRACE>\nResponse:\n<RESPONSE>...``
in every case. Updated the docstring to document the rebuild
contract precisely.
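
A hypothetical sketch of the trace-only branch after the fix (names and exact label formatting are assumptions; the point is where the synthesized label lands):

```python
TRACE_SENTINEL = "\x00TRACE\x00"

def _trace_only_segments(rendered: str) -> tuple[str, str, str]:
    # CONCAT order is prefix, trace, middle, response, suffix, so a
    # synthesized "Response:" label must sit at the END of `middle`,
    # immediately before the injected response value.
    prefix, _, after_trace = rendered.partition(TRACE_SENTINEL)
    middle = after_trace + "\nResponse:\n"
    suffix = ""
    return prefix, middle, suffix

# The buggy branch appended the label to `suffix`, which CONCAT places
# after the response value.
```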
Tests: replaced the two-segment "label is somewhere in suffix"
assertions with full-rebuild assertions that verify ordering of
user prose, ``Trace:`` label, trace value, ``Response:`` label,
and response value. Three new regression tests cover all three
fallback branches.
Ref: PR #42 review feedback.
* feat(evaluate): LLM-judge FAIL output with bounded feedback + --strict docs (#44)
* feat(evaluate): LLM-judge FAIL output with bounded feedback + --strict docs
Builds on the runtime-behavior fixes from #42 (now merged on main):
the report shape this PR depends on — ``SessionScore.llm_feedback``
populated by both BQ-native and API-fallback judge paths, and
``EvaluationReport.details["execution_mode"]`` distinguishing the
three tiers — is stable as of 0.2.2 + #42.
## F5 — feedback snippet on LLM-judge FAIL lines
``_emit_evaluate_failures`` now appends a bounded
``feedback="…"`` field after ``score`` / ``threshold`` whenever a
failing session's ``SessionScore.llm_feedback`` is non-empty. The
snippet:
- Collapses internal whitespace runs (newlines, tabs, doubled
spaces) to a single space, so multi-line judge justifications
stay on one CI log line.
- Truncates at 120 characters with a U+2026 ellipsis when the
collapsed string is longer.
- Is omitted entirely for code-based metrics (their
``llm_feedback`` is ``None``), so post #2's deterministic FAIL
output stays visually identical.
Pulled the formatter into a small internal helper
(``_format_feedback_snippet``) that the CLI tests can exercise
directly without round-tripping through Click.
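
A minimal sketch of that helper, assuming the behavior listed above (the exact truncation boundary is an assumption):

```python
import re

def _format_feedback_snippet(feedback, max_chars=120):
    """Collapse judge feedback onto one line and bound its length."""
    if feedback is None or not feedback.strip():
        return None
    collapsed = re.sub(r"\s+", " ", feedback).strip()
    if len(collapsed) > max_chars:
        # Truncate and append U+2026 so the CI line stays bounded.
        collapsed = collapsed[: max_chars - 1] + "\u2026"
    return collapsed
```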
The safety-net "no per-metric detail available" branch also
carries the snippet now — failing sessions with no metric details
still surface the judge's reason rather than a generic placeholder.
## F4 — ``--strict`` help/docs match shipped behavior
The previous ``--strict`` help text ("Fail sessions with
unparseable judge output") promised a broader effect than what
ships. In reality:
- API-fallback parse errors already coerce to ``score=0.0`` and
fail any non-zero threshold without ``--strict``.
- ``--strict`` only changes AI.GENERATE rows whose typed output is
empty/NULL: without it they're silently skipped (the default
``passed=True`` for empty-scores SessionScores); with it they're
explicitly failed and counted under ``report.details``.
Updated the CLI help string to name AI.GENERATE specifically and
to call out that API-fallback parse errors don't need the flag.
Rewrote ``SDK.md §4 Strict Mode`` to lead with the AI.GENERATE
path scope before getting into operational counters.
## Tests (10 new)
- LLM-judge FAIL line carries the feedback snippet, with the
judge's actual prose visible in CI output.
- Long feedback truncates at the 120-char cap with U+2026.
- Multi-line feedback collapses to a single CI log line.
- Code-based metric failures emit no ``feedback`` field
(regression guard against bleed from the LLM path).
- Direct unit tests on ``_format_feedback_snippet``: ``None`` /
empty / whitespace-only inputs return ``None``; short inputs
pass through; whitespace runs collapse; default and custom
``max_chars`` truncation respected.
CHANGELOG entry added under ``[Unreleased]`` (Added + Changed
sections).
Required publish blocker for blog post #3 (#82). Pairs with #42 as
PR 2 of 2 in that series.
Ref: #82, #51.
* docs(strict): describe parse-error visibility, not pass/fail flipping
Reviewer flagged that the prior --strict rewrite still mischaracterized
the AI.GENERATE behavior: it claimed that strict=False silently skips
empty/NULL typed-output rows and leaves them passing, and that
strict=True flips them to failing. That's wrong on both ends.
Actual code path:
- ``_ai_generate_judge`` (and ``_bqml_judge``) compute
``passed = bool(scores) and all(score >= threshold ...)``.
Empty-scores rows already have passed=False — locked in by
TestFalsePassFix.test_empty_score_fails on the runtime side.
- ``_apply_strict_mode`` walks the report and adds
``details['parse_error']=True`` to every empty-scores session
plus a report-level ``parse_errors`` / ``parse_error_rate``
counter. It does not change any session's pass/fail status.
- The API-fallback path coerces malformed output to ``score=0.0``,
so its parse failures fail as low-score failures and don't
surface through ``--strict``.
Net: ``--strict`` is a visibility knob, not a gate-affecting flag.
For pass/fail-only CI consumers it's a no-op; for dashboards or
investigations it lets you tell ``low score`` failures apart from
``no parseable score`` failures.
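
For illustration, a minimal sketch of that walk (report/session attribute names beyond the quoted ``parse_error`` / ``parse_errors`` / ``parse_error_rate`` keys are assumptions):

```python
def _apply_strict_mode(report):
    parse_errors = 0
    for session in report.session_scores:  # assumed attribute name
        if not session.scores:
            # Visibility only: flag the session, never touch pass/fail.
            session.details["parse_error"] = True
            parse_errors += 1
    report.details["parse_errors"] = parse_errors
    report.details["parse_error_rate"] = (
        parse_errors / len(report.session_scores) if report.session_scores else 0.0
    )
    return report
```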
Updated:
- ``cli.py:strict`` help text — replaces the misleading
"fail/silently-skipped" framing with the parse-error metadata
framing the code actually implements.
- ``SDK.md §4 Strict Mode`` — leads with "adds parse-error
visibility, does not flip pass/fail," then enumerates exactly
what ``_apply_strict_mode`` does.
- ``CHANGELOG.md`` ``[Unreleased]`` entry — same correction.
No code or test changes — only docs/help. Existing pytest run
still 2080 passed, 4 skipped.
Ref: PR #44 review feedback.
* fix(llm-judge): make AI.GENERATE template execute against current BigQuery (#45)
* fix(llm-judge): make AI.GENERATE template execute against current BigQuery
Earlier SDK versions emitted
    FROM session_traces, AI.GENERATE(
      prompt => ...,
      endpoint => ...,
      model_params => JSON '{"temperature": 0.1, "max_output_tokens": 500}',
      output_schema => 'score INT64, justification STRING'
    ) AS result
That shape no longer parses against current BigQuery: AI.GENERATE
is a scalar function (returns ``STRUCT<result, full_response,
status, ...>`` shaped by ``output_schema``), not a table-valued
function. BigQuery also rejects the flat ``model_params`` dict with
``does not conform to the GenerateContent request body``; the
required shape is now ``{"generationConfig": {"temperature": ...,
"maxOutputTokens": ...}}``. Mocked unit tests bypassed real
query execution and never noticed.
## What this PR does
1. Replaces the table-valued ``FROM ..., AI.GENERATE(...)`` with
a scalar ``SELECT AI.GENERATE(...).score, ....`` form, wrapped
in an outer SELECT that flattens the struct to columns the
existing ``_ai_generate_judge`` row-iteration code already
reads (``score``, ``justification``, ``gen_status``).
2. Switches ``model_params`` to the
``{"generationConfig": {...}}`` shape required by current
``AI.GENERATE``.
3. Bumps ``maxOutputTokens`` from 500 to 1024 so the judge has
enough room to return both a score and a justification on
verbose criteria; the previous 500-token cap clipped some
responses to MAX_TOKENS.
4. Adds ``evaluators.render_ai_generate_judge_query(...)`` so the
optional ``connection_id`` argument can be inlined into the
SQL only when supplied. ``connection_id`` is now an
*optional escape hatch*: omit it and AI.GENERATE runs on the
submitter's end-user credentials; supply
``Client(..., connection_id="us.foo")`` and the call routes
through that connection's service account.
5. Plumbs ``self.connection_id`` from ``Client`` through to
``_ai_generate_judge`` so a connection set at client
construction propagates to the judge SQL automatically.
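
An illustrative rendering of the scalar form (table and column names are taken from the snippets quoted in this commit; parameter-vs-inline choices and the struct field paths are assumptions, not the shipped SQL):

```python
def render_ai_generate_judge_query(connection_id=None):
    # Inline the connection only when one was supplied; otherwise AI.GENERATE
    # runs on the submitter's end-user credentials.
    connection_clause = (
        f",\n          connection_id => '{connection_id}'" if connection_id else ""
    )
    return f"""
    SELECT
      session_id,
      judge.score AS score,
      judge.justification AS justification,
      judge.status AS gen_status
    FROM (
      SELECT
        session_id,
        AI.GENERATE(
          prompt => CONCAT(@judge_prompt_prefix, trace_text, @judge_prompt_middle,
                           COALESCE(final_response, 'N/A'), @judge_prompt_suffix),
          endpoint => @endpoint,
          model_params => JSON '{{"generationConfig": {{"temperature": 0.1, "maxOutputTokens": 1024}}}}',
          output_schema => 'score INT64, justification STRING'{connection_clause}
        ) AS judge
      FROM session_traces
    )
    """
```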
## Live verification
Adds ``tests/test_ai_generate_judge_live.py`` with three gated
tests that submit the exact rendered SQL to a real BigQuery
project. Skipped by default; set ``BQAA_RUN_LIVE_TESTS=1`` plus
``PROJECT_ID`` / ``DATASET_ID`` to opt in. Optionally set
``BQAA_AI_GENERATE_CONNECTION_ID`` to also exercise the
connection-supplied form. Smoke run against
``test-project-0728-467323.agent_analytics_demo``: all 3 pass.
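
The gate itself is plain pytest machinery; a sketch of the opt-in skip (the real module may express it per-test instead):

```python
import os
import pytest

# Skipped by default; opt in with BQAA_RUN_LIVE_TESTS=1 plus PROJECT_ID /
# DATASET_ID (and optionally BQAA_AI_GENERATE_CONNECTION_ID).
pytestmark = pytest.mark.skipif(
    os.environ.get("BQAA_RUN_LIVE_TESTS") != "1",
    reason="live BigQuery tests are opt-in via BQAA_RUN_LIVE_TESTS=1",
)
```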
End-to-end ``client.evaluate(evaluator=LLMAsJudge.correctness())``
through the public API now returns ``execution_mode='ai_generate'``,
real correctness scores, and real LLM justifications against live
BigQuery; verified against three sessions in the sandbox project.
## Test plan
- [x] ``pytest tests/`` (mocked path) -> 2080 passed, 7 skipped
(was 4 before; the 3 new live tests skip without the env gate).
- [x] ``BQAA_RUN_LIVE_TESTS=1 ... pytest tests/test_ai_generate_judge_live.py``
-> 3 passed against live BQ.
- [x] End-to-end ``Client.evaluate`` smoke against live BQ with
``LLMAsJudge.correctness()``: real scores + justifications
returned, ``execution_mode == 'ai_generate'``.
## Why this matters
This is a quiet publish blocker for blog post #3 (#82). The post's
central claim — "AI.GENERATE keeps evaluation in BigQuery" — was
false against the released SDK because the SDK couldn't actually
execute its own AI.GENERATE template. The mock-only test coverage
let the bug ship in 0.2.x. The new gated live test catches this
class of mock-divergence going forward.
Ref: #82, #51.
* docs(changelog): merge duplicate Unreleased Added sections
Reviewer flagged that PR #45's CHANGELOG edits left the
[Unreleased] section with two ### Added blocks and the prior
"AI.GENERATE / ML.GENERATE_TEXT prompt template" bug fix
stranded between them under the wrong heading.
Restructured to one ### Fixed (both LLM-judge bug fixes), one
### Added (new entry points + live tests + execution_mode +
split helper + feedback snippet), and one ### Changed (the
strict-mode docs) — Keep-a-Changelog conformant.
No content changes — just consolidating duplicate headers and
moving the orphan bullet under the right section.
Ref: PR #45 review feedback.
---------
Co-authored-by: Haiyuan Cao <haiyuan@google.com>
4 files changed (under src/bigquery_agent_analytics and tests): 352 additions, 37 deletions.