Commit 7baa2b0
* fix(llm-judge): full prompt parity + execution_mode/fallback_reason
Two coupled fixes for LLM-as-Judge: the BQ-native paths now see the
full Python prompt template, and the resulting report stamps which
path actually ran (and why earlier tiers fell back when applicable).
## Prompt parity (F1)
`_ai_generate_judge` and `_bqml_judge` previously sent only
`prompt_template.split("{trace_text}")[0]` to BigQuery — i.e., the
prefix up to the first placeholder. Everything after `{trace_text}`
in the Python template (including the per-criterion JSON output
spec the judge model needs to score consistently) was silently
dropped on the SQL paths. That made AI.GENERATE / ML.GENERATE_TEXT
score against a different prompt than the API-fallback path, which
uses the whole template via `str.format(...)`.
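To make the mismatch concrete, here is a minimal sketch using a hypothetical template. The template text and the `TEMPLATE` name are illustrative assumptions, not the project's built-in prompts:

```python
# Hypothetical judge template; the wording is illustrative only.
TEMPLATE = (
    "Score the response for correctness.\n"
    "Trace:\n{trace_text}\n"
    "Response:\n{final_response}\n"
    'Reply as JSON: {{"correctness": <1-5>}}\n'
)

# Old SQL-path behaviour: only the text before the first placeholder survives.
sql_prefix = TEMPLATE.split("{trace_text}")[0]

# API-fallback behaviour: the whole template is rendered via str.format.
api_prompt = TEMPLATE.format(trace_text="<TRACE>", final_response="<RESPONSE>")

print(repr(sql_prefix))
print("JSON spec in SQL prefix:", "Reply as JSON" in sql_prefix)
print("JSON spec in API prompt:", "Reply as JSON" in api_prompt)
```

Under these assumptions the SQL prefix stops right after `Trace:`, so the JSON output spec never reaches the BigQuery-side model.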
Fix:
- New helper `evaluators.split_judge_prompt_template(template)` that
  formats the template with `\x00`-bracketed sentinels for both
  placeholders, then partitions the result into
  `(prefix, middle, suffix)`. Sentinels avoid clashing with literal
  template content; running the format pass ensures `{{...}}`
  escapes are correctly un-escaped before partitioning, so the SQL
  CONCAT sees the same string the API path produces.
- `AI_GENERATE_JUDGE_BATCH_QUERY` and `_LEGACY_LLM_JUDGE_BATCH_QUERY`
  now build the prompt as `CONCAT(@judge_prompt_prefix, trace_text,
  @judge_prompt_middle, COALESCE(final_response, 'N/A'),
  @judge_prompt_suffix)`, taking three parameters instead of one.
- Both judge methods in `client.py` swap `judge_prompt` for the
  three new parameters.
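The sentinel-and-partition idea can be sketched as follows. This is not the helper's actual source: the function names `split_both_placeholders` and `sql_concat`, the sentinel strings, and the sample template are assumptions for illustration, and only the case where both placeholders are present (trace first) is covered.

```python
from typing import Optional, Tuple

# Sentinels bracketed by NUL bytes; assumed never to occur in a real template.
_TRACE = "\x00TRACE\x00"
_RESPONSE = "\x00RESPONSE\x00"


def split_both_placeholders(template: str) -> Tuple[str, str, str]:
    """Split a template that contains {trace_text} before {final_response}."""
    # Formatting first un-escapes {{...}} exactly as the API path would.
    rendered = template.format(trace_text=_TRACE, final_response=_RESPONSE)
    prefix, _, rest = rendered.partition(_TRACE)
    middle, _, suffix = rest.partition(_RESPONSE)
    return prefix, middle, suffix


def sql_concat(prefix: str, trace_text: str, middle: str,
               final_response: Optional[str], suffix: str) -> str:
    """Python stand-in for the SQL CONCAT with COALESCE(final_response, 'N/A')."""
    response = final_response if final_response is not None else "N/A"
    return prefix + trace_text + middle + response + suffix


template = 'Trace:\n{trace_text}\nResponse:\n{final_response}\nJSON: {{"score": 1}}'
parts = split_both_placeholders(template)
rebuilt = sql_concat(parts[0], "T", parts[1], "R", parts[2])
expected = template.format(trace_text="T", final_response="R")
print(rebuilt == expected)
```

Splitting the already-formatted string, rather than the raw template, is what keeps escaped braces in the suffix identical on both paths.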
## execution_mode + fallback_reason (F2/F3)
`_evaluate_llm_judge` now stamps `report.details["execution_mode"]`
with one of `ai_generate`, `ml_generate_text`, `api_fallback`, or
`no_op` — matching the value-space the categorical evaluator
already uses. When an earlier tier raises before a later tier
succeeds, `report.details["fallback_reason"]` carries the chained
exception messages in attempt order so CI gates and dashboards can
audit which path actually ran. The categorical-style underscore naming
is intentional: readers of both LLM-judge and categorical reports
see the same vocabulary.
## Tests
- Prompt parity: assert AI.GENERATE and ML.GENERATE_TEXT receive
  three `judge_prompt_{prefix,middle,suffix}` params instead of the
  single `judge_prompt`, and that concatenation reproduces the
  full Python template (including the JSON output spec).
- Execution mode: assert each of ai_generate / ml_generate_text /
  api_fallback fires under the right cascade conditions, and that
  `fallback_reason` names the prior tiers in attempt order.
- `split_judge_prompt_template`: round-trip, missing-placeholder
  fallback paths, full-template-as-prefix when neither placeholder
  is present.
CHANGELOG entry added under `[Unreleased]`.
Required publish blocker for blog post #3 (#82). PR #2 in this
series will tighten the LLM-judge `evaluate --exit-code` FAIL
output to surface criterion + threshold + bounded `llm_feedback`
snippet.
Ref: #82, #51.
* fix(llm-judge): keep synthesized labels next to their values
Reviewer flagged that ``split_judge_prompt_template`` mishandles
custom templates with one placeholder. The SQL CONCAT runs
prefix ++ trace_text ++ middle ++ final_response ++ suffix
so a synthesized label for an absent placeholder must end up
*immediately before* the value it labels. Earlier fallback
branches placed the labels on the wrong side:
- ``{trace_text}`` only: the ``Response:`` label landed AFTER the
  injected response value.
- ``{final_response}`` only: the ``Trace:`` label landed AFTER the
  injected trace value, and the user's prompt prose ended up
  AFTER the trace instead of before it.
- No placeholders: labels were appended after their values for both
  trace and response.
Built-in correctness/hallucination/sentiment templates were
unaffected because they declare both placeholders explicitly, so
the dual-placeholder branch (which was correct) handled them.
Fix: rewrite the three fallback branches so the SQL CONCAT yields
``...<user prompt>...\nTrace:\n<TRACE>\nResponse:\n<RESPONSE>...``
in every case. Updated the docstring to document the rebuild
contract precisely.
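One way the corrected branches could look, as a sketch only: `split_with_fallbacks`, `rebuild`, the sentinel strings, and the exact label text are assumptions modelled on the rebuild contract above, not the committed code. It also assumes that when both placeholders are present, the trace comes first.

```python
from typing import Tuple

_TRACE = "\x00TRACE\x00"
_RESPONSE = "\x00RESPONSE\x00"
_TRACE_LABEL = "\nTrace:\n"
_RESPONSE_LABEL = "\nResponse:\n"


def split_with_fallbacks(template: str) -> Tuple[str, str, str]:
    """Return (prefix, middle, suffix) so each synthesized label precedes its value."""
    rendered = template.format(trace_text=_TRACE, final_response=_RESPONSE)
    has_trace = _TRACE in rendered
    has_response = _RESPONSE in rendered

    if has_trace and has_response:
        prefix, _, rest = rendered.partition(_TRACE)
        middle, _, suffix = rest.partition(_RESPONSE)
        return prefix, middle, suffix
    if has_trace:
        # Response label goes in `middle`, directly before the response value.
        prefix, _, rest = rendered.partition(_TRACE)
        return prefix, rest + _RESPONSE_LABEL, ""
    if has_response:
        # User prose stays ahead of the synthesized trace label and value.
        before, _, after = rendered.partition(_RESPONSE)
        return before + _TRACE_LABEL, _RESPONSE_LABEL, after
    # Neither placeholder: the full template becomes the prefix.
    return rendered + _TRACE_LABEL, _RESPONSE_LABEL, ""


def rebuild(parts: Tuple[str, str, str], trace: str, response: str) -> str:
    """Mirror the SQL CONCAT order: prefix, trace, middle, response, suffix."""
    prefix, middle, suffix = parts
    return prefix + trace + middle + response + suffix


for template in ("Judge: {trace_text}", "Judge: {final_response}", "Judge strictly."):
    print(repr(rebuild(split_with_fallbacks(template), "<TRACE>", "<RESPONSE>")))
```

Asserting on the fully rebuilt string, rather than on a label appearing somewhere in one segment, is what catches ordering mistakes like the ones listed above.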
Tests: replaced the two-segment "label is somewhere in suffix"
assertions with full-rebuild assertions that verify ordering of
user prose, ``Trace:`` label, trace value, ``Response:`` label,
and response value. Three new regression tests cover all three
fallback branches.
Ref: PR #42 review feedback.
1 parent e948580 commit 7baa2b0
4 files changed
Lines changed: 466 additions & 19 deletions