Add EvalBench bridge: import EvalBench BigQuery runs into a BQAA-owned mirror table
Cross-repo context
This is the BQAA-side counterpart to GoogleCloudPlatform/evalbench#357.
EvalBench is the agent benchmark harness. It writes scenario results to evalbench.results and evalbench.scores in BigQuery via its existing reporting/bqstore.py. EvalBench answers "can the agent complete this scenario?"
BQAA is the trace warehouse and semantic evaluation layer. It answers "what happened inside the run, and how does quality trend over runs?"
This issue tracks the work to let BQAA read EvalBench's existing BigQuery output and apply BQAA's evaluation surface (LLMAsJudge, scorecards, CI gates) over those rows — without coupling EvalBench to the ADK plugin's agent_events schema.
The full architectural rationale (pull-first vs push-first, two-direction bridge framing, EvalBench-side phasing) lives in evalbench#357. This issue scopes the SDK-side implementation only.
Why this lives in BQAA
The work is read-only against EvalBench's existing BigQuery output. No EvalBench runtime code change is required for the MVP — only docs/examples on their side.
The SDK already has the evaluation surface (Client.evaluate(...), LLMAsJudge.hallucination/correctness/sentiment, TraceFilter) and BigQuery query plumbing (location, project_id, dataset_id, labels). Adding an importer is additive.
Pull-first avoids the schema-coupling problem of having EvalBench write to the ADK plugin's agent_events table.
Scope
In scope (this issue)
A new optional submodule bigquery_agent_analytics.evalbench with an EvalBenchRun class.
A BQAA-owned mirror-table writer (synthetic agent_events rows materialized into a separate, importer-owned table).
Two CLI commands: evalbench-import and evalbench-score.
Tests proving Client.get_session_trace(...) and Client.evaluate(LLMAsJudge.hallucination(...), ...) work against the mirror table.
One end-to-end docs example using datasets/gemini-cli-tools/ from EvalBench.
Importer-owned evalbench_scores_imported table for EvalBench scorer rows (until unified evaluation_results lands; see BigQuery Agent Analytics Roadmap #96).
Out of scope (this issue)
Push-based EvalBench reporter (reporting/bqaa.py in EvalBench). Tracked as a deferred phase in evalbench#357.
Direction B (EvalBench reads BQAA agent_events as eval input, or LLMRater mode: ai_generate). Tracked in evalbench#357 Phase 5.
Compatibility view from evalbench_scores_imported into unified evaluation_results. Follow-up once BigQuery Agent Analytics Roadmap #96 lands the schema.
Modifying Client.evaluate(...) itself. The importer should make existing APIs work, not change them.
Module: bigquery_agent_analytics.evalbench
New optional submodule. Importing it should not require new runtime dependencies beyond what BQAA already pulls in (google-cloud-bigquery, google-genai).
```python
# src/bigquery_agent_analytics/evalbench.py
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalBenchRun:
    """One EvalBench job materialized as importable BQAA trace rows."""

    project_id: str
    evalbench_dataset: str
    job_id: str
    location: Optional[str] = None  # matches EvalBench BigQueryReporter dataset_location

    @classmethod
    def from_bigquery(
        cls,
        *,
        project_id: str,
        evalbench_dataset: str,
        job_id: str,
        location: Optional[str] = None,
    ) -> "EvalBenchRun":
        """Loads the run's results + scores from BigQuery into memory.

        MVP scope: one job_id at a time, loaded into memory. Source queries
        MUST filter by job_id at the SQL level (WHERE job_id = @job_id) so
        scan cost stays bounded even when evalbench.results grows.

        Future: stream/page large runs without loading everything in memory.
        """
        ...

    def materialize(
        self,
        *,
        target_project: str,
        target_dataset: str,
        target_table: str = "evalbench_agent_events",
        scores_table: str = "evalbench_scores_imported",
        write_disposition: str = "WRITE_APPEND",
    ) -> "MaterializeResult":
        """Writes synthetic BQAA trace rows + scorer rows into BQAA-owned
        mirror tables. Must NOT write into the ADK plugin's production
        agent_events table.

        Idempotency: WRITE_APPEND will produce duplicate rows on repeated
        imports of the same job_id unless the writer first deletes existing
        rows where attributes.experiment_id = job_id. MVP behavior:

        - WRITE_APPEND   → delete-by-experiment_id before insert (idempotent)
        - WRITE_TRUNCATE → drop and re-create the table (idempotent, full reset)

        See open question #3 for the recommended CI default.
        """
        ...
```
Mirror-table contract (authoritative)
The mirror table must satisfy BQAA's existing trace and evaluation queries. Verified against:
src/bigquery_agent_analytics/client.py:118–139 (_GET_TRACE_QUERY)
src/bigquery_agent_analytics/client.py:141–170 (_LIST_TRACES_QUERY)
src/bigquery_agent_analytics/evaluators.py:786+ (SESSION_SUMMARY_QUERY)
src/bigquery_agent_analytics/trace.py:592–600 (TraceFilter.to_sql_conditions)
Required top-level columns

| Column | Type | Read by |
|---|---|---|
| session_id | — | Client.get_session_trace(session_id) and trace grouping |
| event_type | — | USER_MESSAGE_RECEIVED / LLM_RESPONSE / AGENT_COMPLETED / TOOL_STARTING / TOOL_COMPLETED |
| timestamp | — | start_time / end_time filters |
| agent | — | agent_id filter |
| invocation_id | — | _GET_TRACE_QUERY projection (NULL acceptable) |
| trace_id | — | _GET_TRACE_QUERY projection (NULL acceptable) |
| span_id | — | _GET_TRACE_QUERY projection (NULL acceptable) |
| parent_span_id | — | _GET_TRACE_QUERY projection (NULL acceptable) |
| user_id | — | user_id filter |
| content | JSON | judge trace_text ($.text_summary), final response ($.response), token usage ($.usage.prompt) |
| content_parts | — | _GET_TRACE_QUERY projection (empty array acceptable) |
| attributes | JSON | experiment_id is read as JSON_VALUE(attributes, '$.experiment_id') (trace.py:594). Also holds usage_metadata.prompt_token_count, input_tokens. |
| latency_ms | JSON | SESSION_SUMMARY_QUERY reads $.total_ms and $.time_to_first_token_ms |
| status | STRING | has_error filter, error-rate aggregation |
| error_message | STRING | has_error filter |
| is_truncated | BOOL | _GET_TRACE_QUERY projection (FALSE acceptable) |
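A possible DDL sketch for the mirror table. Types the contract above does not pin down (the id and string columns) are inferred here, and the partition clause follows the recommendation in open question #2 — treat this as a starting point, not the authoritative schema:

```sql
-- Sketch only: STRING types for the id columns are inferred, not mandated by
-- the contract above; align them with the ADK plugin's agent_events DDL.
CREATE TABLE IF NOT EXISTS `my-project.agent_analytics.evalbench_agent_events` (
  session_id     STRING,
  event_type     STRING,
  timestamp      TIMESTAMP,
  agent          STRING,
  invocation_id  STRING,
  trace_id       STRING,
  span_id        STRING,
  parent_span_id STRING,
  user_id        STRING,
  content        JSON,
  content_parts  JSON,   -- empty array acceptable
  attributes     JSON,
  latency_ms     JSON,
  status         STRING,
  error_message  STRING,
  is_truncated   BOOL
)
PARTITION BY DATE(timestamp);  -- recommended default; see open question #2
```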
Implementation traps
content.text_summary is required on every row that should appear in judge trace text. This is the single most important non-obvious requirement. _AI_GENERATE_JUDGE_BATCH_QUERY_TEMPLATE (evaluators.py:865–887, and the parallel ML.GENERATE_TEXT path at line 976) builds trace_text via STRING_AGG(CONCAT(event_type, ': ', COALESCE(JSON_VALUE(content, '$.text_summary'), '')) ORDER BY timestamp) and then filters with HAVING LENGTH(trace_text) > 10. If imported rows omit text_summary, trace_text collapses to event_type: repeated and gets dropped before the judge ever sees it. The mirror table can look schema-valid while Client.evaluate(LLMAsJudge.hallucination(...)) evaluates zero sessions.
experiment_id is not a top-level column. It must be written into attributes JSON as {"experiment_id": "<job_id>", ...} so TraceFilter(experiment_id=...) works without code changes (trace.py:594).
AGENT_COMPLETED is the minimum viable final-response row. BQAA's trace evaluator checks LLM_RESPONSE first, then falls back to AGENT_COMPLETED (trace_evaluator.py:226–234). Emitting AGENT_COMPLETED alone is sufficient; emitting LLM_RESPONSE too is fine but not required.
Final-response JSON key is content.$.response (resolved). The batch LLM judge picks the most recent non-null JSON_VALUE(content, '$.response') as final_response (evaluators.py:879, mirrored at evaluators.py:982). Trace extraction also accepts content.text_summary as a fallback.
Synthetic row construction (per scenario)
Every row that should contribute to judge trace_text must populate content.text_summary. Without it the row is effectively invisible to the LLM judge query.
One USER_MESSAGE_RECEIVED row with content.text = nl_prompt AND content.text_summary = nl_prompt.
One AGENT_COMPLETED row with content.response = generated_sql or final_response AND content.text_summary = <same string, or a one-line digest>.
Optionally one LLM_RESPONSE row mirroring AGENT_COMPLETED (also populating content.text_summary).
Zero or more TOOL_STARTING / TOOL_COMPLETED pairs when EvalBench has tool-call data; populate content.text_summary with "<tool_name>(<args>)" and "<tool_name> -> <result_or_error>" respectively.
All rows share session_id = "evalbench:{job_id}:{scenario_id}".
All rows have attributes.experiment_id = job_id and attributes.evalbench_scenario_id = scenario_id.
All rows have agent = "evalbench:{orchestrator}:{generator}" (read from EvalBench's experiment_config).
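A minimal sketch of this per-scenario row builder. The field names on the loaded result rows (scenario_id, nl_prompt, generated_sql, final_response, tool_calls) and the "OK" status marker are assumptions for illustration; the real EvalBench column names may differ:

```python
import json
from datetime import datetime, timedelta, timezone

def synthetic_rows(job_id: str, scenario: dict, agent_name: str) -> list[dict]:
    """Builds mirror-table rows for one scenario per the contract above."""
    if not scenario.get("nl_prompt"):
        raise ValueError("nl_prompt is required for every imported scenario")

    session_id = f"evalbench:{job_id}:{scenario['scenario_id']}"
    attributes = json.dumps({
        "experiment_id": job_id,
        "evalbench_scenario_id": scenario["scenario_id"],
    })
    base = datetime.now(timezone.utc)
    rows: list[dict] = []

    def add(event_type: str, content: dict) -> None:
        # Strictly increasing timestamps keep the judge's ORDER BY timestamp stable.
        ts = base + timedelta(milliseconds=len(rows))
        rows.append({
            "session_id": session_id,
            "event_type": event_type,
            "timestamp": ts.isoformat(),
            "agent": agent_name,
            "content": json.dumps(content),
            "attributes": attributes,
            "status": "OK",  # assumed success marker; align with the plugin's values
            "is_truncated": False,
        })

    prompt = scenario["nl_prompt"]
    add("USER_MESSAGE_RECEIVED", {"text": prompt, "text_summary": prompt})

    for call in scenario.get("tool_calls") or []:  # missing tool data → zero TOOL_* rows
        add("TOOL_STARTING", {"text_summary": f"{call['name']}({call['args']})"})
        add("TOOL_COMPLETED", {"text_summary": f"{call['name']} -> {call['result']}"})

    response = scenario.get("generated_sql") or scenario.get("final_response")
    if response:  # no renderable response (some NL2SQL rows) → skip gracefully
        add("AGENT_COMPLETED", {"response": response, "text_summary": response})
    return rows
```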
Sanity check for implementers. A correctly-imported scenario produces non-empty trace_text under this expression (the same one the judge query uses):
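```sql
-- Table path is illustrative; the STRING_AGG expression matches the judge
-- query at evaluators.py:865–887.
SELECT
  session_id,
  STRING_AGG(
    CONCAT(event_type, ': ', COALESCE(JSON_VALUE(content, '$.text_summary'), ''))
    ORDER BY timestamp
  ) AS trace_text
FROM `my-project.agent_analytics.evalbench_agent_events`
WHERE JSON_VALUE(attributes, '$.experiment_id') = @job_id
GROUP BY session_id
```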
with LENGTH(trace_text) > 10. If this collapses to whitespace or to a string of event_type: prefixes, text_summary is missing.
EvalBench result schema variability
evalbench.results and evalbench.scores are shared across NL2SQL and Gemini CLI / agentic flows. Not every row has the fields a semantic judge needs.
Gemini CLI / agentic rows are the primary semantic-judge target (datasets/gemini-cli-tools/ in EvalBench has gemini-cli.evalset.json and gemini-cli-fake.evalset.json).
NL2SQL rows can be imported for reporting joins (Looker view joining EvalBench scores with BQAA scorecard outputs by job_id), but applying LLMAsJudge.hallucination/correctness to them requires a renderable final response / trace text. Skip the synthetic AGENT_COMPLETED row gracefully when no usable response field is present.
Missing tool-call data must not fail the import. Emit zero TOOL_* rows and continue.
Missing nl_prompt is a hard failure — every imported scenario needs a user message.
CLI surface
Two new commands matching the SDK's existing single-level hyphenated pattern (categorical-eval, ontology-build, etc.).
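Proposed invocations (the flags mirror the from_bigquery / materialize parameters above):

```sh
bq-agent-sdk evalbench-import \
  --project-id source-project \
  --evalbench-dataset evalbench \
  --job-id abc123 \
  --target-project target-project \
  --target-dataset agent_analytics \
  --target-table evalbench_agent_events \
  --scores-table evalbench_scores_imported \
  --location US \
  [--write-disposition WRITE_APPEND|WRITE_TRUNCATE]

bq-agent-sdk evalbench-score \
  --project-id target-project \
  --dataset-id agent_analytics \
  --table-id evalbench_agent_events \
  --job-id abc123 \
  --evaluator hallucination \
  --threshold 0.7 \
  --location US \
  [--strict] \
  [--exit-code]
```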
Project flags. EvalBench source tables and BQAA mirror tables can live in different projects (common for cross-team setups where EvalBench writes to a benchmarks project and BQAA reads in an analytics project). --target-project defaults to --project-id so the same-project case stays a one-flag invocation; pass it explicitly when sources and targets differ. evalbench-score only needs --project-id because by that point the data lives in the target project.
--evaluator accepts the prebuilt names: hallucination, correctness, sentiment (mapping to LLMAsJudge.hallucination(...) etc. at evaluators.py:704+).
--exit-code reuses the existing CLI flag pattern from evaluate --exit-code (cli.py:268) so EvalBench scores can gate CI.
Python API example
```python
from bigquery_agent_analytics import Client
from bigquery_agent_analytics.evaluators import LLMAsJudge
from bigquery_agent_analytics.trace import TraceFilter
from bigquery_agent_analytics.evalbench import EvalBenchRun

run = EvalBenchRun.from_bigquery(
    project_id="my-project",
    evalbench_dataset="evalbench",
    job_id="abc123",
    location="US",
)
result = run.materialize(
    target_project="my-project",
    target_dataset="agent_analytics",
    target_table="evalbench_agent_events",
)
print(f"Imported {result.session_count} sessions, {result.score_count} scores")

client = Client(
    project_id="my-project",
    dataset_id="agent_analytics",
    table_id="evalbench_agent_events",  # required so get_session_trace() reads the mirror table
    location="US",
)

# get_session_trace() uses Client.table_id directly (client.py:788–791); it has
# no dataset override.
trace = client.get_session_trace("evalbench:abc123:scenario-1")
print(f"{len(trace.spans)} spans for scenario-1")

# evaluate() does accept a dataset (table-name) override, but with table_id
# already set on the client we can omit it and the call still targets the
# mirror table.
report = client.evaluate(
    evaluator=LLMAsJudge.hallucination(threshold=0.7),
    filters=TraceFilter(experiment_id="abc123"),
    strict=True,
)
print(report.summary())
```
Tests
Unit tests:
tests/test_evalbench_importer.py
test_synthetic_row_columns_match_get_trace_query() — assert rows can be projected by _GET_TRACE_QUERY without missing-column errors. Use a stubbed BigQuery client.
test_synthetic_rows_populate_text_summary_for_judge_trace_text() — assert imported Gemini CLI rows produce non-empty trace_text under the same STRING_AGG(CONCAT(event_type, ': ', COALESCE(JSON_VALUE(content, '$.text_summary'), '')) ORDER BY timestamp) expression the LLM judge query uses, and that LENGTH(trace_text) > 10. Catches the silent "schema-valid but judge sees zero sessions" failure mode.
test_synthetic_rows_populate_response_for_final_response() — assert at least one row per session has JSON_VALUE(content, '$.response') IS NOT NULL, matching evaluators.py:879.
test_experiment_id_in_attributes_json() — assert attributes.$.experiment_id == job_id.
test_session_id_format() — assert evalbench:{job_id}:{scenario_id}.
test_missing_tool_calls_does_not_fail() — NL2SQL rows.
test_missing_final_response_skips_agent_completed_row() — NL2SQL rows without renderable response.
test_nl_prompt_required() — hard failure when missing.
test_repeated_import_is_idempotent_under_write_append() — running materialize() twice with the same job_id and WRITE_APPEND produces the same row count as a single run; rows are deleted by attributes.experiment_id before insert.
test_target_project_can_differ_from_source_project() — from_bigquery(project_id="src") + materialize(target_project="dst", ...) writes to dst.<dataset>.<table>.
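A sketch of the flagship text_summary test, emulating the judge query's aggregation in Python over rows from the hypothetical synthetic_rows() builder sketched earlier (no BigQuery round-trip; the join separator approximates STRING_AGG's default):

```python
import json

def test_synthetic_rows_populate_text_summary_for_judge_trace_text():
    scenario = {
        "scenario_id": "scenario-1",
        "nl_prompt": "List the top 5 products by revenue.",
        "final_response": "The top products are A, B, C, D and E.",
    }
    rows = sorted(
        synthetic_rows("abc123", scenario, "evalbench:orch:gen"),
        key=lambda r: r["timestamp"],  # ORDER BY timestamp
    )
    # Emulate STRING_AGG(CONCAT(event_type, ': ',
    #                    COALESCE(JSON_VALUE(content, '$.text_summary'), ''))).
    trace_text = ",".join(
        f"{r['event_type']}: {json.loads(r['content']).get('text_summary') or ''}"
        for r in rows
    )
    assert len(trace_text) > 10  # the judge query's HAVING clause
    assert "List the top 5 products" in trace_text
```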
Integration tests (gated on BQAA_LIVE=1 env var, same gate pattern as existing live tests):
tests/integration/test_evalbench_live.py
Materialize a fixture EvalBench job into a temp dataset.
Call Client.get_session_trace(...) — assert a non-empty Trace.
Call Client.evaluate(LLMAsJudge.hallucination(...), filters=TraceFilter(experiment_id=job_id), dataset="evalbench_agent_events") — assert at least one per-session score, and report.details["execution_mode"] in {"ai_generate", "ml_generate_text", "api_fallback"}.
Docs
docs/evalbench.md — module overview, mirror-table contract, CLI reference.
examples/evalbench_bridge_demo.py — end-to-end script: import → score → print report.
SDK.md linking out to docs/evalbench.md.
Effort breakdown (P0 = MVP)

| Item | Effort | Priority |
|---|---|---|
| EvalBenchRun.from_bigquery reader | — | — |
| EvalBenchRun.materialize writer + mirror-table DDL | — | — |
| evalbench-import CLI | — | — |
| evalbench-score CLI (thin wrapper around Client.evaluate) | 0.25 wk | P0 |
| Unit + integration tests | 0.5 wk | P0 |
| Docs + Gemini CLI live demo | 0.5 wk | P1 |
| evalbench_scores_imported schema + writer | 0.5 wk | P1 |
| Looker view template joining EvalBench scores with BQAA scorecard outputs | 0.5 wk | — |

MVP total: 2–3 eng-weeks.
Acceptance criteria
bigquery_agent_analytics.evalbench.EvalBenchRun.from_bigquery(...) reads one EvalBench job_id from evalbench.results + evalbench.scores in any BigQuery location, and source queries filter by job_id at the SQL level.
EvalBenchRun.materialize(...) accepts target_project separate from the source project_id (cross-project import) and writes synthetic agent-events rows into a BQAA-owned mirror table (default evalbench_agent_events) — never into the ADK plugin's production agent_events.
Repeated imports of the same job_id are idempotent — no duplicate rows in the mirror table after the second import.
Imported rows preserve job_id (in attributes.experiment_id), scenario id (in session_id), scorer name, score, generated output, final response or generated SQL, and error fields.
Client(project_id=..., dataset_id=..., table_id="evalbench_agent_events", location=...).get_session_trace("evalbench:{job_id}:{scenario_id}") returns a non-empty Trace against the mirror table. (get_session_trace() uses Client.table_id directly — client.py:788–791 — and has no dataset= override, unlike Client.evaluate(...).)
Client.evaluate(LLMAsJudge.hallucination(threshold=0.7), filters=TraceFilter(experiment_id=job_id)) returns an EvaluationReport with at least one per-session score in live integration tests, with report.details["execution_mode"] in {"ai_generate", "ml_generate_text", "api_fallback"}.
Imported rows produce non-empty trace_text under the LLM judge's STRING_AGG(CONCAT(event_type, ': ', COALESCE(JSON_VALUE(content, '$.text_summary'), '')) ORDER BY timestamp) expression, with LENGTH(trace_text) > 10.
bq-agent-sdk evalbench-import and bq-agent-sdk evalbench-score exist and work end-to-end against a real EvalBench job.
docs/evalbench.md documents the mirror-table contract with file/line references back to the queries that depend on it.
One example under examples/ runs end-to-end against datasets/gemini-cli-tools/ from EvalBench.
Importer handles missing tool_call and final_response fields gracefully; missing nl_prompt is a hard error.
EvalBench has a docs-only example showing how to enable BigQuery reporting and then run the BQAA evalbench-import / evalbench-score commands (cross-PR coordination with evalbench#357).
Coordination
Roadmap BigQuery Agent Analytics Roadmap #96 (BQAA Agent Analytics roadmap). This work is not yet on the roadmap; it lands well as a P1 follow-up under the Evaluation Platform track because:
It exercises Client.evaluate(..., dataset=...) and LLMAsJudge over an external trace source, which surfaces sharp edges before the unified evaluation_results schema lands.
It provides a second customer for TraceFilter(experiment_id=...), currently only used by the agent-improvement-cycle path.
Recommend adding it as a P1 row in the roadmap table after Quality Scorecard Phase 1 lands. Owner: Engineer 1 (Evaluation Platform track) once Scorecard Phase 1 is shipped, OR a contributor PR if community capacity shows up first.
Unified evaluation_results schema (BigQuery Agent Analytics Roadmap #96 P0 row 6). EvalBench scorer rows go to importer-owned evalbench_scores_imported for the MVP. Once evaluation_results lands, expose a compatibility view from evalbench_scores_imported into evaluation_results (separate follow-up issue).
EvalBench side (evalbench#357). Pure docs/example contribution for MVP. No EvalBench runtime code change required.
ADK plugin (google-adk-python). No coordination required; this issue explicitly avoids writing to the plugin's agent_events table.
Open questions
Exact content JSON key the prebuilt LLM judges read for the final response. Resolved: content.$.response (evaluators.py:879, mirrored at line 982). _AI_GENERATE_JUDGE_BATCH_QUERY_TEMPLATE picks the most recent non-null JSON_VALUE(content, '$.response') as final_response. Trace extraction also accepts content.text_summary as a fallback.
Mirror-table partitioning. Should evalbench_agent_events be partitioned by DATE(timestamp) (matching the ADK plugin's agent_events)? Probably yes for query performance, but increases setup cost. Recommend default-on with a --no-partition opt-out flag.
write_disposition default and duplicate handling. Repeated imports of the same job_id under naive WRITE_APPEND create duplicate rows that inflate scorecards and break aggregations. The MVP must avoid that. Two acceptable shapes:
(a) Idempotent WRITE_APPEND (recommended default). Before insert, delete rows where JSON_VALUE(attributes, '$.experiment_id') = @job_id. Lets users accumulate distinct jobs in one table without manual cleanup.
(b) WRITE_TRUNCATE. Full reset; only safe when one job per table.
Recommend default = (a). Document --write-disposition WRITE_TRUNCATE for users who want full-reset semantics. Either way, never silently produce duplicates — that is the actual bug to avoid, not the disposition flag itself.
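A sketch of the pre-delete for shape (a), with an illustrative table path:

```sql
-- Run as a parameterized query before the WRITE_APPEND load job so repeated
-- imports of the same job_id replace rather than duplicate rows.
DELETE FROM `target-project.agent_analytics.evalbench_agent_events`
WHERE JSON_VALUE(attributes, '$.experiment_id') = @job_id;
```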
Should Client.evaluate(...) learn an evalbench_run filter shorthand? E.g. filters=TraceFilter(evalbench_job_id=...) rather than the generic experiment_id. Probably no — keep experiment_id as the canonical field and document job_id ↔ experiment_id mapping. Out of scope for this issue, but worth noting.
Mirror-table contract changes must track BQAA query changes. If _GET_TRACE_QUERY, SESSION_SUMMARY_QUERY, or _AI_GENERATE_JUDGE_BATCH_QUERY_TEMPLATE add columns later, the importer schema and tests need updating in lockstep. Suggest a code comment on those queries pointing back to the importer.
Should Client.get_session_trace(...) learn a dataset= override? Currently it uses self.table_id only (client.py:788–791). The MVP works fine by setting Client(table_id="evalbench_agent_events"), but a dataset= parameter would mirror Client.evaluate(...) and let one client read both production agent_events and a mirror table. Out of scope for this issue, but worth a follow-up.
Related
evalbench#357 — EvalBench-side framing of the bridge.