Add EvalBench bridge: import EvalBench BigQuery runs into a BQAA-owned mirror table #97

@caohy1988


Cross-repo context

This is the BQAA-side counterpart to GoogleCloudPlatform/evalbench#357.

EvalBench is the agent benchmark harness. It writes scenario results to evalbench.results and evalbench.scores in BigQuery via its existing reporting/bqstore.py. EvalBench answers "can the agent complete this scenario?"

BQAA is the trace warehouse and semantic evaluation layer. It answers "what happened inside the run, and how does quality trend over runs?"

This issue tracks the work to let BQAA read EvalBench's existing BigQuery output and apply BQAA's evaluation surface (LLMAsJudge, scorecards, CI gates) over those rows — without coupling EvalBench to the ADK plugin's agent_events schema.

The full architectural rationale (pull-first vs push-first, two-direction bridge framing, EvalBench-side phasing) lives in evalbench#357. This issue scopes the SDK-side implementation only.

Why this lives in BQAA

  • The work is read-only against EvalBench's existing BigQuery output. No EvalBench runtime code change is required for the MVP — only docs/examples on their side.
  • The SDK already has the evaluation surface (Client.evaluate(...), LLMAsJudge.hallucination/correctness/sentiment, TraceFilter) and BigQuery query plumbing (location, project_id, dataset_id, labels). Adding an importer is additive.
  • Pull-first avoids the schema-coupling problem of having EvalBench write to the ADK plugin's agent_events table.

Scope

In scope (this issue)

  • A new optional submodule bigquery_agent_analytics.evalbench with an EvalBenchRun class.
  • A BQAA-owned mirror-table writer (synthetic agent_events rows materialized into a separate, importer-owned table).
  • Two CLI commands: evalbench-import and evalbench-score.
  • Tests proving Client.get_session_trace(...) and Client.evaluate(LLMAsJudge.hallucination(...), ...) work against the mirror table.
  • One end-to-end docs example using datasets/gemini-cli-tools/ from EvalBench.
  • Importer-owned evalbench_scores_imported table for EvalBench scorer rows (until unified evaluation_results lands; see BigQuery Agent Analytics Roadmap #96).

Out of scope (this issue)

  • Push-based EvalBench reporter (reporting/bqaa.py in EvalBench). Tracked as a deferred phase in evalbench#357.
  • Direction B (EvalBench reads BQAA agent_events as eval input, or LLMRater mode: ai_generate). Tracked in evalbench#357 Phase 5.
  • Compatibility view from evalbench_scores_imported into unified evaluation_results. Follow-up once BigQuery Agent Analytics Roadmap #96 lands the schema.
  • Modifying Client.evaluate(...) itself. The importer should make existing APIs work, not change them.

Module: bigquery_agent_analytics.evalbench

New optional submodule. Importing it should not require new runtime dependencies beyond what BQAA already pulls in (google-cloud-bigquery, google-genai).

# src/bigquery_agent_analytics/evalbench.py

from dataclasses import dataclass
from typing import Optional

@dataclass
class EvalBenchRun:
    """One EvalBench job materialized as importable BQAA trace rows."""
    project_id: str
    evalbench_dataset: str
    job_id: str
    location: Optional[str] = None  # matches EvalBench BigQueryReporter dataset_location

    @classmethod
    def from_bigquery(
        cls,
        *,
        project_id: str,
        evalbench_dataset: str,
        job_id: str,
        location: Optional[str] = None,
    ) -> "EvalBenchRun":
        """Loads the run's results + scores from BigQuery into memory.

        MVP scope: one job_id at a time, loaded into memory. Source queries
        MUST filter by job_id at the SQL level (WHERE job_id = @job_id) so
        scan cost stays bounded even when evalbench.results grows.

        Future: stream/page large runs without loading everything in memory.
        """
        ...

    def materialize(
        self,
        *,
        target_project: str,
        target_dataset: str,
        target_table: str = "evalbench_agent_events",
        scores_table: str = "evalbench_scores_imported",
        write_disposition: str = "WRITE_APPEND",
    ) -> "MaterializeResult":
        """Writes synthetic BQAA trace rows + scorer rows into BQAA-owned mirror tables.

        Must NOT write into the ADK plugin's production agent_events table.

        Idempotency: WRITE_APPEND will produce duplicate rows on repeated
        imports of the same job_id unless the writer first deletes existing
        rows where attributes.experiment_id = job_id. MVP behavior:
          - WRITE_APPEND  → delete-by-experiment_id before insert (idempotent)
          - WRITE_TRUNCATE → drop and re-create the table (idempotent, full reset)
        See open question #3 for the recommended CI default.
        """
        ...
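The MaterializeResult return type is referenced above but not specified in this issue; a minimal hypothetical sketch, consistent with the result.session_count / result.score_count usage in the Python API example further down, might look like:

```python
# Hypothetical sketch: only session_count and score_count are pinned down by
# this issue (via the Python API example); the other fields are illustrative
# assumptions, not a settled contract.
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MaterializeResult:
    """Summary of one materialize() call against the mirror tables."""
    session_count: int   # distinct session_ids written to the mirror table
    score_count: int     # rows written to evalbench_scores_imported
    rows_written: int = 0  # total synthetic agent-event rows
    # e.g. NL2SQL scenarios skipped for lack of a renderable final response
    skipped_scenarios: list = field(default_factory=list)


result = MaterializeResult(session_count=12, score_count=12, rows_written=36)
print(f"Imported {result.session_count} sessions, {result.score_count} scores")
```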

Mirror-table contract (authoritative)

The mirror table must satisfy BQAA's existing trace and evaluation queries. Verified against:

  • src/bigquery_agent_analytics/client.py:118–139 (_GET_TRACE_QUERY)
  • src/bigquery_agent_analytics/client.py:141–170 (_LIST_TRACES_QUERY)
  • src/bigquery_agent_analytics/evaluators.py:786+ (SESSION_SUMMARY_QUERY)
  • src/bigquery_agent_analytics/trace.py:592–600 (TraceFilter.to_sql_conditions)

Required top-level columns

| Column | Type | Why required |
| --- | --- | --- |
| session_id | STRING | Client.get_session_trace(session_id) and trace grouping |
| event_type | STRING | USER_MESSAGE_RECEIVED / LLM_RESPONSE / AGENT_COMPLETED / TOOL_STARTING / TOOL_COMPLETED |
| timestamp | TIMESTAMP | Trace ordering, start_time / end_time filters |
| agent | STRING | agent_id filter |
| invocation_id | STRING | _GET_TRACE_QUERY projection (NULL acceptable) |
| trace_id | STRING | _GET_TRACE_QUERY projection (NULL acceptable) |
| span_id | STRING | _GET_TRACE_QUERY projection (NULL acceptable) |
| parent_span_id | STRING | _GET_TRACE_QUERY projection (NULL acceptable) |
| user_id | STRING | user_id filter |
| content | JSON | Final-response text + token-usage fallback ($.usage.prompt) |
| content_parts | JSON / ARRAY | _GET_TRACE_QUERY projection (empty array acceptable) |
| attributes | JSON | experiment_id is read as JSON_VALUE(attributes, '$.experiment_id') (trace.py:594). Also holds usage_metadata.prompt_token_count, input_tokens. |
| latency_ms | JSON | SESSION_SUMMARY_QUERY reads $.total_ms and $.time_to_first_token_ms |
| status | STRING | has_error filter, error-rate aggregation |
| error_message | STRING | has_error filter |
| is_truncated | BOOL | _GET_TRACE_QUERY projection (FALSE acceptable) |

Implementation traps

  • content.text_summary is required on every row that should appear in judge trace text. This is the single most important non-obvious requirement. _AI_GENERATE_JUDGE_BATCH_QUERY_TEMPLATE (evaluators.py:865–887, and the parallel ML.GENERATE_TEXT path at line 976) builds trace_text via STRING_AGG(CONCAT(event_type, ': ', COALESCE(JSON_VALUE(content, '$.text_summary'), '')) ORDER BY timestamp) and then filters with HAVING LENGTH(trace_text) > 10. If imported rows omit text_summary, trace_text collapses to event_type: repeated and gets dropped before the judge ever sees it. The mirror table can look schema-valid while Client.evaluate(LLMAsJudge.hallucination(...)) evaluates zero sessions.
  • experiment_id is not a top-level column. It must be written into attributes JSON as {"experiment_id": "<job_id>", ...} so TraceFilter(experiment_id=...) works without code changes (trace.py:594).
  • AGENT_COMPLETED is the minimum viable final-response row. BQAA's trace evaluator checks LLM_RESPONSE first, then falls back to AGENT_COMPLETED (trace_evaluator.py:226–234). Emitting AGENT_COMPLETED alone is sufficient; emitting LLM_RESPONSE too is fine but not required.
  • Final-response JSON key is content.$.response (resolved). The batch LLM judge picks the most recent non-null JSON_VALUE(content, '$.response') as final_response (evaluators.py:879, mirrored at evaluators.py:982). Trace extraction also accepts content.text_summary as a fallback.

Synthetic row construction (per scenario)

Every row that should contribute to judge trace_text must populate content.text_summary. Without it the row is effectively invisible to the LLM judge query.

  • One USER_MESSAGE_RECEIVED row with content.text = nl_prompt AND content.text_summary = nl_prompt.
  • One AGENT_COMPLETED row with content.response = generated_sql or final_response AND content.text_summary = <same string, or a one-line digest>.
  • Optional LLM_RESPONSE row mirroring AGENT_COMPLETED (also populating content.text_summary).
  • Zero or more TOOL_STARTING / TOOL_COMPLETED pairs when EvalBench has tool-call data; populate content.text_summary with "<tool_name>(<args>)" and "<tool_name> -> <result_or_error>" respectively.
  • All rows share session_id = "evalbench:{job_id}:{scenario_id}".
  • All rows have attributes.experiment_id = job_id and attributes.evalbench_scenario_id = scenario_id.
  • All rows have agent = "evalbench:{orchestrator}:{generator}" (read from EvalBench's experiment_config).
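The per-scenario rules above (including the graceful skip for scenarios without a renderable response, and the hard failure on a missing nl_prompt) can be sketched as a pure-Python row builder. This is a hypothetical helper, not the SDK's actual API; the EvalBench field names (nl_prompt, generated_sql, final_response, scenario_id) follow this issue's text, everything else is an assumption:

```python
# Illustrative sketch of synthetic row construction for one scenario.
import json
from datetime import datetime, timedelta, timezone


def build_scenario_rows(job_id, scenario, orchestrator="adk", generator="gemini"):
    """Build the synthetic agent-event rows for one EvalBench scenario."""
    if not scenario.get("nl_prompt"):
        # Hard failure: every imported scenario needs a user message.
        raise ValueError(f"scenario {scenario.get('scenario_id')}: nl_prompt is required")

    session_id = f"evalbench:{job_id}:{scenario['scenario_id']}"
    attributes = json.dumps({
        "experiment_id": job_id,  # lets TraceFilter(experiment_id=...) match
        "evalbench_scenario_id": scenario["scenario_id"],
    })
    t0 = datetime.now(timezone.utc)

    def row(i, event_type, content):
        return {
            "session_id": session_id,
            "agent": f"evalbench:{orchestrator}:{generator}",
            "attributes": attributes,
            # Increasing timestamps keep ORDER BY timestamp deterministic.
            "timestamp": (t0 + timedelta(seconds=i)).isoformat(),
            "event_type": event_type,
            "content": json.dumps(content),
        }

    prompt = scenario["nl_prompt"]
    rows = [row(0, "USER_MESSAGE_RECEIVED",
                {"text": prompt, "text_summary": prompt})]

    # NL2SQL rows without a renderable response: skip the AGENT_COMPLETED
    # row gracefully instead of failing the import.
    response = scenario.get("generated_sql") or scenario.get("final_response")
    if response:
        rows.append(row(1, "AGENT_COMPLETED",
                        {"response": response, "text_summary": response}))
    return rows
```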

Sanity check for implementers. A correctly-imported scenario produces non-empty trace_text under this expression (the same one the judge query uses):

STRING_AGG(
  CONCAT(event_type, ': ', COALESCE(JSON_VALUE(content, '$.text_summary'), '')),
  '\n' ORDER BY timestamp
) AS trace_text

with LENGTH(trace_text) > 10. If this collapses to whitespace or to a string of event_type: prefixes, text_summary is missing.
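The same sanity check can be run in plain Python before touching BigQuery. A hypothetical unit-test helper (not part of the SDK) that mirrors the judge query's expression:

```python
# Pure-Python emulation of the judge query's trace_text expression, usable in
# unit tests to catch the "schema-valid but judge sees zero sessions" failure.
import json


def trace_text(rows):
    """Mirror STRING_AGG(CONCAT(event_type, ': ', COALESCE(
    JSON_VALUE(content, '$.text_summary'), '')) ORDER BY timestamp)."""
    ordered = sorted(rows, key=lambda r: r["timestamp"])
    return "\n".join(
        f'{r["event_type"]}: {json.loads(r["content"]).get("text_summary") or ""}'
        for r in ordered
    )


rows = [
    {"timestamp": 1, "event_type": "USER_MESSAGE_RECEIVED",
     "content": json.dumps({"text_summary": "list my tables"})},
    {"timestamp": 2, "event_type": "AGENT_COMPLETED",
     "content": json.dumps({"text_summary": "SELECT table_name FROM ..."})},
]
assert len(trace_text(rows)) > 10  # survives the judge's HAVING filter
```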

EvalBench result schema variability

evalbench.results and evalbench.scores are shared across NL2SQL and Gemini CLI / agentic flows. Not every row has the fields a semantic judge needs.

  • Gemini CLI / agentic rows are the primary semantic-judge target (datasets/gemini-cli-tools/ in EvalBench has gemini-cli.evalset.json and gemini-cli-fake.evalset.json).
  • NL2SQL rows can be imported for reporting joins (Looker view joining EvalBench scores with BQAA scorecard outputs by job_id), but applying LLMAsJudge.hallucination/correctness to them requires a renderable final response / trace text. Skip the synthetic AGENT_COMPLETED row gracefully when no usable response field is present.
  • Missing tool-call data must not fail the import. Emit zero TOOL_* rows and continue.
  • Missing nl_prompt is a hard failure — every imported scenario needs a user message.

CLI surface

Two new commands matching the SDK's existing single-level hyphenated pattern (categorical-eval, ontology-build, etc.).

bq-agent-sdk evalbench-import \
  --project-id source-project \
  --evalbench-dataset evalbench \
  --job-id abc123 \
  --target-project target-project \
  --target-dataset agent_analytics \
  --target-table evalbench_agent_events \
  --scores-table evalbench_scores_imported \
  --location US \
  [--write-disposition WRITE_APPEND|WRITE_TRUNCATE]

bq-agent-sdk evalbench-score \
  --project-id target-project \
  --dataset-id agent_analytics \
  --table-id evalbench_agent_events \
  --job-id abc123 \
  --evaluator hallucination \
  --threshold 0.7 \
  --location US \
  [--strict] \
  [--exit-code]

Project flags. EvalBench source tables and BQAA mirror tables can live in different projects (common for cross-team setups where EvalBench writes to a benchmarks project and BQAA reads in an analytics project). --target-project defaults to --project-id so the same-project case stays a one-flag invocation; pass it explicitly when sources and targets differ. evalbench-score only needs --project-id because by that point the data lives in the target project.

--evaluator accepts the prebuilt names: hallucination, correctness, sentiment (mapping to LLMAsJudge.hallucination(...) etc. at evaluators.py:704+).

--exit-code reuses the existing CLI flag pattern from evaluate --exit-code (cli.py:268) so EvalBench scores can gate CI.

Python API example

from bigquery_agent_analytics import Client
from bigquery_agent_analytics.evaluators import LLMAsJudge
from bigquery_agent_analytics.trace import TraceFilter
from bigquery_agent_analytics.evalbench import EvalBenchRun

run = EvalBenchRun.from_bigquery(
    project_id="my-project",
    evalbench_dataset="evalbench",
    job_id="abc123",
    location="US",
)

result = run.materialize(
    target_project="my-project",
    target_dataset="agent_analytics",
    target_table="evalbench_agent_events",
)
print(f"Imported {result.session_count} sessions, {result.score_count} scores")

client = Client(
    project_id="my-project",
    dataset_id="agent_analytics",
    table_id="evalbench_agent_events",  # required so get_session_trace() reads the mirror table
    location="US",
)

# get_session_trace() uses Client.table_id directly (client.py:788–791); it has no dataset override.
trace = client.get_session_trace("evalbench:abc123:scenario-1")
print(f"{len(trace.spans)} spans for scenario-1")

# evaluate() does accept a dataset (table-name) override, but with table_id already set on the
# client we can omit it and the call still targets the mirror table.
report = client.evaluate(
    evaluator=LLMAsJudge.hallucination(threshold=0.7),
    filters=TraceFilter(experiment_id="abc123"),
    strict=True,
)
print(report.summary())

Tests

Unit tests:

  • tests/test_evalbench_importer.py
    • test_synthetic_row_columns_match_get_trace_query() — assert rows can be projected by _GET_TRACE_QUERY without missing-column errors. Use a stubbed BigQuery client.
    • test_synthetic_rows_populate_text_summary_for_judge_trace_text() — assert imported Gemini CLI rows produce non-empty trace_text under the same STRING_AGG(CONCAT(event_type, ': ', COALESCE(JSON_VALUE(content, '$.text_summary'), '')) ORDER BY timestamp) expression the LLM judge query uses, and that LENGTH(trace_text) > 10. Catches the silent "schema-valid but judge sees zero sessions" failure mode.
    • test_synthetic_rows_populate_response_for_final_response() — assert at least one row per session has JSON_VALUE(content, '$.response') IS NOT NULL, matching evaluators.py:879.
    • test_experiment_id_in_attributes_json() — assert attributes.$.experiment_id == job_id.
    • test_session_id_format() — assert evalbench:{job_id}:{scenario_id}.
    • test_missing_tool_calls_does_not_fail() — NL2SQL rows.
    • test_missing_final_response_skips_agent_completed_row() — NL2SQL rows without renderable response.
    • test_nl_prompt_required() — hard failure when missing.
    • test_repeated_import_is_idempotent_under_write_append() — running materialize() twice with the same job_id and WRITE_APPEND produces the same row count as a single run; rows are deleted by attributes.experiment_id before insert.
  • test_target_project_can_differ_from_source_project() — from_bigquery(project_id="src") + materialize(target_project="dst", ...) writes to dst.<dataset>.<table>.

Integration tests (gated on BQAA_LIVE=1 env var, same gate pattern as existing live tests):

  • tests/integration/test_evalbench_live.py
    • Materialize a fixture EvalBench job into a temp dataset.
    • Call Client.get_session_trace(...) — assert non-empty Trace.
    • Call Client.evaluate(LLMAsJudge.hallucination(...), filters=TraceFilter(experiment_id=job_id), dataset="evalbench_agent_events") — assert at least one per-session score, and report.details["execution_mode"] in {"ai_generate", "ml_generate_text", "api_fallback"}.

Docs

  • docs/evalbench.md — module overview, mirror-table contract, CLI reference.
  • examples/evalbench_bridge_demo.py — end-to-end script: import → score → print report.
  • One-paragraph addition to SDK.md linking out to docs/evalbench.md.

Effort breakdown (P0 = MVP)

| Priority | Item | Effort |
| --- | --- | --- |
| P0 | Design mapping doc: EvalBench rows → synthetic BQAA trace rows | 0.5 wk |
| P0 | EvalBenchRun.from_bigquery reader | 0.5 wk |
| P0 | EvalBenchRun.materialize writer + mirror-table DDL | 0.5–1 wk |
| P0 | evalbench-import CLI | 0.25 wk |
| P0 | evalbench-score CLI (thin wrapper around Client.evaluate) | 0.25 wk |
| P0 | Unit + integration tests | 0.5 wk |
| P0 | Docs + Gemini CLI live demo | 0.5 wk |
| P1 | evalbench_scores_imported schema + writer | 0.5 wk |
| P1 | Looker view template joining EvalBench scores with BQAA scorecard outputs | 0.5 wk |
MVP total: 2–3 eng-weeks.

Acceptance criteria

  • bigquery_agent_analytics.evalbench.EvalBenchRun.from_bigquery(...) reads one EvalBench job_id from evalbench.results + evalbench.scores in any BigQuery location, and source queries filter by job_id at the SQL level.
  • EvalBenchRun.materialize(...) accepts target_project separate from the source project_id (cross-project import) and writes synthetic agent-events rows into a BQAA-owned mirror table (default evalbench_agent_events) — never into the ADK plugin's production agent_events.
  • Repeated imports of the same job_id are idempotent — no duplicate rows in the mirror table after the second import.
  • Imported rows preserve job_id (in attributes.experiment_id), scenario id (in session_id), scorer name, score, generated output, final response or generated SQL, and error fields.
  • Client(project_id=..., dataset_id=..., table_id="evalbench_agent_events", location=...).get_session_trace("evalbench:{job_id}:{scenario_id}") returns a non-empty Trace against the mirror table. (get_session_trace() uses Client.table_id directly — client.py:788–791 — and has no dataset= override, unlike Client.evaluate(...).)
  • Client.evaluate(LLMAsJudge.hallucination(threshold=0.7), filters=TraceFilter(experiment_id=job_id)) returns an EvaluationReport with at least one per-session score in live integration tests, with report.details["execution_mode"] in {"ai_generate", "ml_generate_text", "api_fallback"}.
  • Imported rows produce non-empty trace_text under the LLM judge's STRING_AGG(CONCAT(event_type, ': ', COALESCE(JSON_VALUE(content, '$.text_summary'), ''))) expression, with LENGTH(trace_text) > 10.
  • bq-agent-sdk evalbench-import and bq-agent-sdk evalbench-score exist and work end-to-end against a real EvalBench job.
  • docs/evalbench.md documents the mirror-table contract with file/line references back to the queries that depend on it.
  • One example under examples/ runs end-to-end against datasets/gemini-cli-tools/ from EvalBench.
  • Importer handles missing tool_call and final_response fields gracefully; missing nl_prompt is a hard error.
  • EvalBench has a docs-only example showing how to enable BigQuery reporting and then run the BQAA evalbench-import / evalbench-score commands (cross-PR coordination with evalbench#357).

Coordination

  • Roadmap BigQuery Agent Analytics Roadmap #96 (BQAA Agent Analytics roadmap). This work is not yet on the roadmap; it lands well as a P1 follow-up under the Evaluation Platform track because:

    • It exercises Client.evaluate(..., dataset=...) and LLMAsJudge over an external trace source, which surfaces sharp edges before the unified evaluation_results schema lands.
    • It provides a second customer for TraceFilter(experiment_id=...), currently only used by the agent-improvement-cycle path.
    • Recommend adding it as a P1 row in the roadmap table after Quality Scorecard Phase 1 lands. Owner: Engineer 1 (Evaluation Platform track) once Scorecard Phase 1 is shipped, OR a contributor PR if community capacity shows up first.
  • Unified evaluation_results schema (BigQuery Agent Analytics Roadmap #96 P0 row 6). EvalBench scorer rows go to importer-owned evalbench_scores_imported for the MVP. Once evaluation_results lands, expose a compatibility view from evalbench_scores_imported into evaluation_results (separate follow-up issue).

  • EvalBench side (evalbench#357). Pure docs/example contribution for MVP. No EvalBench runtime code change required.

  • ADK plugin (google-adk-python). No coordination required; this issue explicitly avoids writing to the plugin's agent_events table.

Open questions

  1. Exact content JSON key the prebuilt LLM judges read for the final response. Resolved: content.$.response (evaluators.py:879, mirrored at line 982). _AI_GENERATE_JUDGE_BATCH_QUERY_TEMPLATE picks the most recent non-null JSON_VALUE(content, '$.response') as final_response. Trace extraction also accepts content.text_summary as a fallback.

  2. Mirror-table partitioning. Should evalbench_agent_events be partitioned by DATE(timestamp) (matching the ADK plugin's agent_events)? Probably yes for query performance, but increases setup cost. Recommend default-on with a --no-partition opt-out flag.

  3. write_disposition default and duplicate handling. Repeated imports of the same job_id under naive WRITE_APPEND create duplicate rows that inflate scorecards and break aggregations. The MVP must avoid that. Two acceptable shapes:

    • (a) Idempotent WRITE_APPEND (recommended default). Before insert, delete rows where JSON_VALUE(attributes, '$.experiment_id') = @job_id. Lets users accumulate distinct jobs in one table without manual cleanup.
    • (b) WRITE_TRUNCATE. Full reset; only safe when one job per table.

    Recommend default = (a). Document --write-disposition WRITE_TRUNCATE for users who want full-reset semantics. Either way, never silently produce duplicates — that is the actual bug to avoid, not the disposition flag itself.
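Option (a) can be sketched as a parameterized DELETE built ahead of each append. The predicate matches the one TraceFilter reads; the helper name and @job_id parameter name are assumptions for illustration:

```python
# Sketch of idempotent WRITE_APPEND: delete the job's existing rows, then insert.
def delete_existing_job_sql(project, dataset, table):
    """DELETE rows for a job_id before re-importing it (idempotent append)."""
    return (
        f"DELETE FROM `{project}.{dataset}.{table}` "
        "WHERE JSON_VALUE(attributes, '$.experiment_id') = @job_id"
    )


sql = delete_existing_job_sql("my-project", "agent_analytics", "evalbench_agent_events")
# Pass job_id as a bound query parameter rather than interpolating it into the
# SQL string, so job ids containing quotes cannot break the statement.
```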

  4. Should Client.evaluate(...) learn an evalbench_run filter shorthand? E.g. filters=TraceFilter(evalbench_job_id=...) rather than the generic experiment_id. Probably no — keep experiment_id as the canonical field and document job_id ↔ experiment_id mapping. Out of scope for this issue, but worth noting.

  5. Mirror-table contract changes must track BQAA query changes. If _GET_TRACE_QUERY, SESSION_SUMMARY_QUERY, or _AI_GENERATE_JUDGE_BATCH_QUERY_TEMPLATE add columns later, the importer schema and tests need updating in lockstep. Suggest a code comment on those queries pointing back to the importer.

  6. Should Client.get_session_trace(...) learn a dataset= override? Currently it uses self.table_id only (client.py:788–791). The MVP works fine by setting Client(table_id="evalbench_agent_events"), but a dataset= parameter would mirror Client.evaluate(...) and let one client read both production agent_events and a mirror table. Out of scope for this issue, but worth a follow-up.
