Add EvalBench bridge: import EvalBench BigQuery runs into a BQAA-owned mirror table #97

@caohy1988


Cross-repo context

This is the BQAA-side counterpart to GoogleCloudPlatform/evalbench#357.

EvalBench is the agent benchmark harness. It writes scenario results to evalbench.results and evalbench.scores in BigQuery via its existing reporting/bqstore.py. EvalBench answers "can the agent complete this scenario?"

BQAA is the trace warehouse and semantic evaluation layer. It answers "what happened inside the run, and how does quality trend over runs?"

This issue tracks the work to let BQAA read EvalBench's existing BigQuery output and apply BQAA's evaluation surface (LLMAsJudge, scorecards, CI gates) over those rows — without coupling EvalBench to the ADK plugin's agent_events schema.

The full architectural rationale (pull-first vs push-first, two-direction bridge framing, EvalBench-side phasing) lives in evalbench#357. This issue scopes the SDK-side implementation only.

Why this lives in BQAA

  • The work is read-only against EvalBench's existing BigQuery output. No EvalBench runtime code change is required for the MVP — only docs/examples on their side.
  • The SDK already has the evaluation surface (Client.evaluate(...), LLMAsJudge.hallucination/correctness/sentiment, TraceFilter) and BigQuery query plumbing (location, project_id, dataset_id, labels). Adding an importer is additive.
  • Pull-first avoids the schema-coupling problem of having EvalBench write to the ADK plugin's agent_events table.

Scope

In scope (this issue)

  • A new optional submodule bigquery_agent_analytics.evalbench with an EvalBenchRun class.
  • A BQAA-owned mirror-table writer (synthetic agent_events rows materialized into a separate, importer-owned table).
  • Two CLI commands: evalbench-import and evalbench-score.
  • Tests proving Client.get_session_trace(...) and Client.evaluate(LLMAsJudge.hallucination(...), ...) work against the mirror table.
  • One end-to-end docs example using datasets/gemini-cli-tools/ from EvalBench.
  • Importer-owned evalbench_scores_imported table for EvalBench scorer rows (until unified evaluation_results lands; see BigQuery Agent Analytics Roadmap #96).

Out of scope (this issue)

  • Push-based EvalBench reporter (reporting/bqaa.py in EvalBench). Tracked as a deferred phase in evalbench#357.
  • Direction B (EvalBench reads BQAA agent_events as eval input, or LLMRater mode: ai_generate). Tracked in evalbench#357 Phase 5.
  • Compatibility view from evalbench_scores_imported into unified evaluation_results. Follow-up once BigQuery Agent Analytics Roadmap #96 lands the schema.
  • Modifying Client.evaluate(...) itself. The importer should make existing APIs work, not change them.

Module: bigquery_agent_analytics.evalbench

New optional submodule. Importing it should not require new runtime dependencies beyond what BQAA already pulls in (google-cloud-bigquery, google-genai).

# src/bigquery_agent_analytics/evalbench.py

from dataclasses import dataclass
from typing import Optional

@dataclass
class EvalBenchRun:
    """One EvalBench job materialized as importable BQAA trace rows."""
    project_id: str
    evalbench_dataset: str
    job_id: str
    location: Optional[str] = None  # matches EvalBench BigQueryReporter dataset_location

    @classmethod
    def from_bigquery(
        cls,
        *,
        project_id: str,
        evalbench_dataset: str,
        job_id: str,
        location: Optional[str] = None,
    ) -> "EvalBenchRun":
        """Loads the run's results + scores from BigQuery into memory.

        MVP scope: one job_id at a time, loaded into memory. Source queries
        MUST filter by job_id at the SQL level (WHERE job_id = @job_id) so
        scan cost stays bounded even when evalbench.results grows.

        Future: stream/page large runs without loading everything in memory.
        """
        ...

    def materialize(
        self,
        *,
        target_project: str,
        target_dataset: str,
        target_table: str = "evalbench_agent_events",
        scores_table: str = "evalbench_scores_imported",
        write_disposition: str = "WRITE_APPEND",
    ) -> "MaterializeResult":
        """Writes synthetic BQAA trace rows + scorer rows into BQAA-owned mirror tables.

        Must NOT write into the ADK plugin's production agent_events table.

        Idempotency: WRITE_APPEND will produce duplicate rows on repeated
        imports of the same job_id unless the writer first deletes existing
        rows where attributes.experiment_id = job_id. MVP behavior:
          - WRITE_APPEND  → delete-by-experiment_id before insert (idempotent)
          - WRITE_TRUNCATE → drop and re-create the table (idempotent, full reset)
        See open question #3 for the recommended CI default.
        """
        ...
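The MaterializeResult return type is referenced above but not specified in this issue; a minimal hypothetical sketch, consistent with the result.session_count / result.score_count usage in the Python API example further down, might look like:

```python
# Hypothetical sketch: only session_count and score_count are pinned down by
# this issue (via the Python API example); the other fields are illustrative
# assumptions, not a settled contract.
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MaterializeResult:
    """Summary of one materialize() call against the mirror tables."""
    session_count: int   # distinct session_ids written to the mirror table
    score_count: int     # rows written to evalbench_scores_imported
    rows_written: int = 0  # total synthetic agent-event rows
    # e.g. NL2SQL scenarios skipped for lack of a renderable final response
    skipped_scenarios: list = field(default_factory=list)


result = MaterializeResult(session_count=12, score_count=12, rows_written=36)
print(f"Imported {result.session_count} sessions, {result.score_count} scores")
```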

Mirror-table contract (authoritative)

The mirror table must satisfy BQAA's existing trace and evaluation queries. Verified against:

  • src/bigquery_agent_analytics/client.py:118–139 (_GET_TRACE_QUERY)
  • src/bigquery_agent_analytics/client.py:141–170 (_LIST_TRACES_QUERY)
  • src/bigquery_agent_analytics/evaluators.py:786+ (SESSION_SUMMARY_QUERY)
  • src/bigquery_agent_analytics/trace.py:592–600 (TraceFilter.to_sql_conditions)

Required top-level columns

| Column | Type | Why required |
| --- | --- | --- |
| session_id | STRING | Client.get_session_trace(session_id) and trace grouping |
| event_type | STRING | USER_MESSAGE_RECEIVED / LLM_RESPONSE / AGENT_COMPLETED / TOOL_STARTING / TOOL_COMPLETED |
| timestamp | TIMESTAMP | Trace ordering, start_time / end_time filters |
| agent | STRING | agent_id filter |
| invocation_id | STRING | _GET_TRACE_QUERY projection (NULL acceptable) |
| trace_id | STRING | _GET_TRACE_QUERY projection (NULL acceptable) |
| span_id | STRING | _GET_TRACE_QUERY projection (NULL acceptable) |
| parent_span_id | STRING | _GET_TRACE_QUERY projection (NULL acceptable) |
| user_id | STRING | user_id filter |
| content | JSON | Final-response text + token-usage fallback ($.usage.prompt) |
| content_parts | JSON / ARRAY | _GET_TRACE_QUERY projection (empty array acceptable) |
| attributes | JSON | experiment_id is read as JSON_VALUE(attributes, '$.experiment_id') (trace.py:594). Also holds usage_metadata.prompt_token_count, input_tokens. |
| latency_ms | JSON | SESSION_SUMMARY_QUERY reads $.total_ms and $.time_to_first_token_ms |
| status | STRING | has_error filter, error-rate aggregation |
| error_message | STRING | has_error filter |
| is_truncated | BOOL | _GET_TRACE_QUERY projection (FALSE acceptable) |

Implementation traps

  • content.text_summary is required on every row that should appear in judge trace text. This is the single most important non-obvious requirement. _AI_GENERATE_JUDGE_BATCH_QUERY_TEMPLATE (evaluators.py:865–887, and the parallel ML.GENERATE_TEXT path at line 976) builds trace_text via STRING_AGG(CONCAT(event_type, ': ', COALESCE(JSON_VALUE(content, '$.text_summary'), '')) ORDER BY timestamp) and then filters with HAVING LENGTH(trace_text) > 10. If imported rows omit text_summary, trace_text collapses to event_type: repeated and gets dropped before the judge ever sees it. The mirror table can look schema-valid while Client.evaluate(LLMAsJudge.hallucination(...)) evaluates zero sessions.
  • experiment_id is not a top-level column. It must be written into attributes JSON as {"experiment_id": "<job_id>", ...} so TraceFilter(experiment_id=...) works without code changes (trace.py:594).
  • AGENT_COMPLETED is the minimum viable final-response row. BQAA's trace evaluator checks LLM_RESPONSE first, then falls back to AGENT_COMPLETED (trace_evaluator.py:226–234). Emitting AGENT_COMPLETED alone is sufficient; emitting LLM_RESPONSE too is fine but not required.
  • Final-response JSON key is content.$.response (resolved). The batch LLM judge picks the most recent non-null JSON_VALUE(content, '$.response') as final_response (evaluators.py:879, mirrored at evaluators.py:982). Trace extraction also accepts content.text_summary as a fallback.

Synthetic row construction (per scenario)

Every row that should contribute to judge trace_text must populate content.text_summary. Without it the row is effectively invisible to the LLM judge query.

  • One USER_MESSAGE_RECEIVED row with content.text = nl_prompt AND content.text_summary = nl_prompt.
  • One AGENT_COMPLETED row with content.response = generated_sql or final_response AND content.text_summary = <same string, or a one-line digest>.
  • Optional LLM_RESPONSE row mirroring AGENT_COMPLETED (also populating content.text_summary).
  • Zero or more TOOL_STARTING / TOOL_COMPLETED pairs when EvalBench has tool-call data; populate content.text_summary with "<tool_name>(<args>)" and "<tool_name> -> <result_or_error>" respectively.
  • All rows share session_id = "evalbench:{job_id}:{scenario_id}".
  • All rows have attributes.experiment_id = job_id and attributes.evalbench_scenario_id = scenario_id.
  • All rows have agent = "evalbench:{orchestrator}:{generator}" (read from EvalBench's experiment_config).
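The per-scenario rules above (including the graceful skip for scenarios without a renderable response, and the hard failure on a missing nl_prompt) can be sketched as a pure-Python row builder. This is a hypothetical helper, not the SDK's actual API; the EvalBench field names (nl_prompt, generated_sql, final_response, scenario_id) follow this issue's text, everything else is an assumption:

```python
# Illustrative sketch of synthetic row construction for one scenario.
import json
from datetime import datetime, timedelta, timezone


def build_scenario_rows(job_id, scenario, orchestrator="adk", generator="gemini"):
    """Build the synthetic agent-event rows for one EvalBench scenario."""
    if not scenario.get("nl_prompt"):
        # Hard failure: every imported scenario needs a user message.
        raise ValueError(f"scenario {scenario.get('scenario_id')}: nl_prompt is required")

    session_id = f"evalbench:{job_id}:{scenario['scenario_id']}"
    attributes = json.dumps({
        "experiment_id": job_id,  # lets TraceFilter(experiment_id=...) match
        "evalbench_scenario_id": scenario["scenario_id"],
    })
    t0 = datetime.now(timezone.utc)

    def row(i, event_type, content):
        return {
            "session_id": session_id,
            "agent": f"evalbench:{orchestrator}:{generator}",
            "attributes": attributes,
            # Increasing timestamps keep ORDER BY timestamp deterministic.
            "timestamp": (t0 + timedelta(seconds=i)).isoformat(),
            "event_type": event_type,
            "content": json.dumps(content),
        }

    prompt = scenario["nl_prompt"]
    rows = [row(0, "USER_MESSAGE_RECEIVED",
                {"text": prompt, "text_summary": prompt})]

    # NL2SQL rows without a renderable response: skip the AGENT_COMPLETED
    # row gracefully instead of failing the import.
    response = scenario.get("generated_sql") or scenario.get("final_response")
    if response:
        rows.append(row(1, "AGENT_COMPLETED",
                        {"response": response, "text_summary": response}))
    return rows
```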

Sanity check for implementers. A correctly-imported scenario produces non-empty trace_text under this expression (the same one the judge query uses):

STRING_AGG(
  CONCAT(event_type, ': ', COALESCE(JSON_VALUE(content, '$.text_summary'), '')),
  '\n' ORDER BY timestamp
) AS trace_text

with LENGTH(trace_text) > 10. If this collapses to whitespace or to a string of event_type: prefixes, text_summary is missing.
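The same sanity check can be run in plain Python before touching BigQuery. A hypothetical unit-test helper (not part of the SDK) that mirrors the judge query's expression:

```python
# Pure-Python emulation of the judge query's trace_text expression, usable in
# unit tests to catch the "schema-valid but judge sees zero sessions" failure.
import json


def trace_text(rows):
    """Mirror STRING_AGG(CONCAT(event_type, ': ', COALESCE(
    JSON_VALUE(content, '$.text_summary'), '')) ORDER BY timestamp)."""
    ordered = sorted(rows, key=lambda r: r["timestamp"])
    return "\n".join(
        f'{r["event_type"]}: {json.loads(r["content"]).get("text_summary") or ""}'
        for r in ordered
    )


rows = [
    {"timestamp": 1, "event_type": "USER_MESSAGE_RECEIVED",
     "content": json.dumps({"text_summary": "list my tables"})},
    {"timestamp": 2, "event_type": "AGENT_COMPLETED",
     "content": json.dumps({"text_summary": "SELECT table_name FROM ..."})},
]
assert len(trace_text(rows)) > 10  # survives the judge's HAVING filter
```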

EvalBench result schema variability

evalbench.results and evalbench.scores are shared across NL2SQL and Gemini CLI / agentic flows. Not every row has the fields a semantic judge needs.

  • Gemini CLI / agentic rows are the primary semantic-judge target (datasets/gemini-cli-tools/ in EvalBench has gemini-cli.evalset.json and gemini-cli-fake.evalset.json).
  • NL2SQL rows can be imported for reporting joins (Looker view joining EvalBench scores with BQAA scorecard outputs by job_id), but applying LLMAsJudge.hallucination/correctness to them requires a renderable final response / trace text. Skip the synthetic AGENT_COMPLETED row gracefully when no usable response field is present.
  • Missing tool-call data must not fail the import. Emit zero TOOL_* rows and continue.
  • Missing nl_prompt is a hard failure — every imported scenario needs a user message.

CLI surface

Two new commands matching the SDK's existing single-level hyphenated pattern (categorical-eval, ontology-build, etc.).

bq-agent-sdk evalbench-import \
  --project-id source-project \
  --evalbench-dataset evalbench \
  --job-id abc123 \
  --target-project target-project \
  --target-dataset agent_analytics \
  --target-table evalbench_agent_events \
  --scores-table evalbench_scores_imported \
  --location US \
  [--write-disposition WRITE_APPEND|WRITE_TRUNCATE]

bq-agent-sdk evalbench-score \
  --project-id target-project \
  --dataset-id agent_analytics \
  --table-id evalbench_agent_events \
  --job-id abc123 \
  --evaluator hallucination \
  --threshold 0.7 \
  --location US \
  [--strict] \
  [--exit-code]

Project flags. EvalBench source tables and BQAA mirror tables can live in different projects (common for cross-team setups where EvalBench writes to a benchmarks project and BQAA reads in an analytics project). --target-project defaults to --project-id so the same-project case stays a one-flag invocation; pass it explicitly when sources and targets differ. evalbench-score only needs --project-id because by that point the data lives in the target project.

--evaluator accepts the prebuilt names: hallucination, correctness, sentiment (mapping to LLMAsJudge.hallucination(...) etc. at evaluators.py:704+).

--exit-code reuses the existing CLI flag pattern from evaluate --exit-code (cli.py:268) so EvalBench scores can gate CI.

Python API example

from bigquery_agent_analytics import Client
from bigquery_agent_analytics.evaluators import LLMAsJudge
from bigquery_agent_analytics.trace import TraceFilter
from bigquery_agent_analytics.evalbench import EvalBenchRun

run = EvalBenchRun.from_bigquery(
    project_id="my-project",
    evalbench_dataset="evalbench",
    job_id="abc123",
    location="US",
)

result = run.materialize(
    target_project="my-project",
    target_dataset="agent_analytics",
    target_table="evalbench_agent_events",
)
print(f"Imported {result.session_count} sessions, {result.score_count} scores")

client = Client(
    project_id="my-project",
    dataset_id="agent_analytics",
    table_id="evalbench_agent_events",  # required so get_session_trace() reads the mirror table
    location="US",
)

# get_session_trace() uses Client.table_id directly (client.py:788–791); it has no dataset override.
trace = client.get_session_trace("evalbench:abc123:scenario-1")
print(f"{len(trace.spans)} spans for scenario-1")

# evaluate() does accept a dataset (table-name) override, but with table_id already set on the
# client we can omit it and the call still targets the mirror table.
report = client.evaluate(
    evaluator=LLMAsJudge.hallucination(threshold=0.7),
    filters=TraceFilter(experiment_id="abc123"),
    strict=True,
)
print(report.summary())

Tests

Unit tests:

  • tests/test_evalbench_importer.py
    • test_synthetic_row_columns_match_get_trace_query() — assert rows can be projected by _GET_TRACE_QUERY without missing-column errors. Use a stubbed BigQuery client.
    • test_synthetic_rows_populate_text_summary_for_judge_trace_text() — assert imported Gemini CLI rows produce non-empty trace_text under the same STRING_AGG(CONCAT(event_type, ': ', COALESCE(JSON_VALUE(content, '$.text_summary'), '')) ORDER BY timestamp) expression the LLM judge query uses, and that LENGTH(trace_text) > 10. Catches the silent "schema-valid but judge sees zero sessions" failure mode.
    • test_synthetic_rows_populate_response_for_final_response() — assert at least one row per session has JSON_VALUE(content, '$.response') IS NOT NULL, matching evaluators.py:879.
    • test_experiment_id_in_attributes_json() — assert attributes.$.experiment_id == job_id.
    • test_session_id_format() — assert evalbench:{job_id}:{scenario_id}.
    • test_missing_tool_calls_does_not_fail() — NL2SQL rows.
    • test_missing_final_response_skips_agent_completed_row() — NL2SQL rows without renderable response.
    • test_nl_prompt_required() — hard failure when missing.
    • test_repeated_import_is_idempotent_under_write_append() — running materialize() twice with the same job_id and WRITE_APPEND produces the same row count as a single run; rows are deleted by attributes.experiment_id before insert.
  • test_target_project_can_differ_from_source_project() — from_bigquery(project_id="src") + materialize(target_project="dst", ...) writes to dst.<dataset>.<table>.

Integration tests (gated on BQAA_LIVE=1 env var, same gate pattern as existing live tests):

  • tests/integration/test_evalbench_live.py
    • Materialize a fixture EvalBench job into a temp dataset.
    • Call Client.get_session_trace(...) — assert non-empty Trace.
    • Call Client.evaluate(LLMAsJudge.hallucination(...), filters=TraceFilter(experiment_id=job_id), dataset="evalbench_agent_events") — assert at least one per-session score, and report.details["execution_mode"] in {"ai_generate", "ml_generate_text", "api_fallback"}.

Docs

  • docs/evalbench.md — module overview, mirror-table contract, CLI reference.
  • examples/evalbench_bridge_demo.py — end-to-end script: import → score → print report.
  • One-paragraph addition to SDK.md linking out to docs/evalbench.md.

Effort breakdown (P0 = MVP)

| Priority | Item | Effort |
| --- | --- | --- |
| P0 | Design mapping doc: EvalBench rows → synthetic BQAA trace rows | 0.5 wk |
| P0 | EvalBenchRun.from_bigquery reader | 0.5 wk |
| P0 | EvalBenchRun.materialize writer + mirror-table DDL | 0.5–1 wk |
| P0 | evalbench-import CLI | 0.25 wk |
| P0 | evalbench-score CLI (thin wrapper around Client.evaluate) | 0.25 wk |
| P0 | Unit + integration tests | 0.5 wk |
| P0 | Docs + Gemini CLI live demo | 0.5 wk |
| P1 | evalbench_scores_imported schema + writer | 0.5 wk |
| P1 | Looker view template joining EvalBench scores with BQAA scorecard outputs | 0.5 wk |
MVP total: 2–3 eng-weeks.

Acceptance criteria

  • bigquery_agent_analytics.evalbench.EvalBenchRun.from_bigquery(...) reads one EvalBench job_id from evalbench.results + evalbench.scores in any BigQuery location, and source queries filter by job_id at the SQL level.
  • EvalBenchRun.materialize(...) accepts target_project separate from the source project_id (cross-project import) and writes synthetic agent-events rows into a BQAA-owned mirror table (default evalbench_agent_events) — never into the ADK plugin's production agent_events.
  • Repeated imports of the same job_id are idempotent — no duplicate rows in the mirror table after the second import.
  • Imported rows preserve job_id (in attributes.experiment_id), scenario id (in session_id), scorer name, score, generated output, final response or generated SQL, and error fields.
  • Client(project_id=..., dataset_id=..., table_id="evalbench_agent_events", location=...).get_session_trace("evalbench:{job_id}:{scenario_id}") returns a non-empty Trace against the mirror table. (get_session_trace() uses Client.table_id directly — client.py:788–791 — and has no dataset= override, unlike Client.evaluate(...).)
  • Client.evaluate(LLMAsJudge.hallucination(threshold=0.7), filters=TraceFilter(experiment_id=job_id)) returns an EvaluationReport with at least one per-session score in live integration tests, with report.details["execution_mode"] in {"ai_generate", "ml_generate_text", "api_fallback"}.
  • Imported rows produce non-empty trace_text under the LLM judge's STRING_AGG(CONCAT(event_type, ': ', COALESCE(JSON_VALUE(content, '$.text_summary'), ''))) expression, with LENGTH(trace_text) > 10.
  • bq-agent-sdk evalbench-import and bq-agent-sdk evalbench-score exist and work end-to-end against a real EvalBench job.
  • docs/evalbench.md documents the mirror-table contract with file/line references back to the queries that depend on it.
  • One example under examples/ runs end-to-end against datasets/gemini-cli-tools/ from EvalBench.
  • Importer handles missing tool_call and final_response fields gracefully; missing nl_prompt is a hard error.
  • EvalBench has a docs-only example showing how to enable BigQuery reporting and then run the BQAA evalbench-import / evalbench-score commands (cross-PR coordination with evalbench#357).

Coordination

  • Roadmap BigQuery Agent Analytics Roadmap #96 (BQAA Agent Analytics roadmap). This work is not yet on the roadmap; it lands well as a P1 follow-up under the Evaluation Platform track because:

    • It exercises Client.evaluate(..., dataset=...) and LLMAsJudge over an external trace source, which surfaces sharp edges before the unified evaluation_results schema lands.
    • It provides a second customer for TraceFilter(experiment_id=...), currently only used by the agent-improvement-cycle path.
    • Recommend adding it as a P1 row in the roadmap table after Quality Scorecard Phase 1 lands. Owner: Engineer 1 (Evaluation Platform track) once Scorecard Phase 1 is shipped, OR a contributor PR if community capacity shows up first.
  • Unified evaluation_results schema (BigQuery Agent Analytics Roadmap #96 P0 row 6). EvalBench scorer rows go to importer-owned evalbench_scores_imported for the MVP. Once evaluation_results lands, expose a compatibility view from evalbench_scores_imported into evaluation_results (separate follow-up issue).

  • EvalBench side (evalbench#357). Pure docs/example contribution for MVP. No EvalBench runtime code change required.

  • ADK plugin (google-adk-python). No coordination required; this issue explicitly avoids writing to the plugin's agent_events table.

Open questions

  1. Exact content JSON key the prebuilt LLM judges read for the final response. Resolved: content.$.response (evaluators.py:879, mirrored at line 982). _AI_GENERATE_JUDGE_BATCH_QUERY_TEMPLATE picks the most recent non-null JSON_VALUE(content, '$.response') as final_response. Trace extraction also accepts content.text_summary as a fallback.

  2. Mirror-table partitioning. Should evalbench_agent_events be partitioned by DATE(timestamp) (matching the ADK plugin's agent_events)? Probably yes for query performance, but increases setup cost. Recommend default-on with a --no-partition opt-out flag.

  3. write_disposition default and duplicate handling. Repeated imports of the same job_id under naive WRITE_APPEND create duplicate rows that inflate scorecards and break aggregations. The MVP must avoid that. Two acceptable shapes:

    • (a) Idempotent WRITE_APPEND (recommended default). Before insert, delete rows where JSON_VALUE(attributes, '$.experiment_id') = @job_id. Lets users accumulate distinct jobs in one table without manual cleanup.
    • (b) WRITE_TRUNCATE. Full reset; only safe when one job per table.

    Recommend default = (a). Document --write-disposition WRITE_TRUNCATE for users who want full-reset semantics. Either way, never silently produce duplicates — that is the actual bug to avoid, not the disposition flag itself.
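Option (a) can be sketched as a parameterized DELETE built ahead of each append. The predicate matches the one TraceFilter reads; the helper name and @job_id parameter name are assumptions for illustration:

```python
# Sketch of idempotent WRITE_APPEND: delete the job's existing rows, then insert.
def delete_existing_job_sql(project, dataset, table):
    """DELETE rows for a job_id before re-importing it (idempotent append)."""
    return (
        f"DELETE FROM `{project}.{dataset}.{table}` "
        "WHERE JSON_VALUE(attributes, '$.experiment_id') = @job_id"
    )


sql = delete_existing_job_sql("my-project", "agent_analytics", "evalbench_agent_events")
# Pass job_id as a bound query parameter rather than interpolating it into the
# SQL string, so job ids containing quotes cannot break the statement.
```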

  4. Should Client.evaluate(...) learn an evalbench_run filter shorthand? E.g. filters=TraceFilter(evalbench_job_id=...) rather than the generic experiment_id. Probably no — keep experiment_id as the canonical field and document job_id ↔ experiment_id mapping. Out of scope for this issue, but worth noting.

  5. Mirror-table contract changes must track BQAA query changes. If _GET_TRACE_QUERY, SESSION_SUMMARY_QUERY, or _AI_GENERATE_JUDGE_BATCH_QUERY_TEMPLATE add columns later, the importer schema and tests need updating in lockstep. Suggest a code comment on those queries pointing back to the importer.

  6. Should Client.get_session_trace(...) learn a dataset= override? Currently it uses self.table_id only (client.py:788–791). The MVP works fine by setting Client(table_id="evalbench_agent_events"), but a dataset= parameter would mirror Client.evaluate(...) and let one client read both production agent_events and a mirror table. Out of scope for this issue, but worth a follow-up.
