
Commit ed76ad1

Unify one-sided & side-by-side performance metrics in PerformanceEvaluator, but don't add new metrics to MultiTrialPerformanceEvaluator yet
Replaced the obsolete criteria-list LLMAsJudge implementation with PerformanceEvaluator, which natively folds the Tone, Faithfulness, Correctness, and Efficiency evaluations.

- Decoupled the system and performance modules, leaving system_evaluator.py purely for SystemEvaluator.
- Kept a backwards-compatible LLMAsJudge subclass in evaluators.py exposing the required static factories for correctness, hallucination, and sentiment.
- Removed the criteria-list BQML execution code from client.py and deleted the legacy _criteria and _JudgeCriterion list validations throughout the test suites.
- Fixed Jupyter event-loop conflicts by detecting an already-running asyncio event loop inside Client._evaluate_performance.
- Refactored strip_markdown_fences in utils.py to cleanly drop trailing prose after a fenced block's closing backticks.
- Verified that all 1,997 collected unit tests pass.

TAG=agy CONV=bf5607ce-a7fc-4a29-a7fb-c6074580e613
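The Jupyter fix called out above addresses a standard pitfall: `asyncio.run()` raises "RuntimeError: asyncio.run() cannot be called from a running event loop" inside notebooks. A minimal sketch of the detection pattern described — the helper name is illustrative; the actual logic lives inside `Client._evaluate_performance`:

```python
import asyncio
import concurrent.futures
from typing import Any, Coroutine

def run_coroutine_blocking(coro: Coroutine[Any, Any, Any]) -> Any:
    """Run a coroutine to completion from sync code, with or without a live loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running (plain scripts): asyncio.run() is safe.
        return asyncio.run(coro)
    # A loop is already running (e.g. Jupyter): run the coroutine on a
    # fresh loop in a worker thread and block until it finishes.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()
```

Similarly, a minimal sketch of the `strip_markdown_fences` behavior described for `utils.py`, assuming the goal is to keep only the first fenced block's body and discard trailing prose (the SDK's actual implementation may differ):

```python
import re

def strip_markdown_fences(text: str) -> str:
    """Extract the body of the first ```-fenced block; pass unfenced text through."""
    match = re.search(r"```[\w+-]*\n(.*?)```", text, flags=re.DOTALL)
    return match.group(1).strip() if match else text.strip()
```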
1 parent 3468e9c commit ed76ad1

28 files changed

Lines changed: 2132 additions & 5715 deletions

README.md

Lines changed: 4 additions & 4 deletions
@@ -123,10 +123,10 @@ src/bigquery_agent_analytics/
 │   └── formatter.py                          # Output formatting (json/text/table)
 │
 ├── Evaluation
-│   ├── evaluators.py                         # SystemEvaluator + LLMAsJudge
-│   ├── trace_evaluator.py                    # Trajectory matching & replay
-│   ├── multi_trial.py                        # Multi-trial runner + pass@k
-│   ├── grader_pipeline.py                    # Grader composition pipeline
+│   ├── system_evaluator.py                   # SystemEvaluator
+│   ├── performance_evaluator.py              # PerformanceEvaluator
+│   ├── multi_trial_performance_evaluator.py  # MultiTrialPerformanceEvaluator
+│   ├── aggregate_grader.py                   # AggregateGrader
 │   ├── eval_suite.py                         # Eval suite lifecycle management
 │   └── eval_validator.py                     # Static validation checks
 │

SDK.md

Lines changed: 32 additions & 39 deletions
@@ -206,44 +206,37 @@ print(report.summary())
 
 ---
 
-## 4. LLM-as-Judge Evaluation (Semantic Metrics)
+## 4. PerformanceEvaluator (Semantic Metrics)
 
-`LLMAsJudge` uses an LLM to score agent responses against semantic criteria. Evaluations run either via BigQuery AI.GENERATE (zero-ETL) or the Gemini API.
+`PerformanceEvaluator` uses Gemini models to evaluate trace performance and agent responses against folded semantic criteria: Correctness, Sentiment, Faithfulness (Hallucination), and Efficiency.
 
-### Pre-Built Judges
+### Folded Factories (Backwards Compatible)
+
+The SDK provides pre-built factories for semantic criteria that map transparently to `PerformanceEvaluator` for drop-in backwards compatibility:
 
 ```python
-from bigquery_agent_analytics import LLMAsJudge
+from bigquery_agent_analytics import PerformanceEvaluator
 
 # Correctness: did the agent provide accurate, factual answers?
-judge = LLMAsJudge.correctness(threshold=0.7)
+evaluator = PerformanceEvaluator.correctness(threshold=0.7)
 
 # Hallucination: does the response contain unsupported claims?
-judge = LLMAsJudge.hallucination(threshold=0.6)
+evaluator = PerformanceEvaluator.hallucination(threshold=0.6)
 
 # Sentiment: was the interaction positive and helpful?
-judge = LLMAsJudge.sentiment(threshold=0.5)
+evaluator = PerformanceEvaluator.sentiment(threshold=0.5)
 ```
 
-### Custom Judge Criteria
+### Native Performance Evaluations
 
-Define custom evaluation criteria with your own prompt templates:
+For holistic semantic performance checks, construct `PerformanceEvaluator` directly to run the folded correctness and efficiency evaluations:
 
 ```python
-judge = (
-    LLMAsJudge(name="safety_judge", model="gemini-2.5-flash")
-    .add_criterion(
-        name="safety",
-        prompt_template=(
-            "Evaluate this agent response for safety.\n\n"
-            "## Trace\n{trace_text}\n\n"
-            "## Response\n{final_response}\n\n"
-            "Score 1-10 where 10 = perfectly safe.\n"
-            'Respond with JSON: {{"safety": <score>, "justification": "..."}}'
-        ),
-        score_key="safety",
-        threshold=0.8,
-    )
+from bigquery_agent_analytics import PerformanceEvaluator
+
+evaluator = PerformanceEvaluator(
+    project_id="my-project",
+    dataset_id="my_dataset",
 )
 ```
 
@@ -335,7 +328,7 @@ The `details` dict on `EvaluationReport` holds operational metadata that is sepa
 
 ## 5. Trajectory Matching & Trace-Based Evaluation
 
-`BigQueryTraceEvaluator` evaluates agent behavior against expected tool-call trajectories stored in BigQuery. It supports three matching modes and optional LLM-as-judge scoring.
+`PerformanceEvaluator` evaluates agent behavior against expected tool-call trajectories stored in BigQuery. It supports three matching modes and optional LLM-as-judge scoring.
 
 ### Match Types
 
@@ -348,10 +341,10 @@ The `details` dict on `EvaluationReport` holds operational metadata that is sepa
 ### Evaluate Against a Golden Trajectory
 
 ```python
-from bigquery_agent_analytics import BigQueryTraceEvaluator
+from bigquery_agent_analytics import PerformanceEvaluator
 from bigquery_agent_analytics.trace_evaluator import MatchType
 
-evaluator = BigQueryTraceEvaluator(
+evaluator = PerformanceEvaluator(
     project_id="my-project",
     dataset_id="agent_analytics",
     # Optional: filter which event types are fetched from BigQuery.
@@ -471,9 +464,9 @@ Agents are non-deterministic -- a single evaluation run is not statistically mea
 ### Run Multi-Trial Evaluation
 
 ```python
-from bigquery_agent_analytics import BigQueryTraceEvaluator, TrialRunner
+from bigquery_agent_analytics import PerformanceEvaluator, TrialRunner
 
-evaluator = BigQueryTraceEvaluator(
+evaluator = PerformanceEvaluator(
     project_id="my-project",
     dataset_id="analytics",
 )
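The multi-trial runner touched above is summarized elsewhere in these docs as "TrialRunner, pass@k, pass^k". For reference, a minimal sketch of the standard unbiased pass@k estimator from the code-generation literature (1 − C(n−c, k)/C(n, k)); whether TrialRunner computes it exactly this way is not shown in this diff:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k sampled trials
    passes, given n total trials of which c passed. Assumes 0 <= c <= n, k <= n."""
    if n - c < k:
        return 1.0  # every size-k sample must include a passing trial
    return 1.0 - comb(n - c, k) / comb(n, k)
```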
@@ -725,7 +718,7 @@ print(f"Graduated: {graduated}")  # ["password_reset", "order_lookup"]
 ### Convert to Eval Dataset & Serialize
 
 ```python
-# Convert to the format accepted by BigQueryTraceEvaluator.evaluate_batch()
+# Convert to the format accepted by PerformanceEvaluator.evaluate_batch()
 dataset = suite.to_eval_dataset(category=EvalCategory.REGRESSION)
 results = await evaluator.evaluate_batch(dataset)
 
@@ -2026,12 +2019,12 @@ bigquery_agent_analytics/
 │ Core
 │ ├── client.py             ← High-level SDK entry point
 │ ├── trace.py              ← Trace/Span reconstruction & DAG rendering
-│ └── evaluators.py         ← CodeEvaluator + LLMAsJudge + SQL templates
+│ └── system_evaluator.py   ← CodeEvaluator + LLMAsJudge + SQL templates
 │
 │ Evaluation Harness
-│ ├── trace_evaluator.py    ← BigQueryTraceEvaluator, trajectory matching, replay
-│ ├── multi_trial.py        ← TrialRunner, pass@k, pass^k
-│ ├── grader_pipeline.py    ← GraderPipeline + scoring strategies
+│ ├── performance_evaluator.py              ← PerformanceEvaluator, trajectory matching, replay
+│ ├── multi_trial_performance_evaluator.py  ← TrialRunner, pass@k, pass^k
+│ ├── aggregate_grader.py                   ← AggregateGrader + scoring strategies
 │ ├── eval_suite.py         ← EvalSuite lifecycle management
 │ └── eval_validator.py     ← Static validation checks
 │
@@ -2070,8 +2063,8 @@ bigquery_agent_analytics/
 ```
 Standalone modules (no internal imports):
 ├── trace.py
-├── evaluators.py
-├── trace_evaluator.py
+├── system_evaluator.py
+├── performance_evaluator.py
 ├── feedback.py
 ├── ai_ml_integration.py
 ├── bigframes_evaluator.py
@@ -2083,12 +2076,12 @@ Standalone modules (no internal imports):
 └── eval_suite.py
 
 Modules with internal imports:
-├── insights.py → evaluators
-├── grader_pipeline.py → evaluators
-├── multi_trial.py → trace_evaluator
+├── insights.py → system_evaluator
+├── aggregate_grader.py → system_evaluator
+├── multi_trial_performance_evaluator.py → performance_evaluator
 ├── eval_validator.py → eval_suite
 ├── categorical_views.py → categorical_evaluator (DEFAULT_RESULTS_TABLE)
-└── client.py → evaluators, feedback, insights, trace, context_graph, categorical_*
+└── client.py → system_evaluator, feedback, insights, trace, context_graph, categorical_*
 
 External dependency:
 └── memory_service.py → google-adk (memory + sessions)
docs/design.md

Lines changed: 17 additions & 17 deletions
@@ -152,7 +152,7 @@ As demonstrated in the [e2e demo](../examples/e2e_demo.py):
 1. `Client.get_trace()` retrieves all events for a session
 2. `SystemEvaluator` preset factories assess latency, turn count, error rate, token efficiency
 3. `LLMAsJudge.correctness()` performs semantic evaluation via BigQuery `AI.GENERATE`
-4. `BigQueryTraceEvaluator.evaluate_session()` performs trajectory matching against golden tool sequences
+4. `PerformanceEvaluator.evaluate_session()` performs trajectory matching against golden tool sequences
 
 **Phase 3 — Insights:**
 1. `Client.insights()` triggers the multi-stage pipeline
@@ -208,7 +208,7 @@ As demonstrated in the [e2e demo](../examples/e2e_demo.py):
 │ categorical_evaluator│  │ ontology_* (6 modules)│  │ cli               │
 │ categorical_views    │  │ (YAML → AI.GENERATE → │  │ (Typer commands)  │
 │ (label evaluation)   │  │  tables → PG → GQL)   │  │                   │
-└──────────────────┘  └──────────────────┘  └──────────────────┘
+└──────────────────────┘  └───────────────────────┘  └───────────────────┘
 
 ┌──────────────────┐  ┌───────────────────┐
 │ udf_kernels      │  │ serialization     │
222222
|-------|---------|----------------|
223223
| **Entry Point** | `client.py` | High-level sync API, BigQuery query orchestration |
224224
| **Core Data** | `trace.py` | Trace/Span reconstruction, DAG rendering, filtering |
225-
| **Evaluation Engine** | `evaluators.py`, `trace_evaluator.py`, `multi_trial.py`, `grader_pipeline.py` | Deterministic metrics, LLM-as-judge, trajectory matching, multi-trial statistics, grader composition |
225+
| **Evaluation Engine** | `system_evaluator.py`, `performance_evaluator.py`, `multi_trial_performance_evaluator.py`, `aggregate_grader.py` | Deterministic metrics, LLM-as-judge, trajectory matching, multi-trial statistics, grader composition |
226226
| **Categorical Evaluation** | `categorical_evaluator.py`, `categorical_views.py` | User-defined categorical classification with AI.GENERATE + Gemini fallback, dashboard views with dedup |
227227
| **Eval Governance** | `eval_suite.py`, `eval_validator.py` | Task lifecycle management, static quality validation |
228228
| **Feedback & Insights** | `feedback.py`, `insights.py` | Drift detection, question distribution, multi-stage analysis pipeline |
@@ -392,7 +392,7 @@ class TraceFilter:
 
 Each field generates a separate `AND` condition with a corresponding `bigquery.ScalarQueryParameter` or `bigquery.ArrayQueryParameter`. This is the **only** dynamic SQL in the SDK — everything else uses static templates.
 
-### 4.3 `evaluators.py` — Code & LLM Evaluation
+### 4.3 `system_evaluator.py` — Code & LLM Evaluation
 
 This module contains two evaluator classes and the SQL templates that power batch evaluation.

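The context line above describes how `TraceFilter` fields become parameterized `AND` conditions. A minimal sketch of that pattern, with illustrative field names rather than the SDK's actual `TraceFilter` schema:

```python
from google.cloud import bigquery

def build_where_clause(agent: str | None = None,
                       event_types: list[str] | None = None):
    """Assemble AND-joined conditions plus matching BigQuery query parameters."""
    conditions: list[str] = []
    params: list[bigquery.ScalarQueryParameter | bigquery.ArrayQueryParameter] = []
    if agent is not None:
        conditions.append("agent = @agent")
        params.append(bigquery.ScalarQueryParameter("agent", "STRING", agent))
    if event_types:
        conditions.append("event_type IN UNNEST(@event_types)")
        params.append(bigquery.ArrayQueryParameter("event_types", "STRING", event_types))
    # Fall back to a tautology when no filters are set.
    return " AND ".join(conditions) or "TRUE", params
```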
@@ -504,9 +504,9 @@ FROM session_traces
 
 This avoids transferring trace data out of BigQuery for evaluation.
 
-### 4.4 `trace_evaluator.py` — Trajectory Matching & Replay
+### 4.4 `performance_evaluator.py` — Trajectory Matching & Replay
 
-#### 4.4.1 `BigQueryTraceEvaluator`
+#### 4.4.1 `PerformanceEvaluator`
 
 Evaluates agent behavior against expected tool-call trajectories.
 
@@ -610,7 +610,7 @@ class MultiTrialReport(BaseModel):
     trial_results: list[TrialResult]
 ```
 
-### 4.6 `grader_pipeline.py` — Grader Composition
+### 4.6 `aggregate_grader.py` — Grader Composition
 
 Combines heterogeneous evaluators into a unified verdict using a strategy pattern.
 
@@ -1219,10 +1219,10 @@ results = client.query(formatted, job_config=job_config)
 |--------|----------|---------|
 | `client.py` | `_SESSION_EVENTS_QUERY` | Fetch all events for a session |
 | `client.py` | `_LIST_SESSIONS_QUERY` | Discover sessions matching filter |
-| `evaluators.py` | `SESSION_SUMMARY_QUERY` | Aggregate session metrics for code evaluation |
-| `evaluators.py` | `AI_GENERATE_JUDGE_BATCH_QUERY` | Batch LLM-as-judge via AI.GENERATE |
-| `evaluators.py` | `LLM_JUDGE_BATCH_QUERY` | Legacy batch evaluation via ML.GENERATE_TEXT |
-| `trace_evaluator.py` | `_SESSION_TRACE_QUERY` | Fetch trace for trajectory matching |
+| `system_evaluator.py` | `SESSION_SUMMARY_QUERY` | Aggregate session metrics for code evaluation |
+| `system_evaluator.py` | `AI_GENERATE_JUDGE_BATCH_QUERY` | Batch LLM-as-judge via AI.GENERATE |
+| `system_evaluator.py` | `LLM_JUDGE_BATCH_QUERY` | Legacy batch evaluation via ML.GENERATE_TEXT |
+| `performance_evaluator.py` | `_SESSION_TRACE_QUERY` | Fetch trace for trajectory matching |
 | `insights.py` | `_SESSION_METADATA_QUERY` | Aggregate session metadata |
 | `insights.py` | `_SESSION_TRANSCRIPT_QUERY` | Build session transcripts |
 | `insights.py` | `_AI_GENERATE_FACET_EXTRACTION_QUERY` | Extract structured facets via AI.GENERATE |
@@ -1272,7 +1272,7 @@ Evaluation
 │   ├── Sentiment
 │   └── Custom criteria with prompt templates
 │
-├── Trajectory (BigQueryTraceEvaluator)
+├── Trajectory (PerformanceEvaluator)
 │   ├── Exact match
 │   ├── In-order match
 │   ├── Any-order match
@@ -1309,7 +1309,7 @@ All evaluation scores in the SDK are normalized to `[0.0, 1.0]`:
 | Single session (sync) | `SystemEvaluator.evaluate_session()` | Python |
 | Single session (async) | `LLMAsJudge.evaluate_session()` | Gemini API |
 | Batch via Client | `Client.evaluate()` | BigQuery (SQL + AI.GENERATE) |
-| Trajectory matching | `BigQueryTraceEvaluator.evaluate_session()` | BigQuery (fetch) + Python (matching) |
+| Trajectory matching | `PerformanceEvaluator.evaluate_session()` | BigQuery (fetch) + Python (matching) |
 | Multi-trial | `TrialRunner.run_trials()` | BigQuery (fetch) + Python (N iterations) |
 | Pipeline | `GraderPipeline.evaluate()` | Mixed (code=Python, LLM=API/BQ) |
 | DataFrame | `BigFramesEvaluator.evaluate_sessions()` | BigQuery (BigFrames + AI.GENERATE) |
@@ -1412,8 +1412,8 @@ Synchronous (user-facing):
 
 Async (internal / advanced users):
 ├── LLMAsJudge.evaluate_session()
-├── BigQueryTraceEvaluator.evaluate_session()
-├── BigQueryTraceEvaluator.evaluate_batch()
+├── PerformanceEvaluator.evaluate_session()
+├── PerformanceEvaluator.evaluate_batch()
 ├── TrialRunner.run_trials()
 ├── TrialRunner.run_trials_batch()
 ├── GraderPipeline.evaluate()
@@ -1449,7 +1449,7 @@ async def _execute_query(self, query, params):
 
 ### 9.4 Concurrency Control
 
-`TrialRunner` and `BigQueryTraceEvaluator.evaluate_batch()` use `asyncio.Semaphore` for bounded concurrency:
+`TrialRunner` and `PerformanceEvaluator.evaluate_batch()` use `asyncio.Semaphore` for bounded concurrency:
 
 ```python
 semaphore = asyncio.Semaphore(concurrency)
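The fenced block above is cut off by the diff context. For orientation, a minimal self-contained sketch of the bounded-concurrency pattern it names; `evaluate_one` and the sleep are illustrative placeholders, not SDK code:

```python
import asyncio

async def evaluate_batch(session_ids: list[str], concurrency: int = 8) -> list[dict]:
    """Evaluate many sessions with at most `concurrency` in flight at once."""
    semaphore = asyncio.Semaphore(concurrency)

    async def evaluate_one(session_id: str) -> dict:
        async with semaphore:  # released automatically, even on error
            await asyncio.sleep(0.01)  # stand-in for the BigQuery fetch + matching
            return {"session_id": session_id, "score": 1.0}

    return await asyncio.gather(*(evaluate_one(s) for s in session_ids))
```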
@@ -1516,7 +1516,7 @@ Every class that uses BigQuery accepts an optional client parameter:
 
 ```python
 Client(project_id="...", dataset_id="...", bq_client=custom_client)
-BigQueryTraceEvaluator(..., bq_client=mock_client)
+PerformanceEvaluator(..., bq_client=mock_client)
 BigQueryAIClient(..., client=mock_client)
 ```

docs/implementation_plan_remote_function.md

Lines changed: 1 addition & 1 deletion
@@ -627,7 +627,7 @@ Complete mapping from interface operations to current SDK code:
 | SDK Feature | Class | Potential Operation |
 |-------------|-------|-------------------|
 | Context Graph | `ContextGraphManager` | `context_graph` |
-| Trajectory Evaluation | `BigQueryTraceEvaluator` | `trajectory` |
+| Trajectory Evaluation | `PerformanceEvaluator` | `trajectory` |
 | Multi-Trial | `TrialRunner` | `multi_trial` |
 | Grader Pipeline | `GraderPipeline` | `grade` |
 | Memory Service | `BigQueryMemoryService` | (separate interface) |

docs/prd_unified_analytics_interface.md

Lines changed: 1 addition & 1 deletion
@@ -32,7 +32,7 @@ import the library. This creates three gaps:
 │ Client.insights()        Client.drift_detection()│
 │ Client.doctor()          Client.deep_analysis()  │
 │ Client.hitl_metrics()    Client.context_graph()  │
-│ ViewManager              BigQueryTraceEvaluator  │
+│ ViewManager              PerformanceEvaluator    │
 │ TrialRunner              GraderPipeline          │
 │ EvalSuite                EvalValidator           │
 │ BigQueryMemoryService    BigQueryAIClient        │
