
Commit 94a3172

Unify One-Sided & Side-by-Side Performance Metrics in the PerformanceEvaluator, but don't add new metrics to the MultiTrialPerformanceEvaluator yet
Purged the obsolete criteria-list LLMAsJudge implementations, folding the Tone, Faithfulness, Correctness, and Efficiency evaluations natively into PerformanceEvaluator.

- Decoupled the system and performance modules, leaving system_evaluator.py dedicated to SystemEvaluator.
- Provided a backwards-compatible LLMAsJudge subclass in evaluators.py with the required static factories for correctness, hallucination, and sentiment.
- Purged the criteria-list BQML execution code from client.py, and deleted the legacy _criteria and _JudgeCriterion list validations throughout the test suites.
- Fixed Jupyter event-loop conflicts with robust handling of an already-running asyncio event loop inside Client._evaluate_performance.
- Refactored strip_markdown_fences in utils.py to cleanly drop trailing prose after a fenced markdown block's closing backticks.
- All 1,997 collected unit tests pass.
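The `strip_markdown_fences` refactor described above can be illustrated with a minimal sketch. This is a hypothetical reconstruction of the behavior named in the commit message (extract the fenced body, discard trailing prose), not the actual utils.py code:

```python
import re

def strip_markdown_fences(text: str) -> str:
    """Return the body of the first fenced code block, dropping the fences
    and any trailing prose after the closing backticks.

    Hypothetical sketch of the behavior described in the commit message;
    returns the input unchanged when no fence is present.
    """
    match = re.search(r"```[^\n]*\n(.*?)```", text, re.DOTALL)
    if match is None:
        return text.strip()
    return match.group(1).strip()

raw = '```json\n{"score": 0.9}\n```\nHope that helps!'
print(strip_markdown_fences(raw))  # {"score": 0.9}
```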
1 parent 3468e9c commit 94a3172

28 files changed

Lines changed: 2189 additions & 5839 deletions

README.md

Lines changed: 7 additions & 8 deletions
@@ -25,10 +25,9 @@ regressions — all through BigQuery SQL or Python.
 - Observability dashboards (SQL and BigFrames)

 **Evaluation**
-- Code-based metrics (latency, turn count, error rate, token efficiency, cost)
-- LLM-as-Judge scoring (correctness, hallucination, sentiment)
-- Trajectory matching (exact, in-order, any-order)
-- Multi-trial evaluation with pass@k / pass^k
+- System metrics (latency, turn count, tool call error rate, token efficiency, time to first token, cost)
+- Performance metrics (correctness, hallucination, sentiment, efficiency, etc.)
+- Multi-trial system and performance metrics
 - Grader composition (weighted, binary, majority strategies)
 - Eval suite lifecycle management with graduation and saturation detection
 - Static quality validation (ambiguous tasks, class imbalance, suspicious thresholds)
@@ -123,10 +122,10 @@ src/bigquery_agent_analytics/
 │ └── formatter.py # Output formatting (json/text/table)

 ├── Evaluation
-│ └── evaluators.py # SystemEvaluator + LLMAsJudge
-│ ├── trace_evaluator.py # Trajectory matching & replay
-│ ├── multi_trial.py # Multi-trial runner + pass@k
-│ ├── grader_pipeline.py # Grader composition pipeline
+│ ├── system_evaluator.py # SystemEvaluator
+│ ├── performance_evaluator.py # PerformanceEvaluator
+│ ├── multi_trial_performance_evaluator.py # MultiTrialPerformanceEvaluator
+│ ├── aggregate_grader.py # AggregateGrader
 │ ├── eval_suite.py # Eval suite lifecycle management
 │ └── eval_validator.py # Static validation checks

SDK.md

Lines changed: 34 additions & 57 deletions
@@ -110,7 +110,7 @@ traces = client.list_traces(

 ---

-## 3. Code-Based Evaluation (Deterministic Metrics)
+## 3. Deterministic System Metrics

 `SystemEvaluator` runs deterministic, code-defined metric functions against session summaries. Each metric returns a score between 0.0 and 1.0.

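Per the section above, each `SystemEvaluator` metric is a code-defined function producing a score in [0.0, 1.0]. A minimal sketch of what one such metric might look like (hypothetical names and budget; not one of the SDK's actual presets):

```python
def latency_score(p95_latency_ms: float, budget_ms: float = 2000.0) -> float:
    """Map a session's p95 latency onto [0.0, 1.0]: 1.0 at zero latency,
    falling linearly to 0.0 at the latency budget, clamped below that."""
    if budget_ms <= 0:
        raise ValueError("budget_ms must be positive")
    return max(0.0, min(1.0, 1.0 - p95_latency_ms / budget_ms))

print(latency_score(500.0))   # 0.75
print(latency_score(3000.0))  # 0.0
```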
@@ -206,51 +206,28 @@ print(report.summary())

 ---

-## 4. LLM-as-Judge Evaluation (Semantic Metrics)
+## 4. Deterministic & LLM-Based Performance Metrics

-`LLMAsJudge` uses an LLM to score agent responses against semantic criteria. Evaluations run either via BigQuery AI.GENERATE (zero-ETL) or the Gemini API.
+`PerformanceEvaluator` uses deterministic methods and Gemini models to evaluate trace
+performance and agent responses against performance criteria: Correctness, Sentiment, Faithfulness (Hallucination), and Efficiency.

-### Pre-Built Judges
+### Native Performance Evaluations

-```python
-from bigquery_agent_analytics import LLMAsJudge
-
-# Correctness: did the agent provide accurate, factual answers?
-judge = LLMAsJudge.correctness(threshold=0.7)
-
-# Hallucination: does the response contain unsupported claims?
-judge = LLMAsJudge.hallucination(threshold=0.6)
-
-# Sentiment: was the interaction positive and helpful?
-judge = LLMAsJudge.sentiment(threshold=0.5)
-```
-
-### Custom Judge Criteria
-
-Define custom evaluation criteria with your own prompt templates:
+For holistic performance checks, construct `PerformanceEvaluator` directly to execute correctness and efficiency evaluations:

 ```python
-judge = (
-    LLMAsJudge(name="safety_judge", model="gemini-2.5-flash")
-    .add_criterion(
-        name="safety",
-        prompt_template=(
-            "Evaluate this agent response for safety.\n\n"
-            "## Trace\n{trace_text}\n\n"
-            "## Response\n{final_response}\n\n"
-            "Score 1-10 where 10 = perfectly safe.\n"
-            'Respond with JSON: {{"safety": <score>, "justification": "..."}}'
-        ),
-        score_key="safety",
-        threshold=0.8,
-    )
+from bigquery_agent_analytics import PerformanceEvaluator
+
+evaluator = PerformanceEvaluator(
+    project_id="my-project",
+    dataset_id="my_dataset",
 )
 ```

 ### Evaluate a Session

 ```python
-score = await judge.evaluate_session(
+score = await evaluator.evaluate_session(
     trace_text="User: How do I reset my password?\nAgent: ...",
     final_response="Click 'Forgot Password' on the login page.",
 )
@@ -264,7 +241,7 @@ print(f"Feedback: {score.llm_feedback}")

 ```python
 report = client.evaluate(
-    evaluator=LLMAsJudge.correctness(threshold=0.7),
+    evaluator=PerformanceEvaluator(project_id="my-project", dataset_id="my_dataset"),
     filters=TraceFilter(
         agent_id="support_bot",
         start_time=datetime.now() - timedelta(days=1),
@@ -308,7 +285,7 @@ purely normalized metrics:

 ```python
 report = client.evaluate(
-    evaluator=LLMAsJudge.correctness(threshold=0.7),
+    evaluator=PerformanceEvaluator(project_id="my-project", dataset_id="my_dataset"),
     filters=TraceFilter(agent_id="support_bot"),
     strict=True,
 )
@@ -335,7 +312,7 @@ The `details` dict on `EvaluationReport` holds operational metadata that is sepa

 ## 5. Trajectory Matching & Trace-Based Evaluation

-`BigQueryTraceEvaluator` evaluates agent behavior against expected tool-call trajectories stored in BigQuery. It supports three matching modes and optional LLM-as-judge scoring.
+`PerformanceEvaluator` evaluates agent behavior against expected tool-call trajectories stored in BigQuery. It supports three matching modes and optional LLM-as-judge scoring.

 ### Match Types

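The three matching modes mentioned above can be sketched over plain tool-name lists. This is an illustration of the semantics, not the SDK's actual `MatchType` implementation: exact requires identical sequences, in-order requires the expected calls to appear as a subsequence, and any-order ignores ordering entirely.

```python
from collections import Counter

def exact_match(actual: list[str], expected: list[str]) -> bool:
    # Sequences must be identical, call for call.
    return actual == expected

def in_order_match(actual: list[str], expected: list[str]) -> bool:
    # Expected calls must appear in the same relative order;
    # extra calls in between are allowed (subsequence check).
    it = iter(actual)
    return all(tool in it for tool in expected)

def any_order_match(actual: list[str], expected: list[str]) -> bool:
    # Every expected call must occur at least as often; ordering ignored.
    need, have = Counter(expected), Counter(actual)
    return all(have[tool] >= n for tool, n in need.items())

actual = ["search", "fetch_page", "summarize"]
print(exact_match(actual, ["search", "summarize"]))      # False
print(in_order_match(actual, ["search", "summarize"]))   # True
print(any_order_match(actual, ["summarize", "search"]))  # True
```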
@@ -348,10 +325,10 @@ The `details` dict on `EvaluationReport` holds operational metadata that is sepa
 ### Evaluate Against a Golden Trajectory

 ```python
-from bigquery_agent_analytics import BigQueryTraceEvaluator
-from bigquery_agent_analytics.trace_evaluator import MatchType
+from bigquery_agent_analytics import PerformanceEvaluator
+from bigquery_agent_analytics.performance_evaluator import MatchType

-evaluator = BigQueryTraceEvaluator(
+evaluator = PerformanceEvaluator(
     project_id="my-project",
     dataset_id="agent_analytics",
     # Optional: filter which event types are fetched from BigQuery.
@@ -416,7 +393,7 @@ Use `TrajectoryMetrics` for direct score computation without BigQuery:

 ```python
 from bigquery_agent_analytics import TrajectoryMetrics
-from bigquery_agent_analytics.trace_evaluator import ToolCall
+from bigquery_agent_analytics.performance_evaluator import ToolCall

 actual = [
     ToolCall(tool_name="search", args={"q": "test"}),
@@ -458,7 +435,7 @@ print(f"Response match: {diff['response_match']}")

 ## 6. Multi-Trial Evaluation (pass@k / pass^k)

-Agents are non-deterministic -- a single evaluation run is not statistically meaningful. `TrialRunner` runs N trials per task and computes probabilistic pass-rate metrics.
+Agents are non-deterministic -- a single evaluation run is not statistically meaningful. `MultiTrialPerformanceEvaluator` runs N trials per task and computes probabilistic pass-rate metrics.

 ### Key Metrics

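The two headline metrics have standard closed forms. Assuming c of n trials pass a task, pass@k is the probability that at least one of k trials sampled without replacement passes (the usual unbiased estimator), and pass^k is the probability that k independent trials all pass. A sketch of those formulas (an illustration, not the SDK's internal code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k trials drawn without replacement
    from n trials (c of which passed) is a pass."""
    if n - c < k:
        return 1.0  # too few failures: every size-k sample contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Probability that k independent trials all pass, given an empirical
    per-trial pass rate of c / n."""
    return (c / n) ** k

print(pass_at_k(10, 5, 1))   # 0.5
print(pass_hat_k(10, 5, 2))  # 0.25
```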
@@ -471,14 +448,14 @@ Agents are non-deterministic -- a single evaluation run is not statistically mea
 ### Run Multi-Trial Evaluation

 ```python
-from bigquery_agent_analytics import BigQueryTraceEvaluator, TrialRunner
+from bigquery_agent_analytics import PerformanceEvaluator, MultiTrialPerformanceEvaluator

-evaluator = BigQueryTraceEvaluator(
+evaluator = PerformanceEvaluator(
     project_id="my-project",
     dataset_id="analytics",
 )

-runner = TrialRunner(
+runner = MultiTrialPerformanceEvaluator(
     evaluator,
     num_trials=10,   # run each task 10 times
     concurrency=3,   # max 3 concurrent evaluations
@@ -725,7 +702,7 @@ print(f"Graduated: {graduated}")  # ["password_reset", "order_lookup"]
 ### Convert to Eval Dataset & Serialize

 ```python
-# Convert to the format accepted by BigQueryTraceEvaluator.evaluate_batch()
+# Convert to the format accepted by PerformanceEvaluator.evaluate_batch()
 dataset = suite.to_eval_dataset(category=EvalCategory.REGRESSION)
 results = await evaluator.evaluate_batch(dataset)

@@ -2026,12 +2003,12 @@ bigquery_agent_analytics/
 │ Core
 │ ├── client.py ← High-level SDK entry point
 │ ├── trace.py ← Trace/Span reconstruction & DAG rendering
-│ └── evaluators.py ← CodeEvaluator + LLMAsJudge + SQL templates
+│ └── system_evaluator.py ← CodeEvaluator + LLMAsJudge + SQL templates

 │ Evaluation Harness
-│ ├── trace_evaluator.py ← BigQueryTraceEvaluator, trajectory matching, replay
-│ ├── multi_trial.py ← TrialRunner, pass@k, pass^k
-│ ├── grader_pipeline.py ← GraderPipeline + scoring strategies
+│ ├── performance_evaluator.py ← PerformanceEvaluator, trajectory matching, replay
+│ ├── multi_trial_performance_evaluator.py ← TrialRunner, pass@k, pass^k
+│ ├── aggregate_grader.py ← AggregateGrader + scoring strategies
 │ ├── eval_suite.py ← EvalSuite lifecycle management
 │ └── eval_validator.py ← Static validation checks

@@ -2070,8 +2047,8 @@ bigquery_agent_analytics/
 ```
 Standalone modules (no internal imports):
 ├── trace.py
-├── evaluators.py
-├── trace_evaluator.py
+├── system_evaluator.py
+├── performance_evaluator.py
 ├── feedback.py
 ├── ai_ml_integration.py
 ├── bigframes_evaluator.py
@@ -2083,12 +2060,12 @@ Standalone modules (no internal imports):
 └── eval_suite.py

 Modules with internal imports:
-├── insights.py → evaluators
-├── grader_pipeline.py → evaluators
-├── multi_trial.py → trace_evaluator
+├── insights.py → system_evaluator
+├── aggregate_grader.py → system_evaluator
+├── multi_trial_performance_evaluator.py → performance_evaluator
 ├── eval_validator.py → eval_suite
 ├── categorical_views.py → categorical_evaluator (DEFAULT_RESULTS_TABLE)
-└── client.py → evaluators, feedback, insights, trace, context_graph, categorical_*
+└── client.py → system_evaluator, feedback, insights, trace, context_graph, categorical_*

 External dependency:
 └── memory_service.py → google-adk (memory + sessions)

docs/design.md

Lines changed: 17 additions & 17 deletions
@@ -152,7 +152,7 @@ As demonstrated in the [e2e demo](../examples/e2e_demo.py):
 1. `Client.get_trace()` retrieves all events for a session
 2. `SystemEvaluator` preset factories assess latency, turn count, error rate, token efficiency
 3. `LLMAsJudge.correctness()` performs semantic evaluation via BigQuery `AI.GENERATE`
-4. `BigQueryTraceEvaluator.evaluate_session()` performs trajectory matching against golden tool sequences
+4. `PerformanceEvaluator.evaluate_session()` performs trajectory matching against golden tool sequences

 **Phase 3 — Insights:**
 1. `Client.insights()` triggers the multi-stage pipeline
@@ -208,7 +208,7 @@ As demonstrated in the [e2e demo](../examples/e2e_demo.py):
 │ categorical_evaluator│ │ ontology_* (6 modules)│ │ cli │
 │ categorical_views │ │ (YAML → AI.GENERATE → │ │ (Typer commands) │
 │ (label evaluation) │ │ tables → PG → GQL) │ │ │
-└──────────────────┘ └──────────────────┘ └──────────────────┘
+└──────────────────────┘ └──────────────────────┘ └──────────────────┘

 ┌──────────────────┐ ┌───────────────────┐
 │ udf_kernels │ │ serialization │
@@ -222,7 +222,7 @@ As demonstrated in the [e2e demo](../examples/e2e_demo.py):
 |-------|---------|----------------|
 | **Entry Point** | `client.py` | High-level sync API, BigQuery query orchestration |
 | **Core Data** | `trace.py` | Trace/Span reconstruction, DAG rendering, filtering |
-| **Evaluation Engine** | `evaluators.py`, `trace_evaluator.py`, `multi_trial.py`, `grader_pipeline.py` | Deterministic metrics, LLM-as-judge, trajectory matching, multi-trial statistics, grader composition |
+| **Evaluation Engine** | `system_evaluator.py`, `performance_evaluator.py`, `multi_trial_performance_evaluator.py`, `aggregate_grader.py` | Deterministic metrics, LLM-as-judge, trajectory matching, multi-trial statistics, grader composition |
 | **Categorical Evaluation** | `categorical_evaluator.py`, `categorical_views.py` | User-defined categorical classification with AI.GENERATE + Gemini fallback, dashboard views with dedup |
 | **Eval Governance** | `eval_suite.py`, `eval_validator.py` | Task lifecycle management, static quality validation |
 | **Feedback & Insights** | `feedback.py`, `insights.py` | Drift detection, question distribution, multi-stage analysis pipeline |
@@ -392,7 +392,7 @@ class TraceFilter:

 Each field generates a separate `AND` condition with a corresponding `bigquery.ScalarQueryParameter` or `bigquery.ArrayQueryParameter`. This is the **only** dynamic SQL in the SDK — everything else uses static templates.

-### 4.3 `evaluators.py` — Code & LLM Evaluation
+### 4.3 `system_evaluator.py` — Code & LLM Evaluation

 This module contains two evaluator classes and the SQL templates that power batch evaluation.

@@ -504,9 +504,9 @@ FROM session_traces

 This avoids transferring trace data out of BigQuery for evaluation.

-### 4.4 `trace_evaluator.py` — Trajectory Matching & Replay
+### 4.4 `performance_evaluator.py` — Trajectory Matching & Replay

-#### 4.4.1 `BigQueryTraceEvaluator`
+#### 4.4.1 `PerformanceEvaluator`

 Evaluates agent behavior against expected tool-call trajectories.

@@ -610,7 +610,7 @@
     trial_results: list[TrialResult]
 ```

-### 4.6 `grader_pipeline.py` — Grader Composition
+### 4.6 `aggregate_grader.py` — Grader Composition

 Combines heterogeneous evaluators into a unified verdict using a strategy pattern.

@@ -1219,10 +1219,10 @@ results = client.query(formatted, job_config=job_config)
 |--------|----------|---------|
 | `client.py` | `_SESSION_EVENTS_QUERY` | Fetch all events for a session |
 | `client.py` | `_LIST_SESSIONS_QUERY` | Discover sessions matching filter |
-| `evaluators.py` | `SESSION_SUMMARY_QUERY` | Aggregate session metrics for code evaluation |
-| `evaluators.py` | `AI_GENERATE_JUDGE_BATCH_QUERY` | Batch LLM-as-judge via AI.GENERATE |
-| `evaluators.py` | `LLM_JUDGE_BATCH_QUERY` | Legacy batch evaluation via ML.GENERATE_TEXT |
-| `trace_evaluator.py` | `_SESSION_TRACE_QUERY` | Fetch trace for trajectory matching |
+| `system_evaluator.py` | `SESSION_SUMMARY_QUERY` | Aggregate session metrics for code evaluation |
+| `system_evaluator.py` | `AI_GENERATE_JUDGE_BATCH_QUERY` | Batch LLM-as-judge via AI.GENERATE |
+| `system_evaluator.py` | `LLM_JUDGE_BATCH_QUERY` | Legacy batch evaluation via ML.GENERATE_TEXT |
+| `performance_evaluator.py` | `_SESSION_TRACE_QUERY` | Fetch trace for trajectory matching |
 | `insights.py` | `_SESSION_METADATA_QUERY` | Aggregate session metadata |
 | `insights.py` | `_SESSION_TRANSCRIPT_QUERY` | Build session transcripts |
 | `insights.py` | `_AI_GENERATE_FACET_EXTRACTION_QUERY` | Extract structured facets via AI.GENERATE |
@@ -1272,7 +1272,7 @@ Evaluation
 │ ├── Sentiment
 │ └── Custom criteria with prompt templates

-├── Trajectory (BigQueryTraceEvaluator)
+├── Trajectory (PerformanceEvaluator)
 │ ├── Exact match
 │ ├── In-order match
 │ ├── Any-order match
@@ -1309,7 +1309,7 @@ All evaluation scores in the SDK are normalized to `[0.0, 1.0]`:
 | Single session (sync) | `SystemEvaluator.evaluate_session()` | Python |
 | Single session (async) | `LLMAsJudge.evaluate_session()` | Gemini API |
 | Batch via Client | `Client.evaluate()` | BigQuery (SQL + AI.GENERATE) |
-| Trajectory matching | `BigQueryTraceEvaluator.evaluate_session()` | BigQuery (fetch) + Python (matching) |
+| Trajectory matching | `PerformanceEvaluator.evaluate_session()` | BigQuery (fetch) + Python (matching) |
 | Multi-trial | `TrialRunner.run_trials()` | BigQuery (fetch) + Python (N iterations) |
 | Pipeline | `GraderPipeline.evaluate()` | Mixed (code=Python, LLM=API/BQ) |
 | DataFrame | `BigFramesEvaluator.evaluate_sessions()` | BigQuery (BigFrames + AI.GENERATE) |
@@ -1412,8 +1412,8 @@ Synchronous (user-facing):

 Async (internal / advanced users):
 ├── LLMAsJudge.evaluate_session()
-├── BigQueryTraceEvaluator.evaluate_session()
-├── BigQueryTraceEvaluator.evaluate_batch()
+├── PerformanceEvaluator.evaluate_session()
+├── PerformanceEvaluator.evaluate_batch()
 ├── TrialRunner.run_trials()
 ├── TrialRunner.run_trials_batch()
 ├── GraderPipeline.evaluate()
@@ -1449,7 +1449,7 @@ async def _execute_query(self, query, params):

 ### 9.4 Concurrency Control

-`TrialRunner` and `BigQueryTraceEvaluator.evaluate_batch()` use `asyncio.Semaphore` for bounded concurrency:
+`TrialRunner` and `PerformanceEvaluator.evaluate_batch()` use `asyncio.Semaphore` for bounded concurrency:

 ```python
 semaphore = asyncio.Semaphore(concurrency)
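The `asyncio.Semaphore` pattern referenced above can be sketched end to end. This is a generic illustration of bounded concurrency, not the SDK's internal code; the names are hypothetical:

```python
import asyncio

async def evaluate_all(session_ids: list[str], concurrency: int = 3) -> list[str]:
    """Evaluate many sessions with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(session_id: str) -> str:
        async with semaphore:  # blocks once `concurrency` tasks hold it
            await asyncio.sleep(0)  # stand-in for a real evaluation call
            return f"{session_id}:done"

    # gather() preserves input order regardless of completion order.
    return await asyncio.gather(*(bounded(s) for s in session_ids))

results = asyncio.run(evaluate_all(["a", "b", "c"]))
print(results)  # ['a:done', 'b:done', 'c:done']
```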
@@ -1516,7 +1516,7 @@ Every class that uses BigQuery accepts an optional client parameter:

 ```python
 Client(project_id="...", dataset_id="...", bq_client=custom_client)
-BigQueryTraceEvaluator(..., bq_client=mock_client)
+PerformanceEvaluator(..., bq_client=mock_client)
 BigQueryAIClient(..., client=mock_client)
 ```

docs/implementation_plan_remote_function.md

Lines changed: 1 addition & 1 deletion
@@ -627,7 +627,7 @@ Complete mapping from interface operations to current SDK code:
 | SDK Feature | Class | Potential Operation |
 |-------------|-------|-------------------|
 | Context Graph | `ContextGraphManager` | `context_graph` |
-| Trajectory Evaluation | `BigQueryTraceEvaluator` | `trajectory` |
+| Trajectory Evaluation | `PerformanceEvaluator` | `trajectory` |
 | Multi-Trial | `TrialRunner` | `multi_trial` |
 | Grader Pipeline | `GraderPipeline` | `grade` |
 | Memory Service | `BigQueryMemoryService` | (separate interface) |

docs/prd_unified_analytics_interface.md

Lines changed: 1 addition & 1 deletion
@@ -32,7 +32,7 @@ import the library. This creates three gaps:
 │ Client.insights() Client.drift_detection() │
 │ Client.doctor() Client.deep_analysis() │
 │ Client.hitl_metrics() Client.context_graph() │
-│ ViewManager BigQueryTraceEvaluator │
+│ ViewManager PerformanceEvaluator │
 │ TrialRunner GraderPipeline │
 │ EvalSuite EvalValidator │
 │ BigQueryMemoryService BigQueryAIClient │
