Commit 914119a

Refactor: Rename CodeEvaluator to SystemEvaluator
Rename CodeEvaluator to SystemEvaluator to align with its focus on system-level metrics. A CodeEvaluator alias is kept in evaluators.py for backward compatibility.
1 parent 292320b commit 914119a
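
The evaluators.py hunk itself is not included in this excerpt, so the exact form of the retained alias is an assumption; the commit message only promises that the old name stays importable. A minimal sketch:

```python
# evaluators.py (sketch): keep the old name working after the rename.
# The plain-assignment form below is an assumption; only the existence
# of a CodeEvaluator alias comes from the commit message.

class SystemEvaluator:
    """Deterministic, code-defined metric evaluation (formerly CodeEvaluator)."""
    ...

# Existing imports such as `from bigquery_agent_analytics import CodeEvaluator`
# keep resolving:
CodeEvaluator = SystemEvaluator
```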

19 files changed: 193 additions & 170 deletions

README.md

Lines changed: 1 addition & 1 deletion
@@ -123,7 +123,7 @@ src/bigquery_agent_analytics/
 │   └── formatter.py        # Output formatting (json/text/table)
 │
 ├── Evaluation
-│   ├── evaluators.py       # CodeEvaluator + LLMAsJudge
+│   ├── evaluators.py       # SystemEvaluator + LLMAsJudge
 │   ├── trace_evaluator.py  # Trajectory matching & replay
 │   ├── multi_trial.py      # Multi-trial runner + pass@k
 │   ├── grader_pipeline.py  # Grader composition pipeline

SDK.md

Lines changed: 16 additions & 16 deletions
@@ -112,29 +112,29 @@ traces = client.list_traces(

 ## 3. Code-Based Evaluation (Deterministic Metrics)

-`CodeEvaluator` runs deterministic, code-defined metric functions against session summaries. Each metric returns a score between 0.0 and 1.0.
+`SystemEvaluator` runs deterministic, code-defined metric functions against session summaries. Each metric returns a score between 0.0 and 1.0.

 ### Pre-Built Evaluators

 The SDK ships with six ready-to-use evaluators:

 ```python
-from bigquery_agent_analytics import CodeEvaluator
+from bigquery_agent_analytics import SystemEvaluator

 # Latency: score degrades linearly as avg latency approaches threshold
-evaluator = CodeEvaluator.latency(threshold_ms=5000)
+evaluator = SystemEvaluator.latency(threshold_ms=5000)

 # Turn count: penalizes sessions with too many back-and-forth turns
-evaluator = CodeEvaluator.turn_count(max_turns=10)
+evaluator = SystemEvaluator.turn_count(max_turns=10)

 # Error rate: penalizes high tool error rates
-evaluator = CodeEvaluator.error_rate(max_error_rate=0.1)
+evaluator = SystemEvaluator.error_rate(max_error_rate=0.1)

 # Token efficiency: checks total token usage stays within budget
-evaluator = CodeEvaluator.token_efficiency(max_tokens=50000)
+evaluator = SystemEvaluator.token_efficiency(max_tokens=50000)

 # Cost per session: checks estimated USD cost stays under budget
-evaluator = CodeEvaluator.cost_per_session(
+evaluator = SystemEvaluator.cost_per_session(
     max_cost_usd=1.0,
     input_cost_per_1k=0.00025,
     output_cost_per_1k=0.00125,
@@ -147,7 +147,7 @@ Define your own metric functions and chain multiple metrics together:

 ```python
 evaluator = (
-    CodeEvaluator(name="my_quality_check")
+    SystemEvaluator(name="my_quality_check")
     .add_metric(
         name="latency",
         fn=lambda s: 1.0 - min(s.get("avg_latency_ms", 0) / 5000, 1.0),
@@ -190,7 +190,7 @@ Run evaluation across all sessions matching a filter:
 from bigquery_agent_analytics import TraceFilter

 report = client.evaluate(
-    evaluator=CodeEvaluator.latency(threshold_ms=3000),
+    evaluator=SystemEvaluator.latency(threshold_ms=3000),
     filters=TraceFilter(agent_id="my_agent"),
 )

@@ -535,7 +535,7 @@ pass_pow_k = compute_pass_pow_k(num_trials=10, num_passed=8) # ~0.107

 ## 7. Grader Composition Pipeline

-Combine multiple evaluators (`CodeEvaluator` + `LLMAsJudge` + custom functions) into a single aggregated verdict using configurable scoring strategies.
+Combine multiple evaluators (`SystemEvaluator` + `LLMAsJudge` + custom functions) into a single aggregated verdict using configurable scoring strategies.

 ### Scoring Strategies

@@ -549,7 +549,7 @@ Combine multiple evaluators (`CodeEvaluator` + `LLMAsJudge` + custom functions)

 ```python
 from bigquery_agent_analytics import (
-    CodeEvaluator, GraderPipeline, LLMAsJudge,
+    SystemEvaluator, GraderPipeline, LLMAsJudge,
     WeightedStrategy, GraderResult,
 )

@@ -562,8 +562,8 @@ pipeline = (
         },
         threshold=0.6,
     ))
-    .add_code_grader(CodeEvaluator.latency(threshold_ms=5000), weight=0.2)
-    .add_code_grader(CodeEvaluator.cost_per_session(max_cost_usd=0.50), weight=0.1)
+    .add_code_grader(SystemEvaluator.latency(threshold_ms=5000), weight=0.2)
+    .add_code_grader(SystemEvaluator.cost_per_session(max_cost_usd=0.50), weight=0.1)
     .add_llm_grader(LLMAsJudge.correctness(threshold=0.7), weight=0.7)
 )

@@ -592,8 +592,8 @@ from bigquery_agent_analytics import BinaryStrategy

 pipeline = (
     GraderPipeline(BinaryStrategy())
-    .add_code_grader(CodeEvaluator.latency(threshold_ms=3000))
-    .add_code_grader(CodeEvaluator.error_rate(max_error_rate=0.05))
+    .add_code_grader(SystemEvaluator.latency(threshold_ms=3000))
+    .add_code_grader(SystemEvaluator.error_rate(max_error_rate=0.05))
     .add_llm_grader(LLMAsJudge.hallucination(threshold=0.8))
 )

@@ -623,7 +623,7 @@ def business_rules_grader(context):

 pipeline = (
     GraderPipeline(BinaryStrategy())
-    .add_code_grader(CodeEvaluator.latency())
+    .add_code_grader(SystemEvaluator.latency())
     .add_custom_grader("business_rules", business_rules_grader)
 )
 ```
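
Taken together, the hunks above rename the whole documented surface of SDK.md. A short sketch of the renamed API in use; the summary dict and the `evaluate_session()` call shape are assumptions based on the prose above ("runs deterministic, code-defined metric functions against session summaries") and the mode table in docs/design.md:

```python
from bigquery_agent_analytics import SystemEvaluator

# Chain the documented latency kernel onto a named evaluator.
evaluator = SystemEvaluator(name="post_rename_check").add_metric(
    name="latency",
    # 1.0 at 0 ms, decaying linearly to 0.0 at or beyond 5000 ms.
    fn=lambda s: 1.0 - min(s.get("avg_latency_ms", 0) / 5000, 1.0),
)

summary = {"avg_latency_ms": 1250}            # hypothetical session summary
result = evaluator.evaluate_session(summary)  # assumed call shape
```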

docs/design.md

Lines changed: 11 additions & 11 deletions
@@ -150,7 +150,7 @@ As demonstrated in the [e2e demo](../examples/e2e_demo.py):

 **Phase 2 — Evaluation:**
 1. `Client.get_trace()` retrieves all events for a session
-2. `CodeEvaluator` preset factories assess latency, turn count, error rate, token efficiency
+2. `SystemEvaluator` preset factories assess latency, turn count, error rate, token efficiency
 3. `LLMAsJudge.correctness()` performs semantic evaluation via BigQuery `AI.GENERATE`
 4. `BigQueryTraceEvaluator.evaluate_session()` performs trajectory matching against golden tool sequences

@@ -208,7 +208,7 @@ As demonstrated in the [e2e demo](../examples/e2e_demo.py):
 │ categorical_evaluator│ │ ontology_* (6 modules)│ │ cli │
 │ categorical_views │ │ (YAML → AI.GENERATE → │ │ (Typer commands) │
 │ (label evaluation) │ │ tables → PG → GQL) │ │ │
-└──────────────────────┘ └──────────────────────┘ └──────────────────┘
+└──────────────────┘ └──────────────────┘ └──────────────────┘

 ┌──────────────────┐ ┌───────────────────┐
 │ udf_kernels │ │ serialization │
@@ -248,7 +248,7 @@ Aggregations, filtering, joins, and even LLM evaluation (via `AI.GENERATE`) are
 LLM-based evaluation can run via (1) BigQuery `AI.GENERATE`, (2) legacy BigQuery ML `ML.GENERATE_TEXT`, or (3) the Gemini API directly. This maximizes compatibility across different GCP configurations.

 **Decision 4: Composition over inheritance.**
-The `GraderPipeline` composes `CodeEvaluator`, `LLMAsJudge`, and custom functions via a builder pattern rather than requiring them to share a common base class. The `BigQueryMemoryService` composes four internal services rather than extending a single monolithic class.
+The `GraderPipeline` composes `SystemEvaluator`, `LLMAsJudge`, and custom functions via a builder pattern rather than requiring them to share a common base class. The `BigQueryMemoryService` composes four internal services rather than extending a single monolithic class.

 ---

@@ -396,7 +396,7 @@ Each field generates a separate `AND` condition with a corresponding `bigquery.S

 This module contains two evaluator classes and the SQL templates that power batch evaluation.

-#### 4.3.1 `CodeEvaluator`
+#### 4.3.1 `SystemEvaluator`

 Deterministic evaluation using code-defined metric functions.

@@ -626,7 +626,7 @@ Combines heterogeneous evaluators into a unified verdict using a strategy patter

        ┌──────────────┼──────────────┐
        ▼              ▼              ▼
-  CodeEvaluator   LLMAsJudge     Custom Fn
+ SystemEvaluator  LLMAsJudge     Custom Fn
     (sync)         (async)        (sync)
        │              │              │
        ▼              ▼              ▼
@@ -1258,7 +1258,7 @@ results = client.query(formatted, job_config=job_config)

 ```
 Evaluation
-├── Deterministic (CodeEvaluator)
+├── Deterministic (SystemEvaluator)
 │   ├── Latency
 │   ├── Turn count
 │   ├── Error rate
@@ -1306,7 +1306,7 @@ All evaluation scores in the SDK are normalized to `[0.0, 1.0]`:

 | Mode | Evaluator | Where Computation Runs |
 |------|-----------|----------------------|
-| Single session (sync) | `CodeEvaluator.evaluate_session()` | Python |
+| Single session (sync) | `SystemEvaluator.evaluate_session()` | Python |
 | Single session (async) | `LLMAsJudge.evaluate_session()` | Gemini API |
 | Batch via Client | `Client.evaluate()` | BigQuery (SQL + AI.GENERATE) |
 | Trajectory matching | `BigQueryTraceEvaluator.evaluate_session()` | BigQuery (fetch) + Python (matching) |
@@ -1405,7 +1405,7 @@ Synchronous (user-facing):
 ├── Client.drift_detection()
 ├── Client.insights()
 ├── Client.deep_analysis()
-├── CodeEvaluator.evaluate_session()
+├── SystemEvaluator.evaluate_session()
 ├── EvalSuite.*
 ├── EvalValidator.*
 └── BigFramesEvaluator.*
@@ -1465,10 +1465,10 @@ results = await asyncio.gather(*[_run_one(t) for t in tasks])

 ## 10. Extensibility & Plugin Points

-### 10.1 Custom Metrics (CodeEvaluator)
+### 10.1 Custom Metrics (SystemEvaluator)

 ```python
-evaluator = CodeEvaluator(name="custom").add_metric(
+evaluator = SystemEvaluator(name="custom").add_metric(
     name="business_metric",
     fn=lambda session: your_scoring_logic(session),
     threshold=0.7,
@@ -1571,7 +1571,7 @@ All tests mock BigQuery — no GCP credentials or live BigQuery access is needed
 ```
 tests/
 ├── test_sdk_client.py      # Client integration tests
-├── test_sdk_evaluators.py  # CodeEvaluator + LLMAsJudge
+├── test_sdk_evaluators.py  # SystemEvaluator + LLMAsJudge
 ├── test_sdk_trace.py       # Trace/Span reconstruction
 ├── test_sdk_feedback.py    # Drift detection
 ├── test_sdk_insights.py    # Insights pipeline
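
Two invariants recur across these design.md hunks: every score is normalized to `[0.0, 1.0]`, and `GraderPipeline` combines heterogeneous graders under a scoring strategy. A sketch of the weighted aggregation that a `WeightedStrategy` implies, reusing the 0.2/0.1/0.7 weights from the SDK.md example above; the function and dict keys here are illustrative, not the SDK's actual strategy class:

```python
def weighted_verdict(scores: dict[str, float],
                     weights: dict[str, float],
                     threshold: float = 0.6) -> tuple[float, bool]:
    """Combine normalized [0.0, 1.0] grader scores into a single verdict."""
    total = sum(weights.values())
    aggregate = sum(scores[name] * weights[name] for name in scores) / total
    return aggregate, aggregate >= threshold

# Weights mirror the SDK.md pipeline: 0.2 latency, 0.1 cost, 0.7 LLM judge.
aggregate, passed = weighted_verdict(
    scores={"latency": 0.9, "cost": 1.0, "llm_correctness": 0.65},
    weights={"latency": 0.2, "cost": 0.1, "llm_correctness": 0.7},
)
# aggregate == 0.735, so passed is True at the default 0.6 threshold
```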

docs/hatteras_evaluation.md

Lines changed: 3 additions & 3 deletions
@@ -7,7 +7,7 @@ agent sessions into user-defined categories directly against traces stored in
 BigQuery, without relying on an external service.

 This should be implemented as a new categorical evaluation subsystem, not as
-an overload of the existing numeric `CodeEvaluator` / `LLMAsJudge` report
+an overload of the existing numeric `SystemEvaluator` / `LLMAsJudge` report
 path.

 The goal is to support Hatteras-like functionality inside the SDK:
@@ -22,7 +22,7 @@ The goal is to support Hatteras-like functionality inside the SDK:

 Today the SDK supports two major evaluation modes:

-- deterministic numeric scoring via `CodeEvaluator`
+- deterministic numeric scoring via `SystemEvaluator`
 - semantic numeric scoring via `LLMAsJudge`

 What is missing is a first-class way to answer questions like:
@@ -60,7 +60,7 @@ That capability is useful for:
 This design is not proposing:

 - a full clone of an external Hatteras service
-- a replacement for `CodeEvaluator`
+- a replacement for `SystemEvaluator`
 - a replacement for `LLMAsJudge`
 - a new remote function or Python UDF surface in the first phase
 - real-time ingestion-time classification in phase 1

docs/implementation_plan_concept_index_runtime.md

Lines changed: 1 addition & 1 deletion
@@ -165,7 +165,7 @@ Work: `bigquery_ontology/contrib/advertising/` stub with Yahoo's resolver (if co
 - `src/bigquery_ontology/graph_ddl_compiler.py` — add `compile_concept_index(ontology, binding, *, output_table) -> str`. Preserve `compile_graph()` contract byte-identically. No changes to existing function bodies.
 - `src/bigquery_ontology/cli.py:299` — `compile` command gains `--emit-concept-index` and `--concept-index-table` flags. When absent, behavior is byte-identical to today.
 - `src/bigquery_ontology/__init__.py` — add `from .graph_ddl_compiler import compile_concept_index` so the new public function is importable as `from bigquery_ontology import compile_concept_index`, matching the existing pattern for `compile_graph` (`__init__.py:50` today).
-- `src/bigquery_agent_analytics/__init__.py` — add the new public surface to the try/except re-export block (same pattern as `Client`, `CodeEvaluator`, etc.):
+- `src/bigquery_agent_analytics/__init__.py` — add the new public surface to the try/except re-export block (same pattern as `Client`, `SystemEvaluator`, etc.):
   - `OntologyRuntime` from `.ontology_runtime`
   - `EntityResolver`, `ExactMatchResolver`, `SynonymResolver`, `Candidate`, `ResolveResult` from `.entity_resolver`
   - `ConceptIndexMismatchError`, `ConceptIndexProvenanceMissing`, `ConceptIndexInconsistentPair`, `ConceptIndexRefreshed` from `.ontology_runtime`
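
The last bullet points at the package's try/except re-export block. A sketch of what the addition might look like, using only the names listed above; the guard shape is an assumption modeled on how `Client` and `SystemEvaluator` are said to be re-exported:

```python
# src/bigquery_agent_analytics/__init__.py (sketch)
try:
    from .ontology_runtime import (
        OntologyRuntime,
        ConceptIndexMismatchError,
        ConceptIndexProvenanceMissing,
        ConceptIndexInconsistentPair,
        ConceptIndexRefreshed,
    )
    from .entity_resolver import (
        Candidate,
        EntityResolver,
        ExactMatchResolver,
        ResolveResult,
        SynonymResolver,
    )
except ImportError:
    # Keep the package importable when the optional ontology surface
    # is unavailable, matching the existing re-export pattern.
    pass
```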

docs/implementation_plan_remote_function.md

Lines changed: 32 additions & 15 deletions
@@ -219,13 +219,30 @@ Dispatch logic:
 ```python
 # Map CLI --evaluator to SDK factory
 EVALUATOR_FACTORIES = {
-    "latency": lambda t: CodeEvaluator.latency(threshold_ms=t),
-    "error_rate": lambda t: CodeEvaluator.error_rate(max_error_rate=t),
-    "turn_count": lambda t: CodeEvaluator.turn_count(max_turns=int(t)),
-    "token_efficiency": lambda t: CodeEvaluator.token_efficiency(max_tokens=int(t)),
-    "ttft": lambda t: CodeEvaluator.ttft(threshold_ms=t),
-    "cost": lambda t: CodeEvaluator.cost_per_session(max_cost_usd=t),
-    "llm-judge": None,  # special handling
+    "latency": (
+        lambda t: SystemEvaluator.latency(threshold_ms=t),
+        lambda: SystemEvaluator.latency(),
+    ),
+    "error_rate": (
+        lambda t: SystemEvaluator.error_rate(max_error_rate=t),
+        lambda: SystemEvaluator.error_rate(),
+    ),
+    "turn_count": (
+        lambda t: SystemEvaluator.turn_count(max_turns=int(t)),
+        lambda: SystemEvaluator.turn_count(),
+    ),
+    "token_efficiency": (
+        lambda t: SystemEvaluator.token_efficiency(max_tokens=int(t)),
+        lambda: SystemEvaluator.token_efficiency(),
+    ),
+    "ttft": (
+        lambda t: SystemEvaluator.ttft(threshold_ms=t),
+        lambda: SystemEvaluator.ttft(),
+    ),
+    "cost": (
+        lambda t: SystemEvaluator.cost_per_session(max_cost_usd=t),
+        lambda: SystemEvaluator.cost_per_session(),
+    ),
 }
 ```

@@ -289,7 +306,7 @@ import functions_framework
 from flask import jsonify

 from bigquery_agent_analytics import Client, serialize
-from bigquery_agent_analytics import CodeEvaluator, LLMAsJudge
+from bigquery_agent_analytics import SystemEvaluator, LLMAsJudge
 from bigquery_agent_analytics import TraceFilter


@@ -385,18 +402,18 @@ def _dispatch(client, operation, params):


 def _build_evaluator(params):
-    """Build CodeEvaluator from params dict."""
+    """Build SystemEvaluator from params dict."""
     metric = params.get("metric", "latency")
     threshold = params.get("threshold", 5000)
     factories = {
-        "latency": lambda t: CodeEvaluator.latency(threshold_ms=t),
-        "error_rate": lambda t: CodeEvaluator.error_rate(max_error_rate=t),
-        "turn_count": lambda t: CodeEvaluator.turn_count(max_turns=int(t)),
-        "token_efficiency": lambda t: CodeEvaluator.token_efficiency(
+        "latency": lambda t: SystemEvaluator.latency(threshold_ms=t),
+        "error_rate": lambda t: SystemEvaluator.error_rate(max_error_rate=t),
+        "turn_count": lambda t: SystemEvaluator.turn_count(max_turns=int(t)),
+        "token_efficiency": lambda t: SystemEvaluator.token_efficiency(
            max_tokens=int(t)
        ),
-        "ttft": lambda t: CodeEvaluator.ttft(threshold_ms=t),
-        "cost": lambda t: CodeEvaluator.cost_per_session(max_cost_usd=t),
+        "ttft": lambda t: SystemEvaluator.ttft(threshold_ms=t),
+        "cost": lambda t: SystemEvaluator.cost_per_session(max_cost_usd=t),
     }
     factory = factories.get(metric)
     if not factory:
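
The reshaped `EVALUATOR_FACTORIES` pairs a thresholded factory with a zero-argument default for each preset, and the old `"llm-judge": None` sentinel no longer appears in the mapping. One way the CLI dispatch might consume the pair when `--threshold` is optional; `build_evaluator` is illustrative, not part of this diff:

```python
def build_evaluator(name: str, threshold: float | None):
    """Use the thresholded factory when a threshold is given, else the preset default."""
    with_threshold, default = EVALUATOR_FACTORIES[name]
    return with_threshold(threshold) if threshold is not None else default()

# e.g. `--evaluator latency --threshold 3000` vs a bare `--evaluator latency`
evaluator = build_evaluator("latency", 3000)
fallback = build_evaluator("latency", None)
```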

docs/prd_unified_analytics_interface.md

Lines changed: 3 additions & 3 deletions
@@ -109,7 +109,7 @@ All operations go through a single multiplexed function:
 | Operation | SDK Method | Params (JSON keys) | Output |
 |-----------|-----------|---------------------|--------|
 | `analyze` | `Client.get_session_trace()` + metrics | `session_id` | JSON with span count, error count, latency, tool calls |
-| `evaluate` | `CodeEvaluator` | `session_id`, `metric`, `threshold` | JSON with passed, score, details |
+| `evaluate` | `SystemEvaluator` | `session_id`, `metric`, `threshold` | JSON with passed, score, details |
 | `judge` | `LLMAsJudge` | `session_id`, `criterion` | JSON with score, feedback |
 | `insights` | Facet extraction | `session_id` | JSON with intent, outcome, friction |
 | `drift` | Drift detection | `golden_dataset`, `agent_filter`, `start_date`, `end_date` | JSON with coverage, gaps |
@@ -443,7 +443,7 @@ import functions_framework
 import json
 import os
 from flask import jsonify
-from bigquery_agent_analytics import Client, CodeEvaluator, LLMAsJudge, TraceFilter
+from bigquery_agent_analytics import Client, SystemEvaluator, LLMAsJudge, TraceFilter

 # Initialized once per cold start. Config comes from userDefinedContext
 # (forwarded by BigQuery) or environment variables as fallback.
@@ -490,7 +490,7 @@ def _dispatch(client, operation, params):
             "final_response": trace.final_response,
         }
     elif operation == "evaluate":
-        evaluator = CodeEvaluator.latency(threshold_ms=params["threshold"])
+        evaluator = SystemEvaluator.latency(threshold_ms=params["threshold"])
         report = client.evaluate(evaluator=evaluator,
                                  filters=TraceFilter(session_ids=[params["session_id"]]))
         return report.details[0] if report.details else {}
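
The `evaluate` branch above builds the latency preset regardless of the `metric` key that the operations table lists among the params. A sketch of how the branch could honor it, borrowing the factory-mapping style from docs/implementation_plan_remote_function.md; `_METRIC_FACTORIES` and the helper are illustrative:

```python
# Illustrative generalization of the `evaluate` branch; factory names
# mirror the SystemEvaluator presets shown elsewhere in this commit.
_METRIC_FACTORIES = {
    "latency": lambda t: SystemEvaluator.latency(threshold_ms=t),
    "error_rate": lambda t: SystemEvaluator.error_rate(max_error_rate=t),
    "turn_count": lambda t: SystemEvaluator.turn_count(max_turns=int(t)),
}

def _evaluator_from_params(params):
    """Map the table's `metric` + `threshold` params onto a preset factory."""
    factory = _METRIC_FACTORIES.get(params.get("metric", "latency"))
    return factory(params["threshold"])
```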

docs/python_udf_support_design.md

Lines changed: 4 additions & 4 deletions
@@ -172,7 +172,7 @@ primitive:

 | SDK area | Python UDF fit | Required redesign |
 |----------|----------------|-------------------|
-| `Client.evaluate(CodeEvaluator, filters)` | Partial | SQL builds per-session summaries first; UDF computes scores from summary fields |
+| `Client.evaluate(SystemEvaluator, filters)` | Partial | SQL builds per-session summaries first; UDF computes scores from summary fields |
 | `Client.deep_analysis()` / question distribution | Partial | SQL does grouping / embeddings / top-k; UDF can help with categorization or normalization |
 | `Client.drift_detection()` | Partial | SQL computes set logic; UDF may help with text normalization or thresholding |
 | `Client.insights()` | Partial | Best split into SQL extraction + optional UDF post-processing; not a direct port |
@@ -224,7 +224,7 @@ That is maintainable. Reusing the entire client inside a Python UDF is not.

 The current evaluator score math is not implemented as standalone top-level
 functions today. It lives inside factory-method closures such as
-`CodeEvaluator.latency()` and `CodeEvaluator.error_rate()` in
+`SystemEvaluator.latency()` and `SystemEvaluator.error_rate()` in
 [evaluators.py](/Users/haiyuancao/BigQuery-Agent-Analytics-SDK/src/bigquery_agent_analytics/evaluators.py).

 That means the first implementation step is a deliberate refactor:
@@ -281,7 +281,7 @@ the shared extraction helper.

 ### 7.2 Tier 2: code-evaluator score kernels

-These should map directly to the existing `CodeEvaluator` math:
+These should map directly to the existing `SystemEvaluator` math:

 ```sql
 CREATE FUNCTION `PROJECT.UDF_DATASET.bqaa_score_latency`(
@@ -497,7 +497,7 @@ Remote Function should still be described as:
 - Add `udf_kernels.py`
 - Move reusable evaluator math into standalone pure functions
 - Move reusable event semantic helpers into a UDF-safe layer
-- Add unit tests proving parity with existing `CodeEvaluator` behavior
+- Add unit tests proving parity with existing `SystemEvaluator` behavior

 ### Phase U2: Tier 1 and Tier 2 UDFs

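
Phase U1's "standalone pure functions" are the prerequisite for Tier 2 SQL kernels such as the `bqaa_score_latency` function started above. A sketch of one extracted kernel; the linear-decay math mirrors the latency lambda documented in SDK.md, while the module name, function name, and zero-threshold guard are assumptions:

```python
# udf_kernels.py (sketch)
def score_latency(avg_latency_ms: float, threshold_ms: float = 5000.0) -> float:
    """Return 1.0 at 0 ms, decaying linearly to 0.0 at or beyond the threshold."""
    if threshold_ms <= 0:  # assumed guard; not specified in the design doc
        return 0.0
    return 1.0 - min(avg_latency_ms / threshold_ms, 1.0)

# Parity checks of the kind Phase U1 calls for:
assert score_latency(0) == 1.0
assert score_latency(2500) == 0.5
assert score_latency(10_000) == 0.0
```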
