Unify One-Sided & Side-by-Side Performance Metrics in the PerformanceEvaluator, but don't add new metrics to the MultiTrialPerformance Evaluator yet
- Purged obsolete criteria-list LLMAsJudge implementations, replacing them with PerformanceEvaluator for the folded Tone, Faithfulness, Correctness, and Efficiency evaluations.
- Decoupled the system and performance modules so that system_evaluator.py contains only SystemEvaluator.
- Overrode the backwards-compatible LLMAsJudge subclass in evaluators.py with required static factories for correctness, hallucination, and sentiment.
- Purged criteria-list BQML execution code from client.py and deleted legacy _criteria and _JudgeCriterion list validations throughout the test suites.
- Fixed Jupyter event-loop context constraints via robust asyncio running event-loop setters inside Client._evaluate_performance.
- Refactored strip_markdown_fences in utils.py to drop any trailing prose that follows the closing backticks of a fenced block.
- Verified that all 1,997 collected unit tests pass.
TAG=agy
CONV=bf5607ce-a7fc-4a29-a7fb-c6074580e613
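
For readers unfamiliar with the strip_markdown_fences change described above, here is a minimal sketch of the behavior the commit message describes: keep the body of a fenced block and drop any prose around it. This is a hypothetical re-implementation for illustration only; the actual code in utils.py is not shown in this commit view, and the regex and the `FENCE` constant are assumptions.

```python
import re

# Hypothetical sketch; not the SDK's actual utils.py implementation.
FENCE = "`" * 3  # three backticks, built programmatically for readability

# Opening fence with an optional language tag, then the body (lazy match),
# then the closing fence. Anything after the closing fence is ignored.
_FENCE_RE = re.compile(
    re.escape(FENCE) + r"[^\n]*\n(.*?)" + re.escape(FENCE),
    re.DOTALL,
)


def strip_markdown_fences(text: str) -> str:
    """Return the body of the first fenced block, or the stripped text if none."""
    match = _FENCE_RE.search(text)
    if match is None:
        # No fenced block found: return the input with surrounding whitespace removed.
        return text.strip()
    return match.group(1).strip()
```

With this sketch, a model reply such as a JSON object wrapped in a fenced block and followed by "Hope this helps!" reduces to just the JSON text, which is what a downstream parser needs.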
SDK.md: 32 additions & 39 deletions
@@ -206,44 +206,37 @@ print(report.summary())
 
 ---
 
-## 4. LLM-as-Judge Evaluation (Semantic Metrics)
+## 4. PerformanceEvaluator (Semantic Metrics)
 
-`LLMAsJudge` uses an LLM to score agent responses against semantic criteria. Evaluations run either via BigQuery AI.GENERATE (zero-ETL) or the Gemini API.
+`PerformanceEvaluator` uses Gemini models to evaluate trace performance and agent responses against folded semantic criteria: Correctness, Sentiment, Faithfulness (Hallucination), and Efficiency.
 
-### Pre-Built Judges
+### Folded Factories (Backwards Compatible)
+
+The SDK provides pre-built factories for semantic criteria that map transparently to `PerformanceEvaluator` for drop-in backwards compatibility:
 
 ```python
-from bigquery_agent_analytics import LLMAsJudge
+from bigquery_agent_analytics import PerformanceEvaluator
 
 # Correctness: did the agent provide accurate, factual answers?

-`BigQueryTraceEvaluator` evaluates agent behavior against expected tool-call trajectories stored in BigQuery. It supports three matching modes and optional LLM-as-judge scoring.
+`PerformanceEvaluator` evaluates agent behavior against expected tool-call trajectories stored in BigQuery. It supports three matching modes and optional LLM-as-judge scoring.
 
 ### Match Types
 
@@ -348,10 +341,10 @@ The `details` dict on `EvaluationReport` holds operational metadata that is sepa
 ### Evaluate Against a Golden Trajectory
 
 ```python
-from bigquery_agent_analytics import BigQueryTraceEvaluator
+from bigquery_agent_analytics import PerformanceEvaluator
 from bigquery_agent_analytics.trace_evaluator import MatchType
 
-evaluator = BigQueryTraceEvaluator(
+evaluator = PerformanceEvaluator(
     project_id="my-project",
     dataset_id="agent_analytics",
     # Optional: filter which event types are fetched from BigQuery.
@@ -471,9 +464,9 @@ Agents are non-deterministic -- a single evaluation run is not statistically mea
 ### Run Multi-Trial Evaluation
 
 ```python
-from bigquery_agent_analytics import BigQueryTraceEvaluator, TrialRunner
+from bigquery_agent_analytics import PerformanceEvaluator, TrialRunner
Each field generates a separate `AND` condition with a corresponding `bigquery.ScalarQueryParameter` or `bigquery.ArrayQueryParameter`. This is the **only** dynamic SQL in the SDK — everything else uses static templates.
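
The per-field pattern described above can be sketched as follows. This is an illustrative sketch only, not the SDK's code: the function name `build_filter`, the field names `agent_id` and `session_ids`, and the column names are all hypothetical. To keep the example free of dependencies, each parameter is represented as a plain tuple; in real code these would presumably be `bigquery.ScalarQueryParameter(name, type_, value)` and `bigquery.ArrayQueryParameter(name, array_type, values)` from `google-cloud-bigquery`.

```python
from typing import List, Optional, Sequence, Tuple

# (kind, name, bigquery_type, value): a stand-in for
# bigquery.ScalarQueryParameter / bigquery.ArrayQueryParameter.
Param = Tuple[str, str, str, object]


def build_filter(
    agent_id: Optional[str] = None,            # hypothetical filter field
    session_ids: Optional[Sequence[str]] = None,  # hypothetical filter field
) -> Tuple[str, List[Param]]:
    """Build a WHERE clause and matching query parameters from the set fields."""
    conditions: List[str] = []
    params: List[Param] = []

    if agent_id is not None:
        # Scalar field: one AND condition, one scalar parameter.
        conditions.append("agent_id = @agent_id")
        params.append(("scalar", "agent_id", "STRING", agent_id))

    if session_ids:
        # List field: one AND condition, one array parameter.
        conditions.append("session_id IN UNNEST(@session_ids)")
        params.append(("array", "session_ids", "STRING", list(session_ids)))

    # With no filters set, fall back to a clause that matches every row.
    where = " AND ".join(conditions) if conditions else "TRUE"
    return where, params
```

Because user-supplied values only ever travel as query parameters and never as interpolated SQL text, the dynamic part of the query is limited to which fixed condition strings are joined together.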