Unify One-Sided & Side-by-Side Performance Metrics in the PerformanceEvaluator, but don't add new metrics to the MultiTrialPerformanceEvaluator yet
Removed the obsolete criteria-list `LLMAsJudge` implementations, replacing them with a native `PerformanceEvaluator` that folds Sentiment, Faithfulness, Correctness, and Efficiency into a single evaluation path.
- Decoupled the system and performance modules, so `system_evaluator.py` now contains only `SystemEvaluator`.
- Retained a backwards-compatible `LLMAsJudge` subclass in `evaluators.py` that provides the required static factories for correctness, hallucination, and sentiment.
- Removed the criteria-list BQML execution code from `client.py` and deleted the legacy `_criteria` and `_JudgeCriterion` list validations throughout the test suites.
- Fixed the Jupyter event-loop conflict by handling an already-running asyncio event loop inside `Client._evaluate_performance` (see the sketch after this list).
- Refactored `strip_markdown_fences` in `utils.py` to drop trailing prose after the closing backticks of a fenced block (also sketched below).
- All 1,997 collected unit tests pass.
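A minimal sketch of the event-loop handling pattern described above; the helper name `run_sync` and its exact placement are assumptions, not the actual code in `Client._evaluate_performance`:

```python
import asyncio
import concurrent.futures

def run_sync(coro):
    # Hypothetical helper illustrating the fix: detect whether an asyncio
    # event loop is already running (as it is inside Jupyter).
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No running loop (plain scripts, pytest): asyncio.run is safe.
        return asyncio.run(coro)
    # A loop is already running: asyncio.run() would raise RuntimeError here,
    # so execute the coroutine on a fresh loop in a worker thread.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()
```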
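And one possible shape of the `strip_markdown_fences` behavior; this regex-based implementation is an assumption that only mirrors the documented behavior:

```python
import re

def strip_markdown_fences(text: str) -> str:
    # Hypothetical implementation: keep only the body of the first fenced
    # block, dropping the fence lines and any prose after the closing
    # backticks.
    match = re.search(r"```(?:\w+)?\n(.*?)```", text, flags=re.DOTALL)
    return match.group(1).strip() if match else text.strip()
```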
````diff
-`LLMAsJudge` uses an LLM to score agent responses against semantic criteria. Evaluations run either via BigQuery AI.GENERATE (zero-ETL) or the Gemini API.
+`PerformanceEvaluator` uses deterministic methods and Gemini models to evaluate trace
+performance and agent responses against performance criteria: Correctness, Sentiment, Faithfulness (Hallucination), and Efficiency.
 
-### Pre-Built Judges
+### Native Performance Evaluations
 
-```python
-from bigquery_agent_analytics import LLMAsJudge
-
-# Correctness: did the agent provide accurate, factual answers?
-judge = LLMAsJudge.correctness(threshold=0.7)
-
-# Hallucination: does the response contain unsupported claims?
-judge = LLMAsJudge.hallucination(threshold=0.6)
-
-# Sentiment: was the interaction positive and helpful?
-judge = LLMAsJudge.sentiment(threshold=0.5)
-```
-
-### Custom Judge Criteria
-
-Define custom evaluation criteria with your own prompt templates:
+For holistic performance checks, construct `PerformanceEvaluator` directly to run correctness and efficiency evaluations:
````
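The hunk above ends at the lead-in sentence, so the replacement code example is not visible. A minimal sketch of direct construction, assuming the same `project_id`/`dataset_id` arguments as the golden-trajectory hunk further down:

```python
from bigquery_agent_analytics import PerformanceEvaluator

# Sketch only: constructor arguments are taken from the golden-trajectory
# hunk below; any criteria or threshold options are assumptions.
evaluator = PerformanceEvaluator(
    project_id="my-project",
    dataset_id="agent_analytics",
)
```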
````diff
-`BigQueryTraceEvaluator` evaluates agent behavior against expected tool-call trajectories stored in BigQuery. It supports three matching modes and optional LLM-as-judge scoring.
+`PerformanceEvaluator` evaluates agent behavior against expected tool-call trajectories stored in BigQuery. It supports three matching modes and optional LLM-as-judge scoring.
 
 ### Match Types
````
````diff
@@ -348,10 +325,10 @@ The `details` dict on `EvaluationReport` holds operational metadata that is sepa
 ### Evaluate Against a Golden Trajectory
 
 ```python
-from bigquery_agent_analytics import BigQueryTraceEvaluator
+from bigquery_agent_analytics import PerformanceEvaluator
 from bigquery_agent_analytics.trace_evaluator import MatchType
 
-evaluator = BigQueryTraceEvaluator(
+evaluator = PerformanceEvaluator(
     project_id="my-project",
     dataset_id="agent_analytics",
     # Optional: filter which event types are fetched from BigQuery.
````
````diff
@@ -471,9 +448,9 @@ Agents are non-deterministic -- a single evaluation run is not statistically mea
 ### Run Multi-Trial Evaluation
 
 ```python
-from bigquery_agent_analytics import BigQueryTraceEvaluator, TrialRunner
+from bigquery_agent_analytics import PerformanceEvaluator, TrialRunner
````
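This hunk is truncated after the import line. A hypothetical sketch of the multi-trial flow; `TrialRunner`'s constructor, the `num_trials` parameter, and `run()` are all assumptions, as only the import appears in the diff:

```python
from bigquery_agent_analytics import PerformanceEvaluator, TrialRunner

# Hypothetical usage: every name below except the import is an assumption.
evaluator = PerformanceEvaluator(
    project_id="my-project",
    dataset_id="agent_analytics",
)
runner = TrialRunner(evaluator=evaluator, num_trials=10)  # assumed signature
report = runner.run()  # assumed method; aggregates results across trials
print(report)
```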
Each field generates a separate `AND` condition with a corresponding `bigquery.ScalarQueryParameter` or `bigquery.ArrayQueryParameter`. This is the **only** dynamic SQL in the SDK — everything else uses static templates.
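To illustrate the pattern: a sketch of how each populated filter field could yield one `AND` clause plus one matching query parameter. The field names `agent_id` and `event_types` are examples, not the SDK's actual filter fields; the `bigquery` parameter classes are the real google-cloud-bigquery API.

```python
from google.cloud import bigquery

def build_filter(agent_id=None, event_types=None):
    # Each populated field contributes one AND condition and one parameter.
    conditions, params = [], []
    if agent_id is not None:
        conditions.append("agent_id = @agent_id")
        params.append(bigquery.ScalarQueryParameter("agent_id", "STRING", agent_id))
    if event_types:
        conditions.append("event_type IN UNNEST(@event_types)")
        params.append(bigquery.ArrayQueryParameter("event_types", "STRING", event_types))
    return " AND ".join(conditions), params
```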