Commit 23391df

Unify one-sided & side-by-side performance metrics in the PerformanceEvaluator, but don't add new metrics to the MultiTrialPerformanceEvaluator yet
Purged the obsolete criteria-list LLMAsJudge implementations, replacing them with the PerformanceEvaluator, which natively folds in Tone, Faithfulness, Correctness, and Efficiency evaluations.

- Decoupled the system and performance modules cleanly, leaving system_evaluator.py purely for the SystemEvaluator.
- Reworked the backwards-compatible LLMAsJudge subclass in evaluators.py to provide the required static factories for correctness, hallucination, and sentiment.
- Removed the criteria-list BQML execution code from client.py, and deleted the legacy _criteria and _JudgeCriterion list validations throughout the test suites.
- Fixed Jupyter event-loop constraints by robustly handling an already-running asyncio event loop inside Client._evaluate_performance.
- Refactored strip_markdown_fences in utils.py to cleanly drop trailing prose after the closing backticks of a fenced markdown block.
- All 1,997 collected unit tests pass.
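The Jupyter event-loop fix mentioned above typically comes down to detecting whether an asyncio loop is already running before calling `asyncio.run()`. A minimal sketch of that pattern (the helper name `run_coro` is illustrative; this is not the repository's actual implementation):

```python
import asyncio
import concurrent.futures

def run_coro(coro):
    """Run a coroutine to completion whether or not a loop is running.

    In a plain script, asyncio.run() works directly. Inside Jupyter,
    a loop already runs in the main thread, so asyncio.run() raises
    RuntimeError; in that case, run the coroutine on a fresh loop in
    a worker thread and block on the result.
    """
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No running loop (script / CLI): asyncio.run is safe.
        return asyncio.run(coro)
    # A loop is already running (e.g. Jupyter): use a separate thread.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()
```

This keeps the same call site working in both notebook and batch contexts without requiring `nest_asyncio`.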
1 parent 3468e9c commit 23391df

28 files changed

Lines changed: 2403 additions & 6056 deletions

README.md

Lines changed: 7 additions & 8 deletions
@@ -25,10 +25,9 @@ regressions — all through BigQuery SQL or Python.
 - Observability dashboards (SQL and BigFrames)
 
 **Evaluation**
-- Code-based metrics (latency, turn count, error rate, token efficiency, cost)
-- LLM-as-Judge scoring (correctness, hallucination, sentiment)
-- Trajectory matching (exact, in-order, any-order)
-- Multi-trial evaluation with pass@k / pass^k
+- System metrics (latency, turn count, tool call error rate, token efficiency, time to first token, cost)
+- Performance metrics (correctness, hallucination, sentiment, efficiency, etc.)
+- Multi-trial system and performance metrics
 - Grader composition (weighted, binary, majority strategies)
 - Eval suite lifecycle management with graduation and saturation detection
 - Static quality validation (ambiguous tasks, class imbalance, suspicious thresholds)
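The removed README bullet references pass@k / pass^k multi-trial metrics. For context, the standard definitions can be sketched as follows (illustrative code, not from this repository; `pass_at_k` is the unbiased estimator popularized by the Codex paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n trials with c passes, passes."""
    if n - c < k:
        return 1.0  # fewer failures than samples: a pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_pow_k(p: float, k: int) -> float:
    """pass^k: probability that all k independent trials pass,
    given a per-trial pass rate p."""
    return p ** k
```

pass@k rewards agents that succeed at least once in k attempts, while pass^k demands consistency across all k attempts, so the two bracket an agent's reliability.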
@@ -123,10 +122,10 @@ src/bigquery_agent_analytics/
 │   └── formatter.py                         # Output formatting (json/text/table)
 
 ├── Evaluation
-│   ├── evaluators.py                        # SystemEvaluator + LLMAsJudge
-│   ├── trace_evaluator.py                   # Trajectory matching & replay
-│   ├── multi_trial.py                       # Multi-trial runner + pass@k
-│   ├── grader_pipeline.py                   # Grader composition pipeline
+│   ├── system_evaluator.py                  # SystemEvaluator
+│   ├── performance_evaluator.py             # PerformanceEvaluator
+│   ├── multi_trial_performance_evaluator.py # MultiTrialPerformanceEvaluator
+│   ├── aggregate_grader.py                  # AggregateGrader
 │   ├── eval_suite.py                        # Eval suite lifecycle management
 │   └── eval_validator.py                    # Static validation checks
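The commit message also describes refactoring `strip_markdown_fences` to drop trailing prose after a fenced block's closing backticks. A hedged sketch of that behavior (the regex approach is an assumption; the repository's utils.py may implement it differently):

```python
import re

def strip_markdown_fences(text: str) -> str:
    """Return the body of the first fenced code block in text,
    dropping the fences, any language tag, and any prose after the
    closing backticks. If no fenced block is found, return the text
    unchanged. Sketch only; not the repository's actual code.
    """
    match = re.search(r"```[^\n]*\n(.*?)```", text, flags=re.DOTALL)
    return match.group(1).rstrip("\n") if match else text
```

This matters for LLM-as-judge outputs, where models often append commentary like "Hope that helps!" after the JSON payload they were asked to emit.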
