15 changes: 7 additions & 8 deletions README.md
@@ -25,10 +25,9 @@ regressions — all through BigQuery SQL or Python.
- Observability dashboards (SQL and BigFrames)

**Evaluation**
-- Code-based metrics (latency, turn count, error rate, token efficiency, cost)
-- LLM-as-Judge scoring (correctness, hallucination, sentiment)
-- Trajectory matching (exact, in-order, any-order)
-- Multi-trial evaluation with pass@k / pass^k
+- System metrics (latency, turn count, tool call error rate, token efficiency, time to first token, cost)
+- Performance metrics (correctness, hallucination, sentiment, efficiency, etc.)
+- Multi-trial system and performance metrics
- Grader composition (weighted, binary, majority strategies)
- Eval suite lifecycle management with graduation and saturation detection
- Static quality validation (ambiguous tasks, class imbalance, suspicious thresholds)
@@ -123,10 +122,10 @@ src/bigquery_agent_analytics/
│ └── formatter.py # Output formatting (json/text/table)
├── Evaluation
│ ├── evaluators.py # CodeEvaluator + LLMAsJudge
-│ ├── trace_evaluator.py # Trajectory matching & replay
-│ ├── multi_trial.py # Multi-trial runner + pass@k
-│ ├── grader_pipeline.py # Grader composition pipeline
+│ ├── system_evaluator.py # SystemEvaluator
+│ ├── performance_evaluator.py # PerformanceEvaluator
+│ ├── multi_trial_performance_evaluator.py # MultiTrialPerformanceEvaluator
+│ ├── aggregate_grader.py # AggregateGrader
│ ├── eval_suite.py # Eval suite lifecycle management
│ └── eval_validator.py # Static validation checks