
Commit 806dba8 (2 parents: cd0940a + a95ad44)

Merge pull request #54 from evekhm/feat/quality-report

Add quality evaluation report script

6 files changed: 1672 additions & 0 deletions

.gitignore (3 additions, 0 deletions)

```diff
@@ -13,6 +13,9 @@ venv/
 env/
 uv.lock
 
+# Script outputs
+scripts/reports/
+
 # Local workspace metadata
 .code*/
 deploy/streaming_evaluation/.streaming_evaluation_state.json
```

scripts/README.md (new file, 95 additions)
# Scripts

Standalone scripts for the BigQuery Agent Analytics SDK.

## Quality Report

Runs LLM-as-a-judge evaluation over agent sessions stored in BigQuery and produces a quality report with per-agent breakdown, unhelpful session analysis, and category distributions.

### Prerequisites

- Python 3.11+
- BigQuery Agent Analytics SDK installed (`pip install bigquery-agent-analytics`)
- GCP authentication configured (`gcloud auth application-default login`)
- Agent traces already stored in a BigQuery table

### Environment Variables

Create a `.env` file in the repo root or export these variables:

| Variable | Required | Description |
|----------|----------|-------------|
| `PROJECT_ID` | Yes | GCP project containing the traces table |
| `DATASET_ID` | Yes | BigQuery dataset name |
| `TABLE_ID` | Yes | BigQuery table name (e.g. `agent_events`) |
| `DATASET_LOCATION` | Yes | BigQuery dataset location (e.g. `us-central1`) |
| `EVAL_MODEL_ID` | No | Model for evaluation (default: `gemini-2.5-flash`) |
| `GOOGLE_CLOUD_PROJECT` | No | GCP project for Vertex AI (defaults to `PROJECT_ID`) |
| `GOOGLE_CLOUD_LOCATION` | No | Vertex AI location (default: `global`) |

Example `.env`:

```bash
PROJECT_ID=my-gcp-project
DATASET_ID=agent_logs
TABLE_ID=agent_events
DATASET_LOCATION=us-central1
EVAL_MODEL_ID=gemini-2.5-flash
```

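The fallback rules in the table above (required variables, the `EVAL_MODEL_ID` default, and `GOOGLE_CLOUD_PROJECT` falling back to `PROJECT_ID`) can be sketched in Python. `load_config` is a hypothetical helper for illustration, not part of the SDK, and the real script may load its configuration differently (e.g. via `python-dotenv`):

```python
import os

# Hypothetical helper mirroring the variable table above; the actual
# quality_report.py may resolve its configuration differently.
REQUIRED = ("PROJECT_ID", "DATASET_ID", "TABLE_ID", "DATASET_LOCATION")


def load_config(env=os.environ):
    """Collect the documented settings, applying the documented defaults."""
    missing = [k for k in REQUIRED if not env.get(k)]
    if missing:
        raise RuntimeError(f"Missing required variables: {', '.join(missing)}")
    cfg = {k: env[k] for k in REQUIRED}
    cfg["EVAL_MODEL_ID"] = env.get("EVAL_MODEL_ID", "gemini-2.5-flash")
    cfg["GOOGLE_CLOUD_PROJECT"] = env.get("GOOGLE_CLOUD_PROJECT", cfg["PROJECT_ID"])
    cfg["GOOGLE_CLOUD_LOCATION"] = env.get("GOOGLE_CLOUD_LOCATION", "global")
    return cfg
```
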
### Usage

```bash
# From the repo root:
./scripts/quality_report.sh                        # evaluate last 100 sessions
./scripts/quality_report.sh --limit 500            # evaluate last 500 sessions
./scripts/quality_report.sh --time-period 7d       # evaluate last 7 days
./scripts/quality_report.sh --report               # also generate markdown report
./scripts/quality_report.sh --no-eval              # browse Q&A only (no evaluation)
./scripts/quality_report.sh --persist              # persist results to BigQuery
./scripts/quality_report.sh --model gemini-2.5-pro # use a specific model
./scripts/quality_report.sh --samples 20           # show 20 sessions per category
./scripts/quality_report.sh --samples all          # show all sessions per category
```

Or run the Python script directly:

```bash
python scripts/quality_report.py --limit 50 --report
```

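The flags above could be wired up with `argparse` along these lines. This is a sketch only: apart from `--limit 100` and the `gemini-2.5-flash` default documented above, the defaults here are guesses, and the real parser in `quality_report.py` may differ:

```python
import argparse

# Hypothetical sketch of the flag handling shown in the usage listing;
# the actual argument parser in quality_report.py is not part of this diff.
def build_parser():
    p = argparse.ArgumentParser(description="LLM-as-a-judge quality report")
    p.add_argument("--limit", type=int, default=100,
                   help="number of recent sessions to evaluate")
    p.add_argument("--time-period",
                   help="evaluate a time window such as 7d instead of a limit")
    p.add_argument("--report", action="store_true",
                   help="also write a markdown report to scripts/reports/")
    p.add_argument("--no-eval", action="store_true",
                   help="browse Q&A only, skipping evaluation")
    p.add_argument("--persist", action="store_true",
                   help="persist results to BigQuery")
    p.add_argument("--model", default="gemini-2.5-flash",
                   help="evaluation model ID")
    p.add_argument("--samples", default="10",
                   help="sessions to show per category, or 'all' (default is a guess)")
    return p
```
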
### Output

**Console output** includes:
- Per-session details grouped by category (unhelpful, partial, meaningful)
- Per-agent quality table with helpful/unhelpful rates and status indicators
- Unhelpful contribution ranking
- Category distributions
- Execution details (elapsed time, execution mode)

**Markdown report** (`--report` flag) is saved to `scripts/reports/` and includes all the above in a structured markdown format suitable for sharing or archiving.

**Log files** are saved to `scripts/reports/` for each eval run.

### Metrics

The evaluation uses two categorical metrics:

- **response_usefulness** - Whether the agent's response provides a genuinely useful answer. Categories: `meaningful`, `unhelpful`, `partial`.
- **task_grounding** - Whether the response is grounded in tool-retrieved data or fabricated. Categories: `grounded`, `ungrounded`, `no_tool_needed`.

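As a rough sketch of how per-session `response_usefulness` labels might roll up into the per-agent helpful/unhelpful rates described under Output, assuming a simple list-of-dicts session shape (the SDK's actual data structures are not shown in this diff):

```python
from collections import Counter

# Illustrative aggregation of judge labels into per-agent rates; the
# session dict shape and field names here are assumptions.
def agent_quality(sessions):
    """Return {agent: {"helpful_rate": float, "unhelpful_rate": float}}."""
    by_agent = {}
    for s in sessions:
        by_agent.setdefault(s["agent"], Counter())[s["response_usefulness"]] += 1
    out = {}
    for agent, counts in by_agent.items():
        total = sum(counts.values())
        out[agent] = {
            "helpful_rate": counts["meaningful"] / total,
            "unhelpful_rate": counts["unhelpful"] / total,
        }
    return out
```
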
### A2A Support

The script automatically detects and resolves responses from remote A2A (Agent-to-Agent) agents by extracting `A2A_INTERACTION` events from traces.

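Conceptually, that extraction is a filter over a session's trace events; the event-record shape below is an assumption for illustration, not the SDK's actual schema:

```python
# Illustrative filter for pulling A2A_INTERACTION events out of a
# session's trace; field names here are assumed, not the real schema.
def a2a_responses(events):
    """Collect remote-agent responses from A2A_INTERACTION trace events."""
    return [
        e.get("payload", {}).get("response")
        for e in events
        if e.get("event_type") == "A2A_INTERACTION"
    ]
```
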
### Sample Report Output

[Sample report output](sample_report.md)
