feat: implement three-pillar scorecard rubric templates (#63 Phase 1) #91

Gayathri0105RK wants to merge 1 commit into GoogleCloudPlatform:main
Conversation
Add initial Phase 1 rubric templates
Review against issue #63 + current upstream
High — definition strings diverge from the canonical rubric

| Field | Canonical (`scripts/quality_report.py`) | This PR |
|---|---|---|
| `response_usefulness.definition` | "Whether the agent's final response provides a genuinely useful, substantive answer to the user's question. A response that apologizes, says it cannot help, returns no data, provides only generic filler, or loops without resolving the question is NOT useful." (~50 words) | "Evaluate if the response was meaningful, partial, or unhelpful." (10 words) |
| `unhelpful.definition` | "The response technically succeeded (no error) but does NOT meaningfully answer the user's question. Examples: apologies, 'I don't have that information', empty data results, generic filler text, or the agent looping without a resolution." | "Did not help the user." |
| `task_grounding.definition` | "Whether the agent's response is grounded in actual data retrieved from its tools, or is fabricated / hallucinated general knowledge." | "Check if the agent used tools correctly and avoided hallucinations." |
| `ungrounded.definition` | "The response appears to be fabricated or based on the LLM's general knowledge rather than actual tool results. The tool may have returned empty data and the agent filled in anyway." | "Contains hallucinations." |
Why this matters. CategoricalMetricDefinition.definition and each CategoricalMetricCategory.definition get inlined into the LLM judge prompt at evaluation time (see categorical_evaluator.py SQL templates). Different definition text → different judge behavior → different category distributions on the same trace data. The persisted categorical_results rows from this rubric would be classified differently than rows persisted from quality_report.get_eval_metrics(), even though the category labels match. So a downstream dashboard grouping by (metric_name, category) would mix two populations under the same labels — exactly the divergence the "vocabulary reuse" promise was meant to prevent.
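To make the mechanism concrete, here is a simplified, hypothetical sketch of how a judge prompt could be assembled from these fields. The real template lives in `categorical_evaluator.py`'s SQL, so both the function and its exact wording are illustrative assumptions, not the library's API:

```python
def build_judge_prompt(metric) -> str:
    # Hypothetical assembly: the real prompt is built inside categorical_evaluator.py's
    # SQL templates. The point is that metric.definition and every category.definition
    # land verbatim in the text the judge model sees.
    lines = [
        f"Metric: {metric.name}",
        f"Definition: {metric.definition}",
        "Classify the trace into exactly one of these categories:",
    ]
    for category in metric.categories:
        lines.append(f"- {category.name}: {category.definition}")
    return "\n".join(lines)

# Same category labels + different definition text => different prompts,
# different judge behavior, different category distributions over the same traces.
```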
Fix. Establish a single source of truth. Two acceptable approaches:
- (Recommended) Move the canonical `response_usefulness` + `task_grounding` definitions from `scripts/quality_report.py:155-228` into `src/bigquery_agent_analytics/evaluation_rubrics.py`, then have `quality_report.get_eval_metrics()` import them. The new module becomes the canonical home; `quality_report.py` is a thin caller.
- Have `evaluation_rubrics.response_usefulness_metric()` import from `scripts.quality_report`. Reverses the natural dependency direction (`src/` ↔ `scripts/`) and is fragile if `scripts/` becomes optional.
Either way, the rubric factories must return objects byte-equivalent to quality_report.get_eval_metrics(). A test should pin this.
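A minimal sketch of the recommended shape, assuming pydantic v2 (`model_copy`) and the field names already referenced in this review (`name`, `definition`, `categories`); the elided strings stand for the full canonical text, not retyped summaries:

```python
# src/bigquery_agent_analytics/evaluation_rubrics.py -- the canonical home
from bigquery_agent_analytics.categorical_evaluator import (
    CategoricalMetricCategory,
    CategoricalMetricDefinition,
)

_RESPONSE_USEFULNESS = CategoricalMetricDefinition(
    name="response_usefulness",
    definition=(
        "Whether the agent's final response provides a genuinely useful, "
        "substantive answer to the user's question. ..."  # full ~50-word canonical text
    ),
    categories=[
        CategoricalMetricCategory(name="unhelpful", definition="..."),  # full canonical text
        # ... the remaining canonical categories, verbatim
    ],
)


def response_usefulness_metric() -> CategoricalMetricDefinition:
    """Canonical Helpfulness pillar; this module is the source, not a mirror."""
    # Deep copy so callers cannot mutate the shared canonical object.
    return _RESPONSE_USEFULNESS.model_copy(deep=True)


# scripts/quality_report.py -- becomes a thin caller
# from bigquery_agent_analytics.evaluation_rubrics import (
#     response_usefulness_metric,
#     task_grounding_metric,   # defined the same way as above
# )
#
# def get_eval_metrics():
#     return [response_usefulness_metric(), task_grounding_metric(), ...]
```

Returning a deep copy is a small design choice to keep callers from mutating the shared canonical object; returning the module-level instance directly is also fine if the definitions are treated as frozen.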
Medium — No tests
The whole point of these factories is to be a stable, reusable contract. The PR adds zero tests. At minimum:
```python
def test_response_usefulness_matches_quality_report_canonical():
    from scripts.quality_report import get_eval_metrics
    from bigquery_agent_analytics.evaluation_rubrics import response_usefulness_metric

    canonical = next(m for m in get_eval_metrics() if m.name == "response_usefulness")
    rubric = response_usefulness_metric()
    assert rubric.model_dump() == canonical.model_dump()


def test_task_grounding_matches_quality_report_canonical():
    ...  # same pattern


def test_policy_compliance_categories():
    from bigquery_agent_analytics.evaluation_rubrics import policy_compliance_metric

    m = policy_compliance_metric()
    assert {c.name for c in m.categories} == {"compliant", "violation"}
    assert all(c.definition for c in m.categories)
```

Without these, the High above could regress invisibly.
Medium — policy_compliance definition mixes three distinct mechanisms
The policy_compliance metric's definition field reads:
"Check for PII leakage, tone, and authorized tool usage."
These are three independent policy mechanisms with different cost/latency/false-positive profiles:
- PII leakage — typically a regex/classifier over the response text.
- Tone — LLM-judged on the response.
- Authorized tool usage — deterministic check over the tool-call trajectory (better suited to `CodeEvaluator`, not categorical).
Asking the judge to render one of compliant / violation for any of three concerns will produce noisy classifications and unactionable triage signals. Worth narrowing to one concern for v1 (suggest PII via AI.GENERATE) or splitting into three metrics with separate names. Issue #63's review thread flagged this exact scope-creep risk; the original recommendation was to ship Phase 1 as just PII compliance.
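If the metric is split, the tool-authorization concern does not need an LLM judge at all. A sketch of the deterministic check, where the allow-list and trace shape are illustrative assumptions and the actual CodeEvaluator wiring is out of scope:

```python
# Illustrative allow-list; the real one would come from the agent's tool registry.
AUTHORIZED_TOOLS = {"execute_sql", "get_table_schema", "list_datasets"}


def check_tool_authorization(tool_calls: list[dict]) -> str:
    """Classify a trace's tool-call trajectory without an LLM judge.

    `tool_calls` is assumed to be a list of {"name": ..., "args": ...} dicts
    extracted from the persisted trace events.
    """
    unauthorized = [c["name"] for c in tool_calls if c["name"] not in AUTHORIZED_TOOLS]
    return "violation" if unauthorized else "compliant"
```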
Low — Construction style + import hygiene
- Categories are constructed as dict literals (`{"name": "...", "definition": "..."}`). Pydantic coerces these into `CategoricalMetricCategory` instances, so it works, but the rest of the codebase uses explicit `CategoricalMetricCategory(name=..., definition=...)` (see `scripts/quality_report.py:155-228`). Match the existing style; see the sketch after this list.
- `CategoricalMetricCategory` is not imported. Add it: `from bigquery_agent_analytics.categorical_evaluator import (CategoricalMetricCategory, CategoricalMetricDefinition)`.
- File ends without a final newline (`\ No newline at end of file` in the diff). Minor lint.
- Per-factory docstrings are one-liners ("Existing SDK pillar for Helpfulness."). Worth expanding to name the canonical source they mirror, when to use them, and the dashboard view they're compatible with.
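For concreteness, the style change the first bullet asks for (category text abbreviated; the import path is taken from the bullet above):

```python
from bigquery_agent_analytics.categorical_evaluator import CategoricalMetricCategory

# Current: dict literals, valid only because pydantic coerces them on model construction.
categories_as_dicts = [
    {"name": "compliant", "definition": "..."},
    {"name": "violation", "definition": "..."},
]

# Preferred: explicit instances, matching scripts/quality_report.py:155-228.
categories_explicit = [
    CategoricalMetricCategory(name="compliant", definition="..."),
    CategoricalMetricCategory(name="violation", definition="..."),
]
```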
Low — No package-root re-export, no documented import path
For users to consume these factories, they need an import path. Two acceptable answers:
- Re-export from `bigquery_agent_analytics/__init__.py`:

  ```python
  from .evaluation_rubrics import (
      response_usefulness_metric,
      task_grounding_metric,
      policy_compliance_metric,
  )
  ```

- Document the explicit module path in `scripts/README.md` or a new `docs/quality_scorecard.md`.
Without one of these, the factories exist but no one knows how to call them. Issue #63's per-PR plan suggested the explicit-module-path approach with a doc page; either works.
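Whichever option lands, the consumer-facing call is short; a usage sketch assuming the package-root re-export (otherwise import from `bigquery_agent_analytics.evaluation_rubrics`):

```python
from bigquery_agent_analytics import response_usefulness_metric

metric = response_usefulness_metric()
print(metric.name, [c.name for c in metric.categories])
```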
Out of scope (tracked, not blocking)
Per the per-PR plan from issue #63, these belong to follow-up PRs:
- PR 2 — schema migration to add `root_agent_name` (+ `region` if pinned) to `categorical_results`. The PR description correctly calls this out as deferred.
- PR 3 — `categorical_fleet_leaderboard` view on top of #2.
- PR 4 — `triage_low_score_sessions()` (better named `triage_categorical_sessions`) — the design note for that PR should answer the four open questions: queue schema, idempotency on re-scoring, lifecycle, and HITL event relationship.
- Likert→3-bucket delta documented in `docs/quality_scorecard.md` (or wherever the canonical user-facing doc lands).
Verdict
Right scope, right strategy, right module location. The substantive blocker is the High finding: the definition strings have to match the canonical ones byte-for-byte (or be the canonical source themselves) for the dashboard-compatibility claim to hold. With the High addressed plus at least one byte-equivalence test, this is a clean Phase 1 merge.
Fresh pass against PR #91 + current upstream

Net: right direction, but please avoid shipping the same labels with different judge definitions. That creates the exact dashboard/reporting divergence this Phase 1 rubric PR is meant to prevent.
Closes #63
This PR implements Phase 1 of the Quality Scorecard roadmap.
Key Technical Alignments:
- Vocabulary Reuse: Implements `response_usefulness` and `task_grounding` using existing category names to ensure dashboard compatibility.
- Net-New GRC: Adds the `policy_compliance` metric with `compliant` / `violation` categories.
- Strategy: Adopts Option A (3-bucket categorical) instead of 1-5 scoring.
- Schema-Free: This PR does not modify any BigQuery tables or add new columns like `root_agent_name`.