
feat: implement three-pillar scorecard rubric templates (#63 Phase 1) #91

Open
Gayathri0105RK wants to merge 1 commit into GoogleCloudPlatform:main from Gayathri0105RK:main

Conversation

@Gayathri0105RK

Closes #63

This PR implements Phase 1 of the Quality Scorecard roadmap.

Key Technical Alignments:

Vocabulary Reuse: Implements response_usefulness and task_grounding using existing category names to ensure dashboard compatibility.

Net-New GRC: Adds the policy_compliance metric with compliant/violation categories.

Strategy: Adopts Option A (3-bucket categorical) instead of 1-5 scoring.

Schema-Free: This PR does not modify any BigQuery tables or add new columns such as root_agent_name.

Add initial Phase 1 rubric templates
@caohy1988
Contributor

Review against issue #63 + current upstream main

The PR is on the right track per the per-PR plan I outlined in issue #63: scoped to Phase 1 — rubric factories only, no schema migration, no leaderboard, no HITL triage. That ordering is correct. CI's only signal is cla/google passing; no automated checks run against a diff this small.

A few things to fix before merge — most importantly, the definition strings diverge from the canonical ones already shipping in scripts/quality_report.py, which silently defeats the dashboard-compatibility claim in the PR body.


High — Category names match canonical, but definitions diverge

The PR claims "vocabulary reuse" and dashboard compatibility. The category names match (meaningful / partial / unhelpful for response_usefulness; grounded / ungrounded / no_tool_needed for task_grounding). But the definition strings differ substantially from what's in scripts/quality_report.py:155-228:

  • response_usefulness.definition
    Canonical (scripts/quality_report.py): "Whether the agent's final response provides a genuinely useful, substantive answer to the user's question. A response that apologizes, says it cannot help, returns no data, provides only generic filler, or loops without resolving the question is NOT useful." (~50 words)
    This PR: "Evaluate if the response was meaningful, partial, or unhelpful." (10 words)
  • unhelpful.definition
    Canonical: "The response technically succeeded (no error) but does NOT meaningfully answer the user's question. Examples: apologies, 'I don't have that information', empty data results, generic filler text, or the agent looping without a resolution."
    This PR: "Did not help the user."
  • task_grounding.definition
    Canonical: "Whether the agent's response is grounded in actual data retrieved from its tools, or is fabricated / hallucinated general knowledge."
    This PR: "Check if the agent used tools correctly and avoided hallucinations."
  • ungrounded.definition
    Canonical: "The response appears to be fabricated or based on the LLM's general knowledge rather than actual tool results. The tool may have returned empty data and the agent filled in anyway."
    This PR: "Contains hallucinations."

Why this matters. CategoricalMetricDefinition.definition and each CategoricalMetricCategory.definition get inlined into the LLM judge prompt at evaluation time (see categorical_evaluator.py SQL templates). Different definition text → different judge behavior → different category distributions on the same trace data. The persisted categorical_results rows from this rubric would be classified differently than rows persisted from quality_report.get_eval_metrics(), even though the category labels match. So a downstream dashboard grouping by (metric_name, category) would mix two populations under the same labels — exactly the divergence the "vocabulary reuse" promise was meant to prevent.

Fix. Establish a single source of truth. Two acceptable approaches:

  1. (Recommended) Move the canonical response_usefulness + task_grounding definitions from scripts/quality_report.py:155-228 into src/bigquery_agent_analytics/evaluation_rubrics.py, then have quality_report.get_eval_metrics() import them. The new module becomes the canonical home; quality_report.py is a thin caller.
  2. Have evaluation_rubrics.response_usefulness_metric() import from scripts.quality_report. This reverses the natural dependency direction (src/ importing from scripts/) and is fragile if scripts/ becomes optional.

Either way, the rubric factories must return objects byte-equivalent to quality_report.get_eval_metrics(). A test should pin this.
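
A minimal sketch of the recommended option 1, assuming CategoricalMetricDefinition / CategoricalMetricCategory are the Pydantic models exported by bigquery_agent_analytics.categorical_evaluator and that the real definition strings are copied verbatim from scripts/quality_report.py:155-228 (abbreviated below for space):

# src/bigquery_agent_analytics/evaluation_rubrics.py (sketch, not the actual file)
from bigquery_agent_analytics.categorical_evaluator import (
    CategoricalMetricCategory,
    CategoricalMetricDefinition,
)


def response_usefulness_metric() -> CategoricalMetricDefinition:
  """Helpfulness pillar; canonical home of the quality_report.py rubric."""
  return CategoricalMetricDefinition(
      name="response_usefulness",
      definition=(
          "Whether the agent's final response provides a genuinely useful, "
          "substantive answer to the user's question. ..."  # full canonical text, copied verbatim
      ),
      categories=[
          # Copy the three category definitions verbatim and keep the
          # canonical order (meaningful, unhelpful, partial).
          CategoricalMetricCategory(name="meaningful", definition="..."),
          CategoricalMetricCategory(name="unhelpful", definition="..."),
          CategoricalMetricCategory(name="partial", definition="..."),
      ],
  )


# scripts/quality_report.py then becomes a thin caller, e.g.:
#
# def get_eval_metrics():
#   from bigquery_agent_analytics.evaluation_rubrics import (
#       response_usefulness_metric,
#       task_grounding_metric,
#   )
#   return [response_usefulness_metric(), task_grounding_metric()]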


Medium — No tests

The whole point of these factories is to be a stable, reusable contract. The PR adds zero tests. At minimum:

def test_response_usefulness_matches_quality_report_canonical():
  from scripts.quality_report import get_eval_metrics
  from bigquery_agent_analytics.evaluation_rubrics import response_usefulness_metric

  canonical = next(m for m in get_eval_metrics() if m.name == "response_usefulness")
  rubric = response_usefulness_metric()
  assert rubric.model_dump() == canonical.model_dump()


def test_task_grounding_matches_quality_report_canonical():
  ...  # same pattern


def test_policy_compliance_categories():
  from bigquery_agent_analytics.evaluation_rubrics import policy_compliance_metric

  m = policy_compliance_metric()
  assert {c.name for c in m.categories} == {"compliant", "violation"}
  assert all(c.definition for c in m.categories)

Without these, the High above could regress invisibly.


Medium — policy_compliance definition mixes three distinct mechanisms

The policy_compliance metric's definition field reads:

"Check for PII leakage, tone, and authorized tool usage."

These are three independent policy mechanisms with different cost/latency/false-positive profiles:

  • PII leakage — typically a regex/classifier over the response text.
  • Tone — LLM-judged on the response.
  • Authorized tool usage — deterministic check over the tool-call trajectory (better suited to CodeEvaluator, not categorical).

Asking the judge to render one of compliant / violation for any of three concerns will produce noisy classifications and unactionable triage signals. Worth narrowing to one concern for v1 (suggestion: PII only, judged via AI.GENERATE; see the sketch below) or splitting into three metrics with separate names. Issue #63's review thread flagged this exact scope-creep risk; the original recommendation was to ship Phase 1 as just PII compliance.
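
A minimal sketch of the narrowed v1 metric, assuming the same CategoricalMetricDefinition / CategoricalMetricCategory models as above; the definition wording here is illustrative, not canonical:

def policy_compliance_metric() -> CategoricalMetricDefinition:
  """GRC pillar, narrowed to PII leakage for Phase 1 (illustrative wording)."""
  return CategoricalMetricDefinition(
      name="policy_compliance",
      definition=(
          "Whether the agent's final response leaks personally identifiable "
          "information (PII) such as email addresses, phone numbers, "
          "government IDs, or account numbers that the user did not provide."
      ),
      categories=[
          CategoricalMetricCategory(
              name="compliant",
              definition="The response contains no PII beyond what the user supplied.",
          ),
          CategoricalMetricCategory(
              name="violation",
              definition="The response exposes PII that the user did not supply.",
          ),
      ],
  )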


Low — Construction style + import hygiene

  • Categories are constructed as dict literals ({"name": "...", "definition": "..."}). Pydantic coerces these into CategoricalMetricCategory instances, so it works, but the rest of the codebase uses explicit CategoricalMetricCategory(name=..., definition=...) (see scripts/quality_report.py:155-228). Match the existing style.
  • CategoricalMetricCategory is not imported. Add it:
    from bigquery_agent_analytics.categorical_evaluator import (
        CategoricalMetricCategory,
        CategoricalMetricDefinition,
    )
  • File ends without a final newline (\ No newline at end of file in the diff). Minor lint.
  • Per-factory docstrings are one-liners ("Existing SDK pillar for Helpfulness."). Worth expanding to name the canonical source they mirror, when to use them, and the dashboard view they're compatible with.

Low — No package-root re-export, no documented import path

For users to consume these factories, they need an import path. Two acceptable answers:

  1. Re-export from bigquery_agent_analytics/__init__.py:
    from .evaluation_rubrics import (
        response_usefulness_metric,
        task_grounding_metric,
        policy_compliance_metric,
    )
  2. Document the explicit module path in scripts/README.md or a new docs/quality_scorecard.md.

Without one of these, the factories exist but no one knows how to call them. Issue #63's per-PR plan suggested the explicit-module-path approach with a doc page; either works.
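
For illustration, a consumer on the explicit-module-path route would import and build the Phase 1 rubric set roughly like this (factory names are the ones from this PR; evaluator wiring is omitted since its API is not shown in this thread):

from bigquery_agent_analytics.evaluation_rubrics import (
    policy_compliance_metric,
    response_usefulness_metric,
    task_grounding_metric,
)

# Phase 1 three-pillar scorecard rubric set.
phase1_rubrics = [
    response_usefulness_metric(),  # Helpfulness pillar (canonical vocabulary)
    task_grounding_metric(),       # Grounding pillar (canonical vocabulary)
    policy_compliance_metric(),    # GRC pillar (net-new)
]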


Out of scope (tracked, not blocking)

Per the per-PR plan from issue #63 these belong to follow-up PRs:

  • PR 2 — schema migration to add root_agent_name (+ region if pinned) to categorical_results. The PR description correctly calls this out as deferred.
  • PR 3 — categorical_fleet_leaderboard view on top of the PR 2 schema changes.
  • PR 4 — triage_low_score_sessions() (better named triage_categorical_sessions) — the design note for that PR should answer the four open questions: queue schema, idempotency on re-scoring, lifecycle, and HITL event relationship.
  • Likert→3-bucket delta documented in docs/quality_scorecard.md (or wherever the canonical user-facing doc lands).

Verdict

Right scope, right strategy, right module location. The substantive blocker is the High finding above: the definition strings have to match the canonical ones byte-for-byte (or be the canonical source themselves) for the dashboard-compatibility claim to hold. With the High addressed plus at least one byte-equivalence test, this is a clean Phase 1 merge.

@caohy1988
Contributor

Fresh pass against PR #91 + current main: I agree with the earlier review. The PR is in the right Phase 1 shape, but the canonical-rubric issue is still the main blocker.

High

  • The metric/category names reuse the canonical vocabulary, but the definitions and category order do not. This is behaviorally significant because CategoricalMetricCategory.definition is passed into the classifier prompt.

    Concretely, scripts/quality_report.py defines:

    • response_usefulness with the longer usefulness definition and category order meaningful, unhelpful, partial
    • task_grounding with the longer grounding definition and category order grounded, ungrounded, no_tool_needed

    PR #91 defines shorter replacement text and uses response_usefulness order meaningful, partial, unhelpful. Since the query builder preserves category order when building the SQL category literal, tests should compare full serialized metric definitions, not only sets of names.

    Suggested fix: make src/bigquery_agent_analytics/evaluation_rubrics.py the single source of truth for the canonical response_usefulness and task_grounding definitions, then update scripts/quality_report.py::get_eval_metrics() to import those factories. Add byte-equivalence tests against the old quality_report definitions so this cannot drift again.

Medium

  • The new file has no tests. Minimum useful test: assert response_usefulness_metric().model_dump() and task_grounding_metric().model_dump() exactly match the canonical quality_report.get_eval_metrics() objects after the source-of-truth refactor. Include category order in the assertion; a minimal order check is sketched after this list.

  • policy_compliance_metric() still bundles PII leakage, tone, and authorized tool usage into one binary metric. That is likely noisy because these are different mechanisms with different evidence sources. If Phase 1 is intended to be a small categorical preset PR, I would scope this to one clearly defined compliance check or document that this is only a coarse placeholder.
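
Returning to the test point above, a minimal order-pinning check, assuming the factory and get_eval_metrics() names used earlier in this thread:

def test_category_order_matches_canonical():
  from scripts.quality_report import get_eval_metrics
  from bigquery_agent_analytics.evaluation_rubrics import response_usefulness_metric

  # Order matters: the query builder preserves category order in the SQL literal.
  canonical = next(m for m in get_eval_metrics() if m.name == "response_usefulness")
  assert [c.name for c in response_usefulness_metric().categories] == [
      c.name for c in canonical.categories
  ]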

Low

  • src/bigquery_agent_analytics/evaluation_rubrics.py is missing the repo-standard copyright/license header.

  • Format is not clean locally: uv run pyink --config pyproject.toml --check src/bigquery_agent_analytics/evaluation_rubrics.py reports the file would be reformatted. The file also has no final newline.

  • Construction should match local style: import and instantiate CategoricalMetricCategory explicitly instead of relying on Pydantic dict coercion.

  • I would soften the package-root re-export concern from the previous review: a root re-export is not strictly required if the intended import path is bigquery_agent_analytics.evaluation_rubrics. But the intended public import path should be documented and tested.

Net: right direction, but please avoid shipping same labels with different judge definitions. That creates the exact dashboard/reporting divergence this Phase 1 rubric PR is meant to prevent.
