Hatteras-Style Categorical Evaluation Design

Summary

Add a new categorical evaluation surface to the SDK so users can classify agent sessions into user-defined categories directly against traces stored in BigQuery, without relying on an external service.

This should be implemented as a new categorical evaluation subsystem, not as an overload of the existing numeric SystemEvaluator / LLMAsJudge report path.

The goal is to support Hatteras-like functionality inside the SDK:

user-defined categorical metrics
BigQuery-native batch execution when possible
real-time or near-real-time categorical evaluation as the end-state
Gemini API fallback when BigQuery-native execution is unavailable
optional persistence of classification results back to BigQuery

Why This Is Worth Adding

Today the SDK supports two major evaluation modes:

deterministic numeric scoring via SystemEvaluator
semantic numeric scoring via LLMAsJudge

What is missing is a first-class way to answer questions like:

What kind of issue is this conversation about?
Was the user frustrated, satisfied, or neutral?
Did the agent behavior fall into one of several reviewed quality buckets?
Which business-defined category best describes the session outcome?

That capability is useful for:

privacy-reviewed or business-reviewed categorical taxonomies
operational dashboards and routing
continuous quality monitoring
post-hoc labeling for analytics, evaluation, and experimentation

Design Principles

Keep the current Client.evaluate() contract focused on numeric evaluation.
Add categorical evaluation as a parallel API surface, not a hidden extension of numeric reports.
Prefer one model call per session that returns all requested metric classifications.
Use structured output and strict category validation.
Support both:
- BigQuery-native execution via AI.GENERATE
- Gemini API fallback for environments where BigQuery-native execution is unavailable or undesired
Make persistence optional in phase 1, but mandatory in the final real-time architecture.

Non-Goals

This design is not proposing:

a full clone of an external Hatteras service
a replacement for SystemEvaluator
a replacement for LLMAsJudge
a new remote function or Python UDF surface in the first phase
real-time ingestion-time classification in phase 1

Phase 1 should be batch or near-real-time evaluation over existing traces in BigQuery.

Proposed SDK Surface

New configuration models

from typing import Optional

from pydantic import BaseModel
from pydantic import Field


class CategoricalMetricCategory(BaseModel):
    name: str = Field(description="Category label.")
    definition: str = Field(description="What this category means.")


class CategoricalMetricDefinition(BaseModel):
    name: str = Field(description="Metric name.")
    definition: str = Field(description="What this metric measures.")
    categories: list[CategoricalMetricCategory] = Field(
        description="Allowed categories for this metric.",
    )
    required: bool = Field(
        default=True,
        description="Whether this metric must be classified.",
    )


class CategoricalEvaluationConfig(BaseModel):
    metrics: list[CategoricalMetricDefinition] = Field(
        description="Metrics to evaluate.",
    )
    endpoint: str = Field(
        default="gemini-2.5-flash",
        description="Model endpoint for classification.",
    )
    temperature: float = Field(
        default=0.0,
        description="Sampling temperature.",
    )
    persist_results: bool = Field(
        default=False,
        description="Write results to BigQuery.",
    )
    results_table: Optional[str] = Field(
        default=None,
        description="Destination table for results.",
    )
    include_justification: bool = Field(
        default=True,
        description="Include justification in output.",
    )
    prompt_version: Optional[str] = Field(
        default=None,
        description="Tracks prompt version for reproducibility.",
    )

New result models

from datetime import datetime
from datetime import timezone
from typing import Any, Optional

from pydantic import BaseModel
from pydantic import Field


class CategoricalMetricResult(BaseModel):
    metric_name: str
    category: Optional[str] = None
    passed_validation: bool = True
    justification: Optional[str] = None
    raw_response: Optional[str] = None
    parse_error: bool = False


class CategoricalSessionResult(BaseModel):
    session_id: str
    metrics: list[CategoricalMetricResult] = Field(default_factory=list)
    details: dict[str, Any] = Field(default_factory=dict)


class CategoricalEvaluationReport(BaseModel):
    dataset: str
    evaluator_name: str = "categorical_evaluator"
    total_sessions: int = 0
    category_distributions: dict[str, dict[str, int]] = Field(
        default_factory=dict,
        description="Maps metric_name → {category → count}.",
    )
    details: dict[str, Any] = Field(default_factory=dict)
    session_results: list[CategoricalSessionResult] = Field(default_factory=list)
    created_at: datetime = Field(
        default_factory=lambda: datetime.now(timezone.utc),
    )

New client method

report = client.evaluate_categorical(
    config=config,
    filters=TraceFilter.from_cli_args(last="24h"),
    dataset="agent_events",
)

This should be a separate method, not an overload of Client.evaluate(), because the current numeric report shape (aggregate_scores, SessionScore.scores) is not a clean fit for label-valued results.

Execution Model

Preferred path: BigQuery-native batch classification

Use the same design direction already used elsewhere in the SDK:

SQL builds one transcript per session
AI.GENERATE performs the classification
structured output is returned in typed columns
the SDK validates categories against the configured allowlist

The key design choice is:

one model call per session should return all metric classifications

instead of:

one model call per metric per session

because the latter is slower, more expensive, and harder to reason about operationally.

Example output shape

Conceptually, the model should return a structure like:

results ARRAY<STRUCT<
  metric_name STRING,
  category STRING,
  justification STRING
>>

If AI.GENERATE cannot directly express the final nested schema cleanly, phase 1 can use a constrained string/JSON output plus strict SDK-side validation. But the target should be structured output, not substring parsing.

Concrete `output_schema` decision

Phase 1 uses a single STRING column containing a JSON array of metric results rather than attempting a fully typed ARRAY<STRUCT<...>> output_schema:

output_schema => 'classifications STRING'

This is a conservative design choice, not a proven platform limitation. AI.GENERATE's output_schema supports typed arrays (the SDK's insights.py already uses ARRAY<STRING> columns like goal_categories and friction_types), but the multi-metric classification payload — an array of structs with heterogeneous fields — has not been validated against output_schema yet. Starting with a JSON STRING envelope keeps phase 1 unblocked while that validation happens.

The SDK parses the JSON string and validates each entry's category value against the configured allowlist.

Endpoint precedence

When config.endpoint is set, it overrides Client.endpoint. This matches the existing SDK pattern where evaluator-level config takes precedence over client defaults.

Prompting and Validation

Each metric definition supplies:

metric name
metric definition
allowed categories
category definitions

The prompt should instruct the model to:

classify the full session for every configured metric
choose exactly one category from the allowed set for each metric
optionally return a short justification
return no extra categories or free-form labels

The SDK should then:

normalize category names
validate that each category is in the configured allowlist
mark invalid or missing outputs with parse_error=True
store raw output in raw_response for debugging

The implementation should explicitly avoid substring matching like if cat in response, because that is too weak for production categorical evaluation.

BigQuery-Native SQL Shape

The first implementation should follow the same transcript-building pattern already used by LLMAsJudge and insights:

Filter sessions with TraceFilter
Build one ordered transcript per session_id
Call AI.GENERATE once per session
Read typed output columns or parse a constrained JSON/string envelope
Convert rows into CategoricalSessionResult

This keeps the execution model aligned with the existing SDK design instead of introducing a separate trace-by-trace Python-only path.

Concrete SQL template

WITH session_transcripts AS (
  SELECT
    session_id,
    STRING_AGG(
      CONCAT(
        event_type,
        COALESCE(CONCAT(' [', agent, ']'), ''),
        ': ',
        COALESCE(
          JSON_VALUE(content, '$.text_summary'),
          JSON_VALUE(content, '$.response'),
          JSON_VALUE(content, '$.tool'),
          ''
        )
      ),
      '\n' ORDER BY timestamp
    ) AS transcript
  FROM `{project}.{dataset}.{table}`
  WHERE {where}
  GROUP BY session_id
  HAVING LENGTH(transcript) > 10
  LIMIT @trace_limit
)
SELECT
  session_id,
  transcript,
  (AI.GENERATE(
    CONCAT(
      @categorical_prompt,
      '\n\nTranscript:\n', transcript
    ),
    endpoint => '{endpoint}',
    model_params => JSON '{"generationConfig": {"temperature": 0.0, "maxOutputTokens": 1024}}',
    output_schema => 'classifications STRING'
  )).classifications AS classifications
FROM session_transcripts

AI.GENERATE is a scalar function: the first positional argument is the prompt, and it returns a STRUCT containing the output_schema fields plus full_response and status. The model_params JSON must conform to the Gemini GenerateContent request body format.

Real-Time Evaluation and Dashboard End-State

Real-time evaluation is an important product goal for this design. The final system should support not only offline or scheduled batch classification, but also a near-real-time pipeline that produces continuously updating categorical metrics for dashboards and alerting.

The key point is that phase 1 does not need to deliver true streaming classification, but the architecture should be shaped so the later real-time path does not require a redesign of the metric definitions, result schema, or validation logic.

End-state goal

After the full implementation is complete, the SDK should support:

categorical evaluation of recent sessions with low operational delay
persisted session-level categorical results in BigQuery
dashboard-friendly tables or views for category trends, distributions, and error counts
alerting or monitoring built on top of persisted categorical outputs

Recommended architecture

The real-time path should be built as a pipeline with clear separation of responsibilities:

agent events land in the canonical agent_events table
a transcript-building step produces session-ready or recent-session-ready inputs
a categorical evaluation step classifies those sessions using the same metric definitions as the batch path
validated results are written to a durable BigQuery results table
dashboard queries or views read from the persisted results table instead of calling the model live on every dashboard refresh

This matters because dashboards should read from stable persisted outputs, not invoke LLM evaluation inline. Inline evaluation would be slower, more expensive, and harder to audit.

Real-time delivery options

The implementation can support multiple execution modes over time:

scheduled micro-batch evaluation over the last N minutes of sessions
event-driven evaluation triggered after session completion
hybrid mode where batch backfills and real-time classification share the same parsing and persistence code

The recommended first real-time milestone is scheduled micro-batch execution, because it is operationally simpler than true per-event streaming and still good enough for dashboards that refresh every few minutes.

Dashboard support

Dashboard support should be treated as a first-class outcome of this design, not a side effect.

The persisted results model should support:

category distribution over time
breakdown by agent, environment, experiment, or user segment
top failure or escalation categories
recent-session monitoring panels
drill-down from aggregate counts to example sessions

To support that, the results table should be paired with stable views or example queries such as:

daily category counts by metric
rolling 1-hour category counts
per-agent category distribution
parse error rate and model fallback rate

Required invariants for the real-time path

To make batch and real-time coexist cleanly, both paths should share:

the same metric-definition schema
the same prompt-building logic
the same category validation rules
the same persisted results schema
the same execution metadata fields, especially execution_mode, prompt_version, endpoint, and created_at

If those invariants hold, dashboards and downstream consumers do not need to care whether a classification came from batch execution or a real-time path.

Suggested final-phase deliverables

The final state of this work should include:

real-time or near-real-time categorical evaluation pipeline
durable BigQuery table for categorical evaluation results
dashboard-oriented example SQL or views
documentation for recommended dashboard patterns
operational metrics for fallback rate, parse error rate, and evaluation latency

This is the right end-state because the value of categorical evaluation is not just producing labels. The value is making those labels continuously usable for monitoring, reporting, and decision-making.

Fallback Path

When BigQuery-native execution is unavailable or fails, fall back to Gemini API execution.

The fallback path should:

reuse the same metric definitions
reuse the same strict validation logic
record execution mode in report.details
preserve as much result shape parity as possible

Example detail fields:

execution_mode: "ai_generate" | "api_fallback"
endpoint
parse_errors
parse_error_rate
persisted

Persistence Design

Persistence should be optional but designed in phase 1.

Relationship to Existing SDK Components

The implementation lives in a new module: src/bigquery_agent_analytics/categorical_evaluator.py. It does not belong in evaluators.py, which is focused on numeric scoring contracts.

This proposal should reuse and align with current SDK patterns:

TraceFilter for session selection
Client.get_session_trace() transcript semantics
BigQuery-native AI.GENERATE execution patterns already used in insights and other modules
Gemini API fallback patterns already used by LLMAsJudge

This proposal should not try to force categorical labels into:

SessionScore.scores: dict[str, float]
EvaluationReport.aggregate_scores: dict[str, float]

Those numeric contracts should remain numeric.

Suggested Implementation Phases

Phase 1: Models + BigQuery-native execution (working backend)

Add Pydantic config, metric, and result models in categorical_evaluator.py
Add Client.evaluate_categorical(...) entry point
Build session-transcript SQL and AI.GENERATE query template
Return one classification payload per session for all metrics
Add strict validation and parsing helpers (category allowlist check, JSON envelope parsing)
Add unit tests for metric definition validation and result parsing
Add integration tests for typed row parsing and invalid-category handling

Phase 2: Gemini API fallback

Add API execution path
Preserve report shape parity with the BigQuery-native path
Add tests for fallback behavior and partial failures

Phase 3: Persistence

Add result table DDL/template
Add persist_results=True flow
Add export/query helpers if needed

Phase 4: CLI exposure

Optional CLI wrapper after the Python API stabilizes
Keep this out of the initial implementation scope

Phase 5: Real-time evaluation and dashboard support

Add scheduled micro-batch or event-driven execution for recent sessions
Reuse the same metric definitions and validation logic as the batch path
Persist real-time classifications into the same results schema
Add dashboard-oriented views or example SQL
Document recommended dashboard and alerting patterns

Open Questions

~~Can AI.GENERATE express the desired multi-metric structured output directly, or should phase 1 use a constrained JSON/string envelope?~~ Decided: phase 1 uses a JSON STRING envelope (output_schema => 'classifications STRING'). This is a conservative choice while ARRAY<STRUCT<...>> support in output_schema is validated; see the Concrete output_schema decision section.
~~Should one session produce exactly one result row, or should results be flattened to one row per (session_id, metric_name) during parsing/persistence?~~ Decided: flatten to one row per (session_id, metric_name) for persistence. This matches the results table schema shown in the Persistence Design section.
Should justifications be stored by default, or optional because of privacy or storage concerns?
Should the first release support only session-level classification, or also last-turn-only classification?
~~Should categorical evaluation live in evaluators.py, or in a new module such as categorical_evaluators.py to avoid conflating numeric and label-based contracts?~~ Decided: new module categorical_evaluator.py. See the Relationship to Existing SDK Components section.

Acceptance Criteria

Users can define one or more categorical metrics with explicit allowed categories.
Users can run categorical evaluation over filtered sessions in BigQuery.
The SDK validates model output against the configured category set.
Invalid or missing outputs are surfaced explicitly as parse errors.
Results are returned in a dedicated categorical report type.
BigQuery-native execution is the preferred path.
Gemini API fallback works when BigQuery-native execution is unavailable.
Optional persistence path is clearly defined.

Why This Approach Fits the SDK

The important design choices are:

categorical evaluation is a new API surface, not a hidden extension of numeric evaluation
execution is batch-first, not one LLM call per metric per trace
parsing is strict and schema-driven, not substring-based
result models are label-oriented, not forced into numeric score containers

That makes the feature much more likely to integrate cleanly with the current SDK and to scale operationally.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Hatteras-Style Categorical Evaluation Design

Summary

Why This Is Worth Adding

Design Principles

Non-Goals

Proposed SDK Surface

New configuration models

New result models

New client method

Execution Model

Preferred path: BigQuery-native batch classification

Example output shape

Concrete `output_schema` decision

Endpoint precedence

Prompting and Validation

BigQuery-Native SQL Shape

Concrete SQL template

Real-Time Evaluation and Dashboard End-State

End-state goal

Recommended architecture

Real-time delivery options

Dashboard support

Required invariants for the real-time path

Suggested final-phase deliverables

Fallback Path

Persistence Design

Suggested results table schema

Relationship to Existing SDK Components

Suggested Implementation Phases

Phase 1: Models + BigQuery-native execution (working backend)

Phase 2: Gemini API fallback

Phase 3: Persistence

Phase 4: CLI exposure

Phase 5: Real-time evaluation and dashboard support

Open Questions

Acceptance Criteria

Why This Approach Fits the SDK

FilesExpand file tree

hatteras_evaluation.md

Latest commit

History

hatteras_evaluation.md

File metadata and controls

Hatteras-Style Categorical Evaluation Design

Summary

Why This Is Worth Adding

Design Principles

Non-Goals

Proposed SDK Surface

New configuration models

New result models

New client method

Execution Model

Preferred path: BigQuery-native batch classification

Example output shape

Concrete output_schema decision

Endpoint precedence

Prompting and Validation

BigQuery-Native SQL Shape

Concrete SQL template

Real-Time Evaluation and Dashboard End-State

End-state goal

Recommended architecture

Real-time delivery options

Dashboard support

Required invariants for the real-time path

Suggested final-phase deliverables

Fallback Path

Persistence Design

Suggested results table schema

Relationship to Existing SDK Components

Suggested Implementation Phases

Phase 1: Models + BigQuery-native execution (working backend)

Phase 2: Gemini API fallback

Phase 3: Persistence

Phase 4: CLI exposure

Phase 5: Real-time evaluation and dashboard support

Open Questions

Acceptance Criteria

Why This Approach Fits the SDK

Concrete `output_schema` decision