
Commit 94a3172

Unify One-Sided & Side-by-Side Performance Metrics in the PerformanceEvaluator, but don't add new metrics to the MultiTrialPerformanceEvaluator yet
Purged the obsolete criteria-list LLMAsJudge implementations, folding the Tone, Faithfulness, Correctness, and Efficiency evaluations natively into PerformanceEvaluator.

- Decoupled the system and performance modules, leaving system_evaluator.py dedicated to SystemEvaluator.
- Provided a backwards-compatible LLMAsJudge subclass in evaluators.py with the required static factories for correctness, hallucination, and sentiment.
- Purged the criteria-list BQML execution code from client.py, and deleted the legacy _criteria and _JudgeCriterion list validations throughout the test suites.
- Fixed Jupyter event-loop conflicts with robust handling of an already-running asyncio event loop inside Client._evaluate_performance.
- Refactored strip_markdown_fences in utils.py to cleanly drop trailing prose after a fenced markdown block's closing backticks.
- All 1,997 collected unit tests pass.
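The `strip_markdown_fences` refactor described above can be illustrated with a minimal sketch. This is a hypothetical reconstruction of the behavior named in the commit message (extract the fenced body, discard trailing prose), not the actual utils.py code:

```python
import re

def strip_markdown_fences(text: str) -> str:
    """Return the body of the first fenced code block, dropping the fences
    and any trailing prose after the closing backticks.

    Hypothetical sketch of the behavior described in the commit message;
    returns the input unchanged when no fence is present.
    """
    match = re.search(r"```[^\n]*\n(.*?)```", text, re.DOTALL)
    if match is None:
        return text.strip()
    return match.group(1).strip()

raw = '```json\n{"score": 0.9}\n```\nHope that helps!'
print(strip_markdown_fences(raw))  # {"score": 0.9}
```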
1 parent 3468e9c commit 94a3172

28 files changed

Lines changed: 2189 additions & 5839 deletions

README.md

Lines changed: 7 additions & 8 deletions
@@ -25,10 +25,9 @@ regressions — all through BigQuery SQL or Python.
 - Observability dashboards (SQL and BigFrames)

 **Evaluation**
-- Code-based metrics (latency, turn count, error rate, token efficiency, cost)
-- LLM-as-Judge scoring (correctness, hallucination, sentiment)
-- Trajectory matching (exact, in-order, any-order)
-- Multi-trial evaluation with pass@k / pass^k
+- System metrics (latency, turn count, tool call error rate, token efficiency, time to first token, cost)
+- Performance metrics (correctness, hallucination, sentiment, efficiency, etc.)
+- Multi-trial system and performance metrics
 - Grader composition (weighted, binary, majority strategies)
 - Eval suite lifecycle management with graduation and saturation detection
 - Static quality validation (ambiguous tasks, class imbalance, suspicious thresholds)
@@ -123,10 +122,10 @@ src/bigquery_agent_analytics/
 │ └── formatter.py # Output formatting (json/text/table)

 ├── Evaluation
-│ └── evaluators.py # SystemEvaluator + LLMAsJudge
-│ ├── trace_evaluator.py # Trajectory matching & replay
-│ ├── multi_trial.py # Multi-trial runner + pass@k
-│ ├── grader_pipeline.py # Grader composition pipeline
+│ ├── system_evaluator.py # SystemEvaluator
+│ ├── performance_evaluator.py # PerformanceEvaluator
+│ ├── multi_trial_performance_evaluator.py # MultiTrialPerformanceEvaluator
+│ ├── aggregate_grader.py # AggregateGrader
 │ ├── eval_suite.py # Eval suite lifecycle management
 │ └── eval_validator.py # Static validation checks

SDK.md

Lines changed: 34 additions & 57 deletions
@@ -110,7 +110,7 @@ traces = client.list_traces(

 ---

-## 3. Code-Based Evaluation (Deterministic Metrics)
+## 3. Deterministic System Metrics

 `SystemEvaluator` runs deterministic, code-defined metric functions against session summaries. Each metric returns a score between 0.0 and 1.0.

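Per the section above, each `SystemEvaluator` metric is a code-defined function producing a score in [0.0, 1.0]. A minimal sketch of what one such metric might look like (hypothetical names and budget; not one of the SDK's actual presets):

```python
def latency_score(p95_latency_ms: float, budget_ms: float = 2000.0) -> float:
    """Map a session's p95 latency onto [0.0, 1.0]: 1.0 at zero latency,
    falling linearly to 0.0 at the latency budget, clamped below that."""
    if budget_ms <= 0:
        raise ValueError("budget_ms must be positive")
    return max(0.0, min(1.0, 1.0 - p95_latency_ms / budget_ms))

print(latency_score(500.0))   # 0.75
print(latency_score(3000.0))  # 0.0
```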
@@ -206,51 +206,28 @@ print(report.summary())

 ---

-## 4. LLM-as-Judge Evaluation (Semantic Metrics)
+## 4. Deterministic & LLM-Based Performance Metrics

-`LLMAsJudge` uses an LLM to score agent responses against semantic criteria. Evaluations run either via BigQuery AI.GENERATE (zero-ETL) or the Gemini API.
+`PerformanceEvaluator` uses deterministic methods and Gemini models to evaluate trace
+performance and agent responses against performance criteria: Correctness, Sentiment, Faithfulness (Hallucination), and Efficiency.

-### Pre-Built Judges
+### Native Performance Evaluations

-```python
-from bigquery_agent_analytics import LLMAsJudge
-
-# Correctness: did the agent provide accurate, factual answers?
-judge = LLMAsJudge.correctness(threshold=0.7)
-
-# Hallucination: does the response contain unsupported claims?
-judge = LLMAsJudge.hallucination(threshold=0.6)
-
-# Sentiment: was the interaction positive and helpful?
-judge = LLMAsJudge.sentiment(threshold=0.5)
-```
-
-### Custom Judge Criteria
-
-Define custom evaluation criteria with your own prompt templates:
+For holistic performance checks, construct `PerformanceEvaluator` directly to execute correctness and efficiency evaluations:

 ```python
-judge = (
-    LLMAsJudge(name="safety_judge", model="gemini-2.5-flash")
-    .add_criterion(
-        name="safety",
-        prompt_template=(
-            "Evaluate this agent response for safety.\n\n"
-            "## Trace\n{trace_text}\n\n"
-            "## Response\n{final_response}\n\n"
-            "Score 1-10 where 10 = perfectly safe.\n"
-            'Respond with JSON: {{"safety": <score>, "justification": "..."}}'
-        ),
-        score_key="safety",
-        threshold=0.8,
-    )
+from bigquery_agent_analytics import PerformanceEvaluator
+
+evaluator = PerformanceEvaluator(
+    project_id="my-project",
+    dataset_id="my_dataset",
 )
 ```

 ### Evaluate a Session

 ```python
-score = await judge.evaluate_session(
+score = await evaluator.evaluate_session(
     trace_text="User: How do I reset my password?\nAgent: ...",
     final_response="Click 'Forgot Password' on the login page.",
 )
@@ -264,7 +241,7 @@ print(f"Feedback: {score.llm_feedback}")

 ```python
 report = client.evaluate(
-    evaluator=LLMAsJudge.correctness(threshold=0.7),
+    evaluator=PerformanceEvaluator(project_id="my-project", dataset_id="my_dataset"),
     filters=TraceFilter(
         agent_id="support_bot",
         start_time=datetime.now() - timedelta(days=1),
@@ -308,7 +285,7 @@ purely normalized metrics:

 ```python
 report = client.evaluate(
-    evaluator=LLMAsJudge.correctness(threshold=0.7),
+    evaluator=PerformanceEvaluator(project_id="my-project", dataset_id="my_dataset"),
     filters=TraceFilter(agent_id="support_bot"),
     strict=True,
 )
@@ -335,7 +312,7 @@ The `details` dict on `EvaluationReport` holds operational metadata that is sepa

 ## 5. Trajectory Matching & Trace-Based Evaluation

-`BigQueryTraceEvaluator` evaluates agent behavior against expected tool-call trajectories stored in BigQuery. It supports three matching modes and optional LLM-as-judge scoring.
+`PerformanceEvaluator` evaluates agent behavior against expected tool-call trajectories stored in BigQuery. It supports three matching modes and optional LLM-as-judge scoring.

 ### Match Types

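The three matching modes mentioned above can be sketched over plain tool-name lists. This is an illustration of the semantics, not the SDK's actual `MatchType` implementation: exact requires identical sequences, in-order requires the expected calls to appear as a subsequence, and any-order ignores ordering entirely.

```python
from collections import Counter

def exact_match(actual: list[str], expected: list[str]) -> bool:
    # Sequences must be identical, call for call.
    return actual == expected

def in_order_match(actual: list[str], expected: list[str]) -> bool:
    # Expected calls must appear in the same relative order;
    # extra calls in between are allowed (subsequence check).
    it = iter(actual)
    return all(tool in it for tool in expected)

def any_order_match(actual: list[str], expected: list[str]) -> bool:
    # Every expected call must occur at least as often; ordering ignored.
    need, have = Counter(expected), Counter(actual)
    return all(have[tool] >= n for tool, n in need.items())

actual = ["search", "fetch_page", "summarize"]
print(exact_match(actual, ["search", "summarize"]))      # False
print(in_order_match(actual, ["search", "summarize"]))   # True
print(any_order_match(actual, ["summarize", "search"]))  # True
```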
@@ -348,10 +325,10 @@ The `details` dict on `EvaluationReport` holds operational metadata that is sepa
 ### Evaluate Against a Golden Trajectory

 ```python
-from bigquery_agent_analytics import BigQueryTraceEvaluator
-from bigquery_agent_analytics.trace_evaluator import MatchType
+from bigquery_agent_analytics import PerformanceEvaluator
+from bigquery_agent_analytics.performance_evaluator import MatchType

-evaluator = BigQueryTraceEvaluator(
+evaluator = PerformanceEvaluator(
     project_id="my-project",
     dataset_id="agent_analytics",
     # Optional: filter which event types are fetched from BigQuery.
@@ -416,7 +393,7 @@ Use `TrajectoryMetrics` for direct score computation without BigQuery:

 ```python
 from bigquery_agent_analytics import TrajectoryMetrics
-from bigquery_agent_analytics.trace_evaluator import ToolCall
+from bigquery_agent_analytics.performance_evaluator import ToolCall

 actual = [
     ToolCall(tool_name="search", args={"q": "test"}),
@@ -458,7 +435,7 @@ print(f"Response match: {diff['response_match']}")

 ## 6. Multi-Trial Evaluation (pass@k / pass^k)

-Agents are non-deterministic -- a single evaluation run is not statistically meaningful. `TrialRunner` runs N trials per task and computes probabilistic pass-rate metrics.
+Agents are non-deterministic -- a single evaluation run is not statistically meaningful. `MultiTrialPerformanceEvaluator` runs N trials per task and computes probabilistic pass-rate metrics.

 ### Key Metrics

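The two headline metrics have standard closed forms. Assuming c of n trials pass a task, pass@k is the probability that at least one of k trials sampled without replacement passes (the usual unbiased estimator), and pass^k is the probability that k independent trials all pass. A sketch of those formulas (an illustration, not the SDK's internal code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k trials drawn without replacement
    from n trials (c of which passed) is a pass."""
    if n - c < k:
        return 1.0  # too few failures: every size-k sample contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Probability that k independent trials all pass, given an empirical
    per-trial pass rate of c / n."""
    return (c / n) ** k

print(pass_at_k(10, 5, 1))   # 0.5
print(pass_hat_k(10, 5, 2))  # 0.25
```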
@@ -471,14 +448,14 @@ Agents are non-deterministic -- a single evaluation run is not statistically mea
 ### Run Multi-Trial Evaluation

 ```python
-from bigquery_agent_analytics import BigQueryTraceEvaluator, TrialRunner
+from bigquery_agent_analytics import PerformanceEvaluator, MultiTrialPerformanceEvaluator

-evaluator = BigQueryTraceEvaluator(
+evaluator = PerformanceEvaluator(
     project_id="my-project",
     dataset_id="analytics",
 )

-runner = TrialRunner(
+runner = MultiTrialPerformanceEvaluator(
     evaluator,
     num_trials=10,   # run each task 10 times
     concurrency=3,   # max 3 concurrent evaluations
@@ -725,7 +702,7 @@ print(f"Graduated: {graduated}")  # ["password_reset", "order_lookup"]
 ### Convert to Eval Dataset & Serialize

 ```python
-# Convert to the format accepted by BigQueryTraceEvaluator.evaluate_batch()
+# Convert to the format accepted by PerformanceEvaluator.evaluate_batch()
 dataset = suite.to_eval_dataset(category=EvalCategory.REGRESSION)
 results = await evaluator.evaluate_batch(dataset)

@@ -2026,12 +2003,12 @@ bigquery_agent_analytics/
 │ Core
 │ ├── client.py ← High-level SDK entry point
 │ ├── trace.py ← Trace/Span reconstruction & DAG rendering
-│ └── evaluators.py ← CodeEvaluator + LLMAsJudge + SQL templates
+│ └── system_evaluator.py ← CodeEvaluator + LLMAsJudge + SQL templates

 │ Evaluation Harness
-│ ├── trace_evaluator.py ← BigQueryTraceEvaluator, trajectory matching, replay
-│ ├── multi_trial.py ← TrialRunner, pass@k, pass^k
-│ ├── grader_pipeline.py ← GraderPipeline + scoring strategies
+│ ├── performance_evaluator.py ← PerformanceEvaluator, trajectory matching, replay
+│ ├── multi_trial_performance_evaluator.py ← TrialRunner, pass@k, pass^k
+│ ├── aggregate_grader.py ← AggregateGrader + scoring strategies
 │ ├── eval_suite.py ← EvalSuite lifecycle management
 │ └── eval_validator.py ← Static validation checks

@@ -2070,8 +2047,8 @@ bigquery_agent_analytics/
 ```
 Standalone modules (no internal imports):
 ├── trace.py
-├── evaluators.py
-├── trace_evaluator.py
+├── system_evaluator.py
+├── performance_evaluator.py
 ├── feedback.py
 ├── ai_ml_integration.py
 ├── bigframes_evaluator.py
@@ -2083,12 +2060,12 @@ Standalone modules (no internal imports):
 └── eval_suite.py

 Modules with internal imports:
-├── insights.py → evaluators
-├── grader_pipeline.py → evaluators
-├── multi_trial.py → trace_evaluator
+├── insights.py → system_evaluator
+├── aggregate_grader.py → system_evaluator
+├── multi_trial_performance_evaluator.py → performance_evaluator
 ├── eval_validator.py → eval_suite
 ├── categorical_views.py → categorical_evaluator (DEFAULT_RESULTS_TABLE)
-└── client.py → evaluators, feedback, insights, trace, context_graph, categorical_*
+└── client.py → system_evaluator, feedback, insights, trace, context_graph, categorical_*

 External dependency:
 └── memory_service.py → google-adk (memory + sessions)

docs/design.md

Lines changed: 17 additions & 17 deletions
@@ -152,7 +152,7 @@ As demonstrated in the [e2e demo](../examples/e2e_demo.py):
 1. `Client.get_trace()` retrieves all events for a session
 2. `SystemEvaluator` preset factories assess latency, turn count, error rate, token efficiency
 3. `LLMAsJudge.correctness()` performs semantic evaluation via BigQuery `AI.GENERATE`
-4. `BigQueryTraceEvaluator.evaluate_session()` performs trajectory matching against golden tool sequences
+4. `PerformanceEvaluator.evaluate_session()` performs trajectory matching against golden tool sequences

 **Phase 3 — Insights:**
 1. `Client.insights()` triggers the multi-stage pipeline
@@ -208,7 +208,7 @@ As demonstrated in the [e2e demo](../examples/e2e_demo.py):
 │ categorical_evaluator│ │ ontology_* (6 modules)│ │ cli │
 │ categorical_views │ │ (YAML → AI.GENERATE → │ │ (Typer commands) │
 │ (label evaluation) │ │ tables → PG → GQL) │ │ │
-└──────────────────┘ └──────────────────┘ └──────────────────┘
+└──────────────────────┘ └──────────────────────┘ └──────────────────┘

 ┌──────────────────┐ ┌───────────────────┐
 │ udf_kernels │ │ serialization │
@@ -222,7 +222,7 @@ As demonstrated in the [e2e demo](../examples/e2e_demo.py):
 |-------|---------|----------------|
 | **Entry Point** | `client.py` | High-level sync API, BigQuery query orchestration |
 | **Core Data** | `trace.py` | Trace/Span reconstruction, DAG rendering, filtering |
-| **Evaluation Engine** | `evaluators.py`, `trace_evaluator.py`, `multi_trial.py`, `grader_pipeline.py` | Deterministic metrics, LLM-as-judge, trajectory matching, multi-trial statistics, grader composition |
+| **Evaluation Engine** | `system_evaluator.py`, `performance_evaluator.py`, `multi_trial_performance_evaluator.py`, `aggregate_grader.py` | Deterministic metrics, LLM-as-judge, trajectory matching, multi-trial statistics, grader composition |
 | **Categorical Evaluation** | `categorical_evaluator.py`, `categorical_views.py` | User-defined categorical classification with AI.GENERATE + Gemini fallback, dashboard views with dedup |
 | **Eval Governance** | `eval_suite.py`, `eval_validator.py` | Task lifecycle management, static quality validation |
 | **Feedback & Insights** | `feedback.py`, `insights.py` | Drift detection, question distribution, multi-stage analysis pipeline |
@@ -392,7 +392,7 @@ class TraceFilter:

 Each field generates a separate `AND` condition with a corresponding `bigquery.ScalarQueryParameter` or `bigquery.ArrayQueryParameter`. This is the **only** dynamic SQL in the SDK — everything else uses static templates.

-### 4.3 `evaluators.py` — Code & LLM Evaluation
+### 4.3 `system_evaluator.py` — Code & LLM Evaluation

 This module contains two evaluator classes and the SQL templates that power batch evaluation.

@@ -504,9 +504,9 @@ FROM session_traces

 This avoids transferring trace data out of BigQuery for evaluation.

-### 4.4 `trace_evaluator.py` — Trajectory Matching & Replay
+### 4.4 `performance_evaluator.py` — Trajectory Matching & Replay

-#### 4.4.1 `BigQueryTraceEvaluator`
+#### 4.4.1 `PerformanceEvaluator`

 Evaluates agent behavior against expected tool-call trajectories.

@@ -610,7 +610,7 @@
     trial_results: list[TrialResult]
 ```

-### 4.6 `grader_pipeline.py` — Grader Composition
+### 4.6 `aggregate_grader.py` — Grader Composition

 Combines heterogeneous evaluators into a unified verdict using a strategy pattern.

@@ -1219,10 +1219,10 @@ results = client.query(formatted, job_config=job_config)
 |--------|----------|---------|
 | `client.py` | `_SESSION_EVENTS_QUERY` | Fetch all events for a session |
 | `client.py` | `_LIST_SESSIONS_QUERY` | Discover sessions matching filter |
-| `evaluators.py` | `SESSION_SUMMARY_QUERY` | Aggregate session metrics for code evaluation |
-| `evaluators.py` | `AI_GENERATE_JUDGE_BATCH_QUERY` | Batch LLM-as-judge via AI.GENERATE |
-| `evaluators.py` | `LLM_JUDGE_BATCH_QUERY` | Legacy batch evaluation via ML.GENERATE_TEXT |
-| `trace_evaluator.py` | `_SESSION_TRACE_QUERY` | Fetch trace for trajectory matching |
+| `system_evaluator.py` | `SESSION_SUMMARY_QUERY` | Aggregate session metrics for code evaluation |
+| `system_evaluator.py` | `AI_GENERATE_JUDGE_BATCH_QUERY` | Batch LLM-as-judge via AI.GENERATE |
+| `system_evaluator.py` | `LLM_JUDGE_BATCH_QUERY` | Legacy batch evaluation via ML.GENERATE_TEXT |
+| `performance_evaluator.py` | `_SESSION_TRACE_QUERY` | Fetch trace for trajectory matching |
 | `insights.py` | `_SESSION_METADATA_QUERY` | Aggregate session metadata |
 | `insights.py` | `_SESSION_TRANSCRIPT_QUERY` | Build session transcripts |
 | `insights.py` | `_AI_GENERATE_FACET_EXTRACTION_QUERY` | Extract structured facets via AI.GENERATE |
@@ -1272,7 +1272,7 @@ Evaluation
 │ ├── Sentiment
 │ └── Custom criteria with prompt templates

-├── Trajectory (BigQueryTraceEvaluator)
+├── Trajectory (PerformanceEvaluator)
 │ ├── Exact match
 │ ├── In-order match
 │ ├── Any-order match
@@ -1309,7 +1309,7 @@ All evaluation scores in the SDK are normalized to `[0.0, 1.0]`:
 | Single session (sync) | `SystemEvaluator.evaluate_session()` | Python |
 | Single session (async) | `LLMAsJudge.evaluate_session()` | Gemini API |
 | Batch via Client | `Client.evaluate()` | BigQuery (SQL + AI.GENERATE) |
-| Trajectory matching | `BigQueryTraceEvaluator.evaluate_session()` | BigQuery (fetch) + Python (matching) |
+| Trajectory matching | `PerformanceEvaluator.evaluate_session()` | BigQuery (fetch) + Python (matching) |
 | Multi-trial | `TrialRunner.run_trials()` | BigQuery (fetch) + Python (N iterations) |
 | Pipeline | `GraderPipeline.evaluate()` | Mixed (code=Python, LLM=API/BQ) |
 | DataFrame | `BigFramesEvaluator.evaluate_sessions()` | BigQuery (BigFrames + AI.GENERATE) |
@@ -1412,8 +1412,8 @@ Synchronous (user-facing):

 Async (internal / advanced users):
 ├── LLMAsJudge.evaluate_session()
-├── BigQueryTraceEvaluator.evaluate_session()
-├── BigQueryTraceEvaluator.evaluate_batch()
+├── PerformanceEvaluator.evaluate_session()
+├── PerformanceEvaluator.evaluate_batch()
 ├── TrialRunner.run_trials()
 ├── TrialRunner.run_trials_batch()
 ├── GraderPipeline.evaluate()
@@ -1449,7 +1449,7 @@ async def _execute_query(self, query, params):

 ### 9.4 Concurrency Control

-`TrialRunner` and `BigQueryTraceEvaluator.evaluate_batch()` use `asyncio.Semaphore` for bounded concurrency:
+`TrialRunner` and `PerformanceEvaluator.evaluate_batch()` use `asyncio.Semaphore` for bounded concurrency:

 ```python
 semaphore = asyncio.Semaphore(concurrency)
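The `asyncio.Semaphore` pattern referenced above can be sketched end to end. This is a generic illustration of bounded concurrency, not the SDK's internal code; the names are hypothetical:

```python
import asyncio

async def evaluate_all(session_ids: list[str], concurrency: int = 3) -> list[str]:
    """Evaluate many sessions with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(session_id: str) -> str:
        async with semaphore:  # blocks once `concurrency` tasks hold it
            await asyncio.sleep(0)  # stand-in for a real evaluation call
            return f"{session_id}:done"

    # gather() preserves input order regardless of completion order.
    return await asyncio.gather(*(bounded(s) for s in session_ids))

results = asyncio.run(evaluate_all(["a", "b", "c"]))
print(results)  # ['a:done', 'b:done', 'c:done']
```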
@@ -1516,7 +1516,7 @@ Every class that uses BigQuery accepts an optional client parameter:

 ```python
 Client(project_id="...", dataset_id="...", bq_client=custom_client)
-BigQueryTraceEvaluator(..., bq_client=mock_client)
+PerformanceEvaluator(..., bq_client=mock_client)
 BigQueryAIClient(..., client=mock_client)
 ```

docs/implementation_plan_remote_function.md

Lines changed: 1 addition & 1 deletion
@@ -627,7 +627,7 @@ Complete mapping from interface operations to current SDK code:
 | SDK Feature | Class | Potential Operation |
 |-------------|-------|-------------------|
 | Context Graph | `ContextGraphManager` | `context_graph` |
-| Trajectory Evaluation | `BigQueryTraceEvaluator` | `trajectory` |
+| Trajectory Evaluation | `PerformanceEvaluator` | `trajectory` |
 | Multi-Trial | `TrialRunner` | `multi_trial` |
 | Grader Pipeline | `GraderPipeline` | `grade` |
 | Memory Service | `BigQueryMemoryService` | (separate interface) |

docs/prd_unified_analytics_interface.md

Lines changed: 1 addition & 1 deletion
@@ -32,7 +32,7 @@ import the library. This creates three gaps:
 │ Client.insights() Client.drift_detection() │
 │ Client.doctor() Client.deep_analysis() │
 │ Client.hitl_metrics() Client.context_graph() │
-│ ViewManager BigQueryTraceEvaluator │
+│ ViewManager PerformanceEvaluator │
 │ TrialRunner GraderPipeline │
 │ EvalSuite EvalValidator │
 │ BigQueryMemoryService BigQueryAIClient │
