feat(eval2): swap Laminar tracing for Arize Phoenix #873

Open
Felarof (felarof99) wants to merge 11 commits into dev from feat/phoenix-evals

@felarof99
Contributor

Summary

  • Replace @lmnr-ai/lmnr with a plain OpenTelemetry stack pointed at Phoenix's OTLP endpoint, using @arizeai/openinference-vercel to translate Vercel AI SDK spans into OpenInference semantic conventions.
  • Per-task sessions land via the session.id attribute (sessionId = <runId>-<queryId>); per-tool-call screenshots stay as a manual child span with the base64 PNG in output.value.
  • Smoke (benchmark-configs/agisdk-mini.jsonc) runs both tasks end-to-end against the cloud workspace, summary written, traces shipped.
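The session and screenshot conventions in the bullets above can be sketched roughly as follows. This is an illustration only, not the code in tracing.ts: the helper names (`buildSessionId`, `toPngDataUrl`, `screenshotSpanAttributes`) and the `eval.tool_call_id` / `eval.tool_name` attribute keys are hypothetical. Only the `session.id` and `output.value` keys and the `<runId>-<queryId>` format come from the summary.

```typescript
// Hypothetical helpers illustrating the conventions described above.
// Only `session.id`, `output.value`, and the `<runId>-<queryId>` format
// are taken from the PR summary; everything else is an assumption.

type SpanAttributes = Record<string, string>;

/** Builds the per-task session ID in the `<runId>-<queryId>` form. */
function buildSessionId(runId: string, queryId: string): string {
  return `${runId}-${queryId}`;
}

/** Wraps an already base64-encoded PNG in a data URL. */
function toPngDataUrl(base64Png: string): string {
  return `data:image/png;base64,${base64Png}`;
}

/** Attributes for a manual `eval.step.screenshot` child span. */
function screenshotSpanAttributes(
  sessionId: string,
  toolCallId: string,
  toolName: string,
  base64Png: string,
): SpanAttributes {
  return {
    "session.id": sessionId,
    "output.value": toPngDataUrl(base64Png),
    // Assumed, illustrative keys for correlating the screenshot with its tool call.
    "eval.tool_call_id": toolCallId,
    "eval.tool_name": toolName,
  };
}
```

Keeping the attribute construction in a pure function like this would make the span contents easy to unit-test without a running collector.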

Design

POC quality, not future-proof. The tracing.ts module owns all OTel + Phoenix wiring; its public surface (initTracing, getAiSdkTelemetry, withTaskSession, recordScreenshotSpan, flushTracing, isTracingEnabled, getTaskSessionId) is preserved so call sites in single-agent.ts need zero edits and eval-runner.ts only renames the helper. benchmark-config.ts swaps the laminar Zod block for phoenix (enabled, endpoint, apiKeyEnv?, projectName, sessionPrefix). Both JSONC configs (agisdk-mini.jsonc, agisdk-smoke.jsonc) point at the cloud Phoenix endpoint by default; a local phoenix serve is a one-line config change. See .llm/specs/2026-04-28-eval2-laminar-design.md for the broader eval2 design and .llm/0429_replace_laminar_with_phoenix/ for this swap's design + PRD.
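For readers unfamiliar with the config shape, the `phoenix` block described above might look like the sketch below. The PR validates this block with Zod; the hand-written check here is a dependency-free stand-in, `parsePhoenixConfig` is a hypothetical name, and the field types are assumptions inferred from the field names.

```typescript
// Dependency-free sketch of the `phoenix` config block (the PR uses Zod).
// Field types are assumed from the names listed in the design note.

interface PhoenixConfig {
  enabled: boolean;
  endpoint: string; // OTLP endpoint URL (cloud or local Phoenix)
  apiKeyEnv?: string; // name of the env var that holds the API key
  projectName: string;
  sessionPrefix: string;
}

function parsePhoenixConfig(raw: unknown): PhoenixConfig {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("phoenix: expected an object");
  }
  const r = raw as Record<string, unknown>;

  if (typeof r["enabled"] !== "boolean") {
    throw new Error("phoenix.enabled: expected boolean");
  }
  for (const key of ["endpoint", "projectName", "sessionPrefix"]) {
    if (typeof r[key] !== "string" || r[key] === "") {
      throw new Error(`phoenix.${key}: expected non-empty string`);
    }
  }
  if (r["apiKeyEnv"] !== undefined && typeof r["apiKeyEnv"] !== "string") {
    throw new Error("phoenix.apiKeyEnv: expected string when present");
  }

  // The URL constructor throws a TypeError on a malformed endpoint.
  new URL(r["endpoint"] as string);

  return {
    enabled: r["enabled"] as boolean,
    endpoint: r["endpoint"] as string,
    apiKeyEnv: r["apiKeyEnv"] as string | undefined,
    projectName: r["projectName"] as string,
    sessionPrefix: r["sessionPrefix"] as string,
  };
}
```

Under this shape, switching between the cloud workspace and a local instance would only change the `endpoint` value (and possibly drop `apiKeyEnv`).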

Test plan

  • cd packages/browseros-agent/apps/eval2 && bun run typecheck exits 0.
  • bun run eval --config benchmark-configs/agisdk-mini.jsonc runs both agisdk tasks end-to-end (FAIL on grader is expected — agent quality, not infra).
  • results/<runId>/summary.json has phoenixSessionId populated for each task.
  • In the Phoenix UI, project browseros-eval2 shows two sessions named <runId>-<queryId>, each with LLM/tool spans and eval.step.screenshot entries that render the captured PNG.

🤖 Generated with Claude Code

Felarof (felarof99) and others added 11 commits April 28, 2026 17:05
Per the screenshots design note, capture a PNG screenshot after each tool
call and emit it as a manual DEFAULT span with the base64 data URL in the
output. Laminar auto-renders base64 in default span outputs, so each
ai.toolCall.<name> in a trace is followed by an eval.step.screenshot
sibling that shows page state at that step.

Failures are non-fatal — screenshot or span emit errors do not abort the run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace @lmnr-ai/lmnr with a plain OpenTelemetry stack pointed at Phoenix's
OTLP endpoint, using @arizeai/openinference-vercel to translate Vercel AI
SDK spans into OpenInference semantic conventions for Phoenix's UI.

- tracing.ts rewritten: NodeTracerProvider + OpenInferenceSimpleSpanProcessor
  + OTLPTraceExporter. Public surface preserved: initTracing, getAiSdkTelemetry,
  recordScreenshotSpan, flushTracing, isTracingEnabled, getTaskSessionId.
  withTaskTrace renamed to withTaskSession to reflect Phoenix's session model.
- benchmark-config.ts: laminar block replaced with phoenix block (enabled,
  endpoint, apiKeyEnv?, projectName, sessionPrefix).
- agisdk-mini.jsonc + agisdk-smoke.jsonc: pointed at the cloud Phoenix
  workspace; auth via PHOENIX_API_KEY env var.
- eval-runner.ts + types.ts: laminarSessionId renamed to phoenixSessionId.
- package.json: dropped @lmnr-ai/lmnr; added @arizeai/openinference-vercel,
  @arizeai/openinference-semantic-conventions, and core OTel packages
  (@opentelemetry/api, sdk-trace-node, exporter-trace-otlp-proto, resources,
  semantic-conventions) on the 1.30.x line.
- README.md: prereqs and verify section rewritten for Phoenix.

End-to-end smoke (benchmark-configs/agisdk-mini.jsonc) runs both tasks against
the cloud workspace at https://app.phoenix.arize.com/s/niffler92, project
browseros-eval2. Sessions land with phoenixSessionId = <runId>-<queryId>.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@greptile-apps
Contributor

greptile-apps Bot commented Apr 29, 2026

Greptile Summary

This PR introduces a new eval2 app that replaces Laminar tracing with an Arize Phoenix / OpenTelemetry stack, wiring the Vercel AI SDK's experimental_telemetry hook through a new tracing.ts module and adding aiSdkTelemetry support to AiSdkAgent.

  • P1 – Silent auth failure in tracing.ts: When apiKeyEnv is configured but the env var is unset, initTracing continues without auth headers, so all OTLP exports to the cloud Phoenix endpoint are silently rejected. No warning is emitted.
  • P1 – Misleading "View traces" output when tracing is disabled: buildSummary always calls getTaskSessionId() regardless of isTracingEnabled(), so phoenixSessionId is never null in summary.json and the "View traces" console link always prints even when phoenix.enabled: false.

Confidence Score: 3/5

Two P1 logic bugs should be fixed before merging — both cause incorrect observable behaviour (silent trace loss and misleading console output).

Two independent P1 issues: one causes silent data loss (traces silently dropped when cloud API key is absent) and one produces actively misleading output (trace URL printed when tracing is disabled). Neither is catastrophic, but both undermine the core observability goal of the PR.
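For the first issue, a fail-loud check could look something like the sketch below. It is not the PR's code: `resolveAuthHeaders` is a hypothetical helper, and the `Authorization: Bearer` header shape is an assumption that should be checked against the Phoenix documentation for the target deployment.

```typescript
// Sketch of a fail-loud auth check for tracing setup.
// The helper name and the Bearer header shape are assumptions.

type AuthResult =
  | { ok: true; headers: Record<string, string> }
  | { ok: false; reason: string };

function resolveAuthHeaders(
  apiKeyEnv: string | undefined,
  env: Record<string, string | undefined>,
): AuthResult {
  // No env var configured: treat the endpoint as unauthenticated (e.g. local).
  if (apiKeyEnv === undefined) {
    return { ok: true, headers: {} };
  }

  const key = env[apiKeyEnv];
  if (key === undefined || key.trim() === "") {
    // Configured but missing: report it instead of exporting without auth.
    return {
      ok: false,
      reason: `apiKeyEnv is "${apiKeyEnv}" but that variable is empty or unset`,
    };
  }

  return { ok: true, headers: { Authorization: `Bearer ${key}` } };
}
```

A caller such as `initTracing` could log `reason` as a warning and leave tracing disabled when `ok` is false, so the "tracing enabled" message is never printed for a configuration that cannot authenticate.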

packages/browseros-agent/apps/eval2/src/tracing.ts and packages/browseros-agent/apps/eval2/src/eval-runner.ts

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/browseros-agent/apps/eval2/src/tracing.ts | New Phoenix/OTel tracing module; silently skips auth headers when apiKeyEnv is set but the env var is missing, causing all cloud traces to fail with no user-visible error. |
| packages/browseros-agent/apps/eval2/src/eval-runner.ts | Eval orchestration; always calls getTaskSessionId() regardless of tracing state, so summary.json always has non-null phoenixSessionId and the "View traces" console link always prints even when phoenix.enabled=false. |
| packages/browseros-agent/apps/eval2/src/benchmark-config.ts | Swapped Laminar Zod block for phoenix block; validates required phoenix fields including endpoint URL; clean. |
| packages/browseros-agent/apps/eval2/src/single-agent.ts | New single-agent runner that wires experimental_telemetry to getAiSdkTelemetry and records per-tool screenshots via recordScreenshotSpan; straightforward. |
| packages/browseros-agent/apps/server/src/agent/ai-sdk-agent.ts | Minimal one-line addition: threads aiSdkTelemetry config through to experimental_telemetry on ToolLoopAgent; correct and non-breaking. |
| packages/browseros-agent/apps/eval2/src/index.ts | Entry point; stale HELP text still mentions "Laminar-traced" after the provider swap. |

Sequence Diagram

sequenceDiagram
    participant ER as eval-runner.ts
    participant SA as single-agent.ts
    participant TR as tracing.ts
    participant AISDK as AiSdkAgent
    participant OTEL as OTel / Phoenix

    ER->>TR: initTracing(config)
    TR->>OTEL: Register NodeTracerProvider + OTLPTraceExporter

    loop for each task
        ER->>TR: withTaskSession(task, config, runId, fn)
        TR->>OTEL: startActiveSpan(eval.task, {session.id, ...})
        ER->>SA: agent.runTask(task)
        SA->>TR: getAiSdkTelemetry(task, config, runId, conversationId)
        SA->>AISDK: generate({ experimental_telemetry })
        AISDK->>OTEL: emit LLM + tool-call spans

        loop per tool call
            SA->>SA: browser.screenshot()
            SA->>TR: recordScreenshotSpan(toolCallId, toolName, base64)
            TR->>OTEL: startSpan(eval.step.screenshot)
        end
    end

    ER->>TR: flushTracing()
    TR->>OTEL: forceFlush() + shutdown()

Comments Outside Diff (2)

  1. packages/browseros-agent/apps/eval2/src/tracing.ts, lines 1659-1672

    P1 Silent trace export failure when API key env var is missing

    When apiKeyEnv is configured but the corresponding env var is absent, initTracing silently continues without auth headers. Every subsequent OTLP export to the cloud Phoenix endpoint will return 401/403, causing all traces to be silently dropped. The user sees "Phoenix tracing enabled" in the console but no spans land in Phoenix.

    Adding a guard that logs a warning and returns early (similar to how loadBenchmarkConfig throws when the OpenAI key is missing) would make this failure visible rather than silent.

  2. packages/browseros-agent/apps/eval2/src/eval-runner.ts, lines 1186-1193

    P1 phoenixSessionId populated and "View traces" printed even when tracing is disabled

    getTaskSessionId always returns a non-null string, so every entry in summary.json will have phoenixSessionId set regardless of phoenix.enabled. The summary.tasks.some((task) => task.phoenixSessionId) guard at the end of runEval will always be true, causing the "View traces" URL to be printed even when phoenix.enabled: false — no traces exist at that URL.

    The fix is to gate on `isTracingEnabled()`:

    ```ts
    phoenixSessionId: isTracingEnabled()
      ? getTaskSessionId(result.task, config, runId)
      : null,
    ```
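To make the second comment concrete, the sketch below shows how the `some(...)` guard behaves once the session ID is nullable. The `TaskSummary` shape and the `summarizeTask` and `shouldPrintTraceLink` helpers are hypothetical stand-ins for the code in eval-runner.ts, and the tracing flag is passed in explicitly rather than read from `isTracingEnabled()`.

```typescript
// Hypothetical stand-ins showing why the link guard needs nullable IDs.

interface TaskSummary {
  queryId: string;
  phoenixSessionId: string | null;
}

/** Populates the session ID only when tracing is actually enabled. */
function summarizeTask(
  runId: string,
  queryId: string,
  tracingEnabled: boolean,
): TaskSummary {
  return {
    queryId,
    phoenixSessionId: tracingEnabled ? `${runId}-${queryId}` : null,
  };
}

/** Mirrors the `summary.tasks.some(...)` guard from the review comment. */
function shouldPrintTraceLink(tasks: TaskSummary[]): boolean {
  return tasks.some((task) => task.phoenixSessionId !== null);
}
```

With tracing disabled every `phoenixSessionId` is `null`, so the guard returns `false` and no "View traces" link is printed. With tracing enabled the behaviour is unchanged.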
Prompt To Fix All With AI

This is a comment left during a code review.
Path: packages/browseros-agent/apps/eval2/src/index.ts
Line: 1372

Comment:
**Stale help text — still references Laminar**

The `HELP` string still reads `"eval2 - Laminar-traced eval runner"` after the swap to Phoenix.

```suggestion
eval2 - Phoenix-traced eval runner
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: packages/browseros-agent/apps/eval2/src/tracing.ts
Line: 1680-1686

Comment:
**`SimpleSpanProcessor` can cause back-pressure under load**

`OpenInferenceSimpleSpanProcessor` exports each span synchronously before returning, which can block the agent loop under high tool-call volume. For eval runs that produce many spans (screenshot + LLM + tool spans per step), consider using a `BatchSpanProcessor` wrapping the `OTLPTraceExporter` to export asynchronously.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "feat(eval2): swap Laminar tracing for Ar..."
