feat(eval2): swap Laminar tracing for Arize Phoenix by felarof99 · Pull Request #873 · browseros-ai/BrowserOS

Felarof (felarof99) · 2026-04-29T21:29:53Z

Summary

Replace @lmnr-ai/lmnr with a plain OpenTelemetry stack pointed at Phoenix's OTLP endpoint, using @arizeai/openinference-vercel to translate Vercel AI SDK spans into OpenInference semantic conventions.
Per-task sessions land via the session.id attribute (sessionId = <runId>-<queryId>); per-tool-call screenshots stay as a manual child span with the base64 PNG in output.value.
Smoke (benchmark-configs/agisdk-mini.jsonc) runs both tasks end-to-end against the cloud workspace, summary written, traces shipped.
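
The session and screenshot conventions above can be sketched as plain attribute builders. This is a minimal illustration, not the code in tracing.ts: the helper names are hypothetical, and storing the PNG as a data URL is an assumption (the summary only says "base64 PNG in output.value").

```typescript
// Hypothetical helpers. Only the "<runId>-<queryId>" shape and the
// session.id / output.value attribute keys come from the PR description.
type SpanAttributes = Record<string, string>;

// Per-task session id: <runId>-<queryId>.
function buildSessionId(runId: string, queryId: string): string {
  return `${runId}-${queryId}`;
}

// Attributes for the manual per-tool-call screenshot span.
function screenshotAttributes(
  sessionId: string,
  base64Png: string,
): SpanAttributes {
  return {
    "session.id": sessionId,
    // Assumption: the PNG is wrapped in a data URL before being stored.
    "output.value": `data:image/png;base64,${base64Png}`,
  };
}
```

Keeping the id derivation in one function means the value written to summary.json and the value attached to spans cannot drift apart.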

Design

POC quality, not future-proof. The tracing.ts module owns all OTel + Phoenix wiring; its public surface (initTracing, getAiSdkTelemetry, withTaskSession, recordScreenshotSpan, flushTracing, isTracingEnabled, getTaskSessionId) is preserved so call sites in single-agent.ts need zero edits and eval-runner.ts only renames the helper. benchmark-config.ts swaps the laminar Zod block for phoenix (enabled, endpoint, apiKeyEnv?, projectName, sessionPrefix). Both JSONC configs (agisdk-mini.jsonc, agisdk-smoke.jsonc) point at the cloud Phoenix endpoint by default; a local phoenix serve is a one-line config change. See .llm/specs/2026-04-28-eval2-laminar-design.md for the broader eval2 design and .llm/0429_replace_laminar_with_phoenix/ for this swap's design + PRD.
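
For orientation, the phoenix config block described above might look roughly like the following. The field names come from this description; the validation rules, the function name, and the hand-rolled checks (used here in place of the actual Zod schema so the sketch has no dependencies) are assumptions.

```typescript
// Field names are from the PR description; everything else is illustrative.
interface PhoenixConfig {
  enabled: boolean;
  endpoint: string;
  apiKeyEnv?: string;
  projectName: string;
  sessionPrefix: string;
}

// Hypothetical stand-in for the Zod block in benchmark-config.ts.
function parsePhoenixConfig(raw: unknown): PhoenixConfig {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("phoenix: expected an object");
  }
  const { enabled, endpoint, apiKeyEnv, projectName, sessionPrefix } =
    raw as Record<string, unknown>;
  if (typeof enabled !== "boolean") {
    throw new Error("phoenix.enabled: expected a boolean");
  }
  // Assumed rule: the endpoint must be an http(s) URL.
  if (typeof endpoint !== "string" || !/^https?:\/\/\S+$/.test(endpoint)) {
    throw new Error("phoenix.endpoint: expected an http(s) URL");
  }
  if (apiKeyEnv !== undefined && typeof apiKeyEnv !== "string") {
    throw new Error("phoenix.apiKeyEnv: expected a string when present");
  }
  if (typeof projectName !== "string" || projectName.length === 0) {
    throw new Error("phoenix.projectName: expected a non-empty string");
  }
  if (typeof sessionPrefix !== "string") {
    throw new Error("phoenix.sessionPrefix: expected a string");
  }
  const config: PhoenixConfig = { enabled, endpoint, projectName, sessionPrefix };
  if (apiKeyEnv !== undefined) config.apiKeyEnv = apiKeyEnv;
  return config;
}
```

Because apiKeyEnv is optional, switching to a local Phoenix instance would only require changing endpoint and omitting the key.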

Test plan

cd packages/browseros-agent/apps/eval2 && bun run typecheck exits 0.
bun run eval --config benchmark-configs/agisdk-mini.jsonc runs both agisdk tasks end-to-end (FAIL on grader is expected — agent quality, not infra).
results/<runId>/summary.json has phoenixSessionId populated for each task.
In the Phoenix UI, project browseros-eval2 shows two sessions named <runId>-<queryId>, each with LLM/tool spans and eval.step.screenshot entries that render the captured PNG.

🤖 Generated with Claude Code

Per the screenshots design note, capture a PNG screenshot after each tool call and emit it as a manual DEFAULT span with the base64 data URL in the output. Laminar auto-renders base64 in default span outputs, so each ai.toolCall.<name> in a trace is followed by an eval.step.screenshot sibling that shows page state at that step. Failures are non-fatal — screenshot or span emit errors do not abort the run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace @lmnr-ai/lmnr with a plain OpenTelemetry stack pointed at Phoenix's OTLP endpoint, using @arizeai/openinference-vercel to translate Vercel AI SDK spans into OpenInference semantic conventions for Phoenix's UI.

- tracing.ts rewritten: NodeTracerProvider + OpenInferenceSimpleSpanProcessor + OTLPTraceExporter. Public surface preserved: initTracing, getAiSdkTelemetry, recordScreenshotSpan, flushTracing, isTracingEnabled, getTaskSessionId. withTaskTrace renamed to withTaskSession to reflect Phoenix's session model.
- benchmark-config.ts: laminar block replaced with phoenix block (enabled, endpoint, apiKeyEnv?, projectName, sessionPrefix).
- agisdk-mini.jsonc + agisdk-smoke.jsonc: pointed at the cloud Phoenix workspace; auth via PHOENIX_API_KEY env var.
- eval-runner.ts + types.ts: laminarSessionId renamed to phoenixSessionId.
- package.json: dropped @lmnr-ai/lmnr; added @arizeai/openinference-vercel, @arizeai/openinference-semantic-conventions, and core OTel packages (@opentelemetry/api, sdk-trace-node, exporter-trace-otlp-proto, resources, semantic-conventions) on the 1.30.x line.
- README.md: prereqs and verify section rewritten for Phoenix.

End-to-end smoke (benchmark-configs/agisdk-mini.jsonc) runs both tasks against the cloud workspace at https://app.phoenix.arize.com/s/niffler92, project browseros-eval2. Sessions land with phoenixSessionId = <runId>-<queryId>.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

greptile-apps · 2026-04-29T21:32:41Z

Greptile Summary

This PR introduces a new eval2 app that replaces Laminar tracing with an Arize Phoenix / OpenTelemetry stack, wiring the Vercel AI SDK's experimental_telemetry hook through a new tracing.ts module and adding aiSdkTelemetry support to AiSdkAgent.

P1 – Silent auth failure in tracing.ts: When apiKeyEnv is configured but the env var is unset, initTracing continues without auth headers, so all OTLP exports to the cloud Phoenix endpoint are silently rejected. No warning is emitted.
P1 – Misleading "View traces" output when tracing is disabled: buildSummary always calls getTaskSessionId() regardless of isTracingEnabled(), so phoenixSessionId is never null in summary.json and the "View traces" console link always prints even when phoenix.enabled: false.

Confidence Score: 3/5

Two P1 logic bugs should be fixed before merging — both cause incorrect observable behaviour (silent trace loss and misleading console output).

Two independent P1 issues: one causes silent data loss (traces silently dropped when cloud API key is absent) and one produces actively misleading output (trace URL printed when tracing is disabled). Neither is catastrophic, but both undermine the core observability goal of the PR.

packages/browseros-agent/apps/eval2/src/tracing.ts and packages/browseros-agent/apps/eval2/src/eval-runner.ts

Important Files Changed

Filename	Overview
packages/browseros-agent/apps/eval2/src/tracing.ts	New Phoenix/OTel tracing module; silently skips auth headers when apiKeyEnv is set but the env var is missing, causing all cloud traces to fail with no user-visible error.
packages/browseros-agent/apps/eval2/src/eval-runner.ts	Eval orchestration; always calls getTaskSessionId() regardless of tracing state, so summary.json always has non-null phoenixSessionId and the "View traces" console link always prints even when phoenix.enabled=false.
packages/browseros-agent/apps/eval2/src/benchmark-config.ts	Swapped Laminar Zod block for phoenix block; validates required phoenix fields including endpoint URL; clean.
packages/browseros-agent/apps/eval2/src/single-agent.ts	New single-agent runner that wires experimental_telemetry to getAiSdkTelemetry and records per-tool screenshots via recordScreenshotSpan; straightforward.
packages/browseros-agent/apps/server/src/agent/ai-sdk-agent.ts	Minimal one-line addition: threads aiSdkTelemetry config through to experimental_telemetry on ToolLoopAgent; correct and non-breaking.
packages/browseros-agent/apps/eval2/src/index.ts	Entry point; stale HELP text still mentions "Laminar-traced" after the provider swap.

Sequence Diagram

sequenceDiagram
    participant ER as eval-runner.ts
    participant SA as single-agent.ts
    participant TR as tracing.ts
    participant AISDK as AiSdkAgent
    participant OTEL as OTel / Phoenix

    ER->>TR: initTracing(config)
    TR->>OTEL: Register NodeTracerProvider + OTLPTraceExporter

    loop for each task
        ER->>TR: withTaskSession(task, config, runId, fn)
        TR->>OTEL: startActiveSpan(eval.task, {session.id, ...})
        ER->>SA: agent.runTask(task)
        SA->>TR: getAiSdkTelemetry(task, config, runId, conversationId)
        SA->>AISDK: generate({ experimental_telemetry })
        AISDK->>OTEL: emit LLM + tool-call spans

        loop per tool call
            SA->>SA: browser.screenshot()
            SA->>TR: recordScreenshotSpan(toolCallId, toolName, base64)
            TR->>OTEL: startSpan(eval.step.screenshot)
        end
    end

    ER->>TR: flushTracing()
    TR->>OTEL: forceFlush() + shutdown()

Comments Outside Diff (2)

packages/browseros-agent/apps/eval2/src/tracing.ts, line 1659-1672

Silent trace export failure when API key env var is missing

When apiKeyEnv is configured but the corresponding env var is absent, initTracing silently continues without auth headers. Every subsequent OTLP export to the cloud Phoenix endpoint will return 401/403, causing all traces to be silently dropped. The user sees "Phoenix tracing enabled" in the console but no spans land in Phoenix.

Adding a guard that logs a warning and returns early (similar to how loadBenchmarkConfig throws when the OpenAI key is missing) would make this failure visible rather than silent.
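
One possible shape for such a guard is sketched below. It is an illustration under stated assumptions rather than the actual initTracing code: resolveAuthHeaders is a hypothetical helper, the environment is passed in as a plain object, and the Authorization: Bearer header format is assumed, not confirmed by this PR.

```typescript
// Hypothetical guard: names and the header format are assumptions.
type Env = Record<string, string | undefined>;

type AuthResult =
  | { ok: true; headers: Record<string, string> }
  | { ok: false; reason: string };

function resolveAuthHeaders(apiKeyEnv: string | undefined, env: Env): AuthResult {
  // No apiKeyEnv configured (e.g. a local Phoenix): nothing to check.
  if (apiKeyEnv === undefined) return { ok: true, headers: {} };

  const key = env[apiKeyEnv];
  if (key === undefined || key.trim() === "") {
    // Surface the misconfiguration instead of exporting without auth.
    return {
      ok: false,
      reason: `phoenix.apiKeyEnv is "${apiKeyEnv}" but that variable is unset or empty`,
    };
  }
  // Assumption: the endpoint accepts a bearer token in this header.
  return { ok: true, headers: { Authorization: `Bearer ${key}` } };
}
```

A caller could log reason as a warning and return early when ok is false, so that the "Phoenix tracing enabled" message is only printed once the headers are known to be present.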


packages/browseros-agent/apps/eval2/src/eval-runner.ts, line 1186-1193

phoenixSessionId populated and "View traces" printed even when tracing is disabled

getTaskSessionId always returns a non-null string, so every entry in summary.json will have phoenixSessionId set regardless of phoenix.enabled. The summary.tasks.some((task) => task.phoenixSessionId) guard at the end of runEval will always be true, causing the "View traces" URL to be printed even when phoenix.enabled: false — no traces exist at that URL.

The fix is to gate on isTracingEnabled():

phoenixSessionId: isTracingEnabled()
  ? getTaskSessionId(result.task, config, runId)
  : null,
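
To see why the gate matters downstream, here is a small self-contained sketch of the summary entry and the "View traces" check. The TaskSummary shape and both function names are hypothetical simplifications of what eval-runner.ts does.

```typescript
// Hypothetical, simplified shapes; the real summary has more fields.
interface TaskSummary {
  queryId: string;
  phoenixSessionId: string | null;
}

// Only record a session id when tracing actually ran.
function summarizeTask(
  queryId: string,
  runId: string,
  tracingEnabled: boolean,
): TaskSummary {
  return {
    queryId,
    phoenixSessionId: tracingEnabled ? `${runId}-${queryId}` : null,
  };
}

// With null entries, this check is false when tracing is disabled,
// so the trace link is not printed for a run that shipped no traces.
function shouldPrintTraceLink(tasks: TaskSummary[]): boolean {
  return tasks.some((task) => task.phoenixSessionId !== null);
}
```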


Prompt To Fix All With AI

This is a comment left during a code review.
Path: packages/browseros-agent/apps/eval2/src/index.ts
Line: 1372

Comment:
**Stale help text — still references Laminar**

The `HELP` string still reads `"eval2 - Laminar-traced eval runner"` after the swap to Phoenix.

```suggestion
eval2 - Phoenix-traced eval runner
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: packages/browseros-agent/apps/eval2/src/tracing.ts
Line: 1680-1686

Comment:
**`SimpleSpanProcessor` can cause back-pressure under load**

`OpenInferenceSimpleSpanProcessor` exports each span synchronously before returning, which can block the agent loop under high tool-call volume. For eval runs that produce many spans (screenshot + LLM + tool spans per step), consider using a `BatchSpanProcessor` wrapping the `OTLPTraceExporter` to export asynchronously.

How can I resolve this? If you propose a fix, please make it concise.
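
The batching idea in the comment above can be illustrated with a dependency-free toy. This is not the OpenTelemetry BatchSpanProcessor and makes no claim about its internals; BatchingQueue and its parameters are hypothetical, and the sketch only shows the general pattern of buffering on the hot path and exporting in the background.

```typescript
// Toy illustration of batched export; all names are hypothetical.
type ExportFn<T> = (batch: T[]) => Promise<void>;

class BatchingQueue<T> {
  private buffer: T[] = [];
  private pending: Promise<void> = Promise.resolve();
  private readonly exportFn: ExportFn<T>;
  private readonly maxBatchSize: number;

  constructor(exportFn: ExportFn<T>, maxBatchSize = 32) {
    this.exportFn = exportFn;
    this.maxBatchSize = maxBatchSize;
  }

  // Number of items buffered and not yet handed to the exporter.
  get size(): number {
    return this.buffer.length;
  }

  // Hot path: only buffers, never waits on the network.
  enqueue(item: T): void {
    this.buffer.push(item);
    if (this.buffer.length >= this.maxBatchSize) this.scheduleExport();
  }

  // Drains the buffer and waits for all in-flight exports.
  async flush(): Promise<void> {
    this.scheduleExport();
    await this.pending;
  }

  private scheduleExport(): void {
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    // Export failures are swallowed so they never abort the run.
    this.pending = this.pending
      .then(() => this.exportFn(batch))
      .catch(() => {});
  }
}
```

The trade-off is that spans still in the buffer are lost if the process exits without calling flush, so a batched setup depends on the end-of-run flush step completing.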

Reviews (1): Last reviewed commit: "feat(eval2): swap Laminar tracing for Ar..."

Felarof (felarof99) and others added 11 commits April 28, 2026 17:05

feat(eval2): scaffold package skeleton

ffd15cf

feat(eval2): copy agisdk eval assets and shared types

3449a44

feat(server): pass AI SDK telemetry into ToolLoopAgent

73ed229

feat(eval2): add Laminar traced runner

d35167e

fix(eval2): harden task failure handling and TS types

53fb95c

fix(eval2): parse agisdk sidecar JSON from noisy output

2846bd3

test(eval2): add mini agisdk smoke config

68be370

test(eval2): use gpt-4.1 for agisdk smoke configs

a4b1866

fix(eval2): make Laminar smoke traces stable

e58677a

Felarof (felarof99) wants to merge 11 commits into dev from feat/phoenix-evals
