feat(eval2): swap Laminar tracing for Arize Phoenix #873
Felarof (felarof99) wants to merge 11 commits into dev
Per the screenshots design note, capture a PNG screenshot after each tool call and emit it as a manual DEFAULT span with the base64 data URL in the output. Laminar auto-renders base64 in default span outputs, so each ai.toolCall.<name> in a trace is followed by an eval.step.screenshot sibling that shows page state at that step.

Failures are non-fatal: screenshot or span emit errors do not abort the run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
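The non-fatal screenshot span described in this commit could look roughly like the sketch below. It is not the PR's code: `SpanLike`, `StartSpan`, `recordScreenshotSpanSketch`, and all three attribute keys are hypothetical stand-ins (a real implementation would go through the tracing SDK's own span API), and only the span name `eval.step.screenshot` comes from the commit message.

```typescript
// Hypothetical minimal span shape; real code would use the tracing SDK's span type.
interface SpanLike {
  setAttribute(key: string, value: string): void;
  end(): void;
}

// Hypothetical factory standing in for something like tracer.startSpan(name).
type StartSpan = (name: string) => SpanLike;

/** Wrap base64-encoded PNG bytes in a data URL that a trace UI can render inline. */
function toPngDataUrl(base64Png: string): string {
  return `data:image/png;base64,${base64Png}`;
}

/**
 * Emit one screenshot span. Returns false instead of throwing, so a failed
 * span emit never aborts the surrounding eval run.
 */
function recordScreenshotSpanSketch(
  startSpan: StartSpan,
  toolCallId: string,
  toolName: string,
  base64Png: string,
): boolean {
  try {
    const span = startSpan("eval.step.screenshot");
    try {
      span.setAttribute("tool.call.id", toolCallId); // illustrative key
      span.setAttribute("tool.name", toolName); // illustrative key
      span.setAttribute("output", toPngDataUrl(base64Png)); // illustrative key
    } finally {
      span.end(); // always close the span, even if an attribute write throws
    }
    return true;
  } catch (err) {
    console.warn(`screenshot span skipped for ${toolName}`, err);
    return false;
  }
}
```

Returning a boolean rather than rethrowing keeps the decision about logging or retrying with the caller while guaranteeing the run continues.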
Replace @lmnr-ai/lmnr with a plain OpenTelemetry stack pointed at Phoenix's OTLP endpoint, using @arizeai/openinference-vercel to translate Vercel AI SDK spans into OpenInference semantic conventions for Phoenix's UI.

- tracing.ts rewritten: NodeTracerProvider + OpenInferenceSimpleSpanProcessor + OTLPTraceExporter. Public surface preserved: initTracing, getAiSdkTelemetry, recordScreenshotSpan, flushTracing, isTracingEnabled, getTaskSessionId. withTaskTrace renamed to withTaskSession to reflect Phoenix's session model.
- benchmark-config.ts: laminar block replaced with phoenix block (enabled, endpoint, apiKeyEnv?, projectName, sessionPrefix).
- agisdk-mini.jsonc + agisdk-smoke.jsonc: pointed at the cloud Phoenix workspace; auth via PHOENIX_API_KEY env var.
- eval-runner.ts + types.ts: laminarSessionId renamed to phoenixSessionId.
- package.json: dropped @lmnr-ai/lmnr; added @arizeai/openinference-vercel, @arizeai/openinference-semantic-conventions, and core OTel packages (@opentelemetry/api, sdk-trace-node, exporter-trace-otlp-proto, resources, semantic-conventions) on the 1.30.x line.
- README.md: prereqs and verify section rewritten for Phoenix.

End-to-end smoke (benchmark-configs/agisdk-mini.jsonc) runs both tasks against the cloud workspace at https://app.phoenix.arize.com/s/niffler92, project browseros-eval2. Sessions land with phoenixSessionId = <runId>-<queryId>.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
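To make the shape of the phoenix block and the session ID concrete, here is a dependency-free sketch. The field names and the `<runId>-<queryId>` format come from the commit message; the hand-written validation, its rules (for example, requiring non-empty strings), and the helper names `parsePhoenixConfig`, `requireString`, and `buildSessionId` are assumptions made for illustration. The actual benchmark-config.ts uses Zod and may validate differently.

```typescript
// Field names come from the commit message; the validation rules are assumptions.
interface PhoenixConfig {
  enabled: boolean;
  endpoint: string;
  apiKeyEnv?: string; // name of the env var holding the key, not the key itself
  projectName: string;
  sessionPrefix: string;
}

// Hypothetical helper: read a required, non-empty string field.
function requireString(obj: Record<string, unknown>, key: string): string {
  const value = obj[key];
  if (typeof value !== "string" || value.length === 0) {
    throw new Error(`phoenix.${key} must be a non-empty string`);
  }
  return value;
}

// Hypothetical hand-rolled parser standing in for the Zod schema.
function parsePhoenixConfig(raw: unknown): PhoenixConfig {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("phoenix block must be an object");
  }
  const obj = raw as Record<string, unknown>;
  const enabled = obj["enabled"];
  if (typeof enabled !== "boolean") {
    throw new Error("phoenix.enabled must be a boolean");
  }
  const config: PhoenixConfig = {
    enabled,
    endpoint: requireString(obj, "endpoint"),
    projectName: requireString(obj, "projectName"),
    sessionPrefix: requireString(obj, "sessionPrefix"),
  };
  if (obj["apiKeyEnv"] !== undefined) {
    config.apiKeyEnv = requireString(obj, "apiKeyEnv");
  }
  return config;
}

// Session ID format as stated in the commit message: <runId>-<queryId>.
function buildSessionId(runId: string, queryId: string): string {
  return `${runId}-${queryId}`;
}
```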
Greptile Summary

This PR introduces a new
Confidence Score: 3/5

Two P1 logic bugs should be fixed before merging; both cause incorrect observable behaviour (silent trace loss and misleading console output).

The two P1 issues are independent: one causes silent data loss (traces are dropped without warning when the cloud API key is absent) and the other produces actively misleading output (a trace URL is printed when tracing is disabled). Neither is catastrophic, but both undermine the core observability goal of the PR.

Affected files: packages/browseros-agent/apps/eval2/src/tracing.ts and packages/browseros-agent/apps/eval2/src/eval-runner.ts
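One possible shape for guards against the two issues flagged here is sketched below. This is not the fix applied in the PR: `TracingConfigSketch`, `resolveTracingState`, `traceUrlLine`, and the bearer-style header are all assumptions chosen for illustration. The idea is to decide once, up front, whether tracing is really active, carry a human-readable reason when it is not, and print a trace URL only in the active case.

```typescript
// Subset of the phoenix config block; names follow the PR, semantics are assumed.
interface TracingConfigSketch {
  enabled: boolean;
  endpoint: string;
  apiKeyEnv?: string;
}

type TracingState =
  | { active: true; headers: Record<string, string> }
  | { active: false; reason: string };

/** Decide whether tracing can actually ship spans, and say why not if it cannot. */
function resolveTracingState(
  config: TracingConfigSketch,
  env: Record<string, string | undefined>,
): TracingState {
  if (!config.enabled) {
    return { active: false, reason: "tracing disabled in config" };
  }
  if (config.apiKeyEnv === undefined) {
    // No key configured, e.g. a local unauthenticated collector.
    return { active: true, headers: {} };
  }
  const key = env[config.apiKeyEnv];
  if (!key) {
    // Surface the problem instead of letting the exporter drop spans silently.
    return { active: false, reason: `env var ${config.apiKeyEnv} is not set` };
  }
  // Header shape is an assumption; check the collector's auth documentation.
  return { active: true, headers: { authorization: `Bearer ${key}` } };
}

/** Only produce a "Trace:" console line when spans are really being exported. */
function traceUrlLine(state: TracingState, url: string): string | null {
  return state.active ? `Trace: ${url}` : null;
}
```

A caller could log `state.reason` once at startup and skip every trace-related console line afterwards, so the console output never implies a trace exists when none was shipped.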
Sequence Diagram

sequenceDiagram
    participant ER as eval-runner.ts
    participant SA as single-agent.ts
    participant TR as tracing.ts
    participant AISDK as AiSdkAgent
    participant OTEL as OTel / Phoenix
    ER->>TR: initTracing(config)
    TR->>OTEL: Register NodeTracerProvider + OTLPTraceExporter
    loop for each task
        ER->>TR: withTaskSession(task, config, runId, fn)
        TR->>OTEL: startActiveSpan(eval.task, {session.id, ...})
        ER->>SA: agent.runTask(task)
        SA->>TR: getAiSdkTelemetry(task, config, runId, conversationId)
        SA->>AISDK: generate({ experimental_telemetry })
        AISDK->>OTEL: emit LLM + tool-call spans
        loop per tool call
            SA->>SA: browser.screenshot()
            SA->>TR: recordScreenshotSpan(toolCallId, toolName, base64)
            TR->>OTEL: startSpan(eval.step.screenshot)
        end
    end
    ER->>TR: flushTracing()
    TR->>OTEL: forceFlush() + shutdown()
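The runner-side call order in the diagram can be sketched as a small loop. The signatures here are deliberately simplified and hypothetical (the diagram shows `initTracing(config)` and `withTaskSession(task, config, runId, fn)`); `TracingSketch` and `runEvalSketch` are names invented for this example. Flushing in a `finally` block is an assumption about how one might make sure spans are shipped even when a task throws.

```typescript
// Simplified, hypothetical view of the tracing module's surface.
interface TracingSketch {
  initTracing(): void;
  withTaskSession<T>(taskId: string, fn: () => Promise<T>): Promise<T>;
  flushTracing(): Promise<void>;
}

/**
 * Initialise tracing once, run every task inside its own session,
 * and flush at the end whether or not a task failed.
 */
async function runEvalSketch(
  tracing: TracingSketch,
  taskIds: string[],
  runTask: (taskId: string) => Promise<string>,
): Promise<string[]> {
  tracing.initTracing();
  const results: string[] = [];
  try {
    for (const taskId of taskIds) {
      // Tasks run sequentially so each session wraps exactly one task.
      results.push(await tracing.withTaskSession(taskId, () => runTask(taskId)));
    }
  } finally {
    await tracing.flushTracing();
  }
  return results;
}
```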
Summary
- Replace `@lmnr-ai/lmnr` with a plain OpenTelemetry stack pointed at Phoenix's OTLP endpoint, using `@arizeai/openinference-vercel` to translate Vercel AI SDK spans into OpenInference semantic conventions.
- Each task is grouped into a Phoenix session via the `session.id` attribute (sessionId = `<runId>-<queryId>`); per-tool-call screenshots stay as a manual child span with the base64 PNG in `output.value`.
- The smoke config (`benchmark-configs/agisdk-mini.jsonc`) runs both tasks end-to-end against the cloud workspace, with the summary written and traces shipped.

Design
POC quality, not future-proof. The `tracing.ts` module owns all OTel + Phoenix wiring; its public surface (`initTracing`, `getAiSdkTelemetry`, `withTaskSession`, `recordScreenshotSpan`, `flushTracing`, `isTracingEnabled`, `getTaskSessionId`) is preserved so call sites in `single-agent.ts` need zero edits and `eval-runner.ts` only renames the helper.

`benchmark-config.ts` swaps the `laminar` Zod block for `phoenix` (`enabled`, `endpoint`, `apiKeyEnv?`, `projectName`, `sessionPrefix`). Both JSONC configs (`agisdk-mini.jsonc`, `agisdk-smoke.jsonc`) point at the cloud Phoenix endpoint by default; a local `phoenix serve` is a one-line config change.

See `.llm/specs/2026-04-28-eval2-laminar-design.md` for the broader eval2 design and `.llm/0429_replace_laminar_with_phoenix/` for this swap's design + PRD.

Test plan
- `cd packages/browseros-agent/apps/eval2 && bun run typecheck` exits 0.
- `bun run eval --config benchmark-configs/agisdk-mini.jsonc` runs both agisdk tasks end-to-end (FAIL on grader is expected: that reflects agent quality, not infra).
- `results/<runId>/summary.json` has `phoenixSessionId` populated for each task.
- Phoenix project `browseros-eval2` shows two sessions named `<runId>-<queryId>`, each with LLM/tool spans and `eval.step.screenshot` entries that render the captured PNG.

🤖 Generated with Claude Code
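The `summary.json` check in the test plan could be automated along these lines. The `RunSummary` and `TaskSummary` shapes are guesses made for illustration, since the real file layout is not shown in this PR; only the `phoenixSessionId` field name and the `<runId>-<queryId>` format come from the description.

```typescript
// Hypothetical shape of results/<runId>/summary.json; the real file may differ.
interface TaskSummary {
  queryId: string;
  phoenixSessionId?: string;
}

interface RunSummary {
  runId: string;
  tasks: TaskSummary[];
}

/** Return one message per task whose session ID is missing or malformed. */
function findSessionIdProblems(summary: RunSummary): string[] {
  const problems: string[] = [];
  for (const task of summary.tasks) {
    const expected = `${summary.runId}-${task.queryId}`;
    if (!task.phoenixSessionId) {
      problems.push(`${task.queryId}: phoenixSessionId missing`);
    } else if (task.phoenixSessionId !== expected) {
      problems.push(`${task.queryId}: expected ${expected}, got ${task.phoenixSessionId}`);
    }
  }
  return problems;
}
```

An empty result would correspond to the third test-plan item passing.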