AudioRecognition leaks all inbound audio frames when using realtime_llm turn detection without local VAD/STT #1462

edit: dang usually claude labels itself -- this issue was fired up by Claude Code

`AudioRecognition` unconditionally `tee()`s the primary audio input stream into `vadInputStream` and `sttInputStream` branches, even when no VAD or STT consumer is configured. With `realtime_llm` turn detection and a `RealtimeModel` (no local VAD or STT), nothing reads either branch, so their internal buffers grow for the lifetime of the call.

**Impact:** ~300 KB/s RSS growth per active call. A 5-minute call retains ~90 MB; an 8-minute call ~150 MB. Memory drops cleanly on call end (not a cross-call leak), but under concurrent load this exhausts worker memory.

### Heap snapshot evidence

Comparison view, 30s baseline → 300s (single call, `@livekit/agents` 1.4.0):

| Constructor | # New | # Freed | # Delta | Size Delta |
|---|---|---|---|---|
| ArrayBuffer | 27,164 | 0 | +27,164 | +24.3 MB |
| AudioFrame | 27,164 | 0 | +27,164 | +1.7 MB |
| Int16Array | 27,164 | 0 | +27,164 | +1.3 MB |

27,164 frames / 270s = ~100 fps, matching rtc-node delivering 10ms chunks at 24kHz. Zero freed — every frame from the window is retained.

Retainer chain for a retained `AudioFrame`:

```
AudioFrame → {value, done: false} → Deque (auto-grown to 32,768 slots)
  → ReadableStreamDefaultController → ReadableStream
    → vadInputStream in AudioRecognition
      → audioRecognition in AgentActivity → AgentSession (GC root)
```

### Config that triggers it

```ts
const session = new voice.AgentSession({
  llm: new openai.realtime.RealtimeModel({ /* ... */ }),
  // No vad — server handles turn detection
  // No stt — realtime model handles transcription
});
```

This is used with `turnDetection: 'realtime_llm'`. It is a valid, supported config: `AgentActivity.resolveInterruptionDetector()` accepts it without warning.

### Root cause

`audio_recognition.ts` lines 309–328: both branches of the `if (interruptionDetection)` block create `vadInputStream` via `tee()` regardless of whether a VAD consumer exists. `createVadTask()` bails immediately with `if (!vad) return;`, so the branch is never drained. `sttInputStream` follows the same pattern: `startSttTasks()` bails with `if (!this.stt) return;`, leaving the merged stream unread.

`mergeReadableStreams` eagerly reads from its inputs and enqueues into the output controller, so even though `primaryInputStream` is consumed by the merge, the output queue grows without bound when nobody reads from `sttInputStream`.
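
For anyone who wants to see the buffering behaviour in isolation, here is a minimal sketch using only the standard Web Streams API from Node. It is not the `AudioRecognition` code: `demoUnreadTeeBranch` and the 480-byte dummy chunks are made up for illustration. One `tee()` branch is consumed, the other is left alone, and afterwards we count what piled up in the untouched branch.

```typescript
import { ReadableStream } from "node:stream/web";

// Hypothetical demo helper (not part of @livekit/agents): tee a source,
// read only one branch, then count how many chunks the other branch kept.
async function demoUnreadTeeBranch(
  frameCount: number,
): Promise<{ consumed: number; buffered: number }> {
  let produced = 0;
  const source = new ReadableStream<Uint8Array>({
    pull(controller) {
      if (produced < frameCount) {
        controller.enqueue(new Uint8Array(480)); // dummy stand-in for an audio frame
        produced++;
      } else {
        controller.close();
      }
    },
  });

  const [readBranch, unreadBranch] = source.tee();

  // Fully consume the first branch. Every pull here also enqueues the same
  // chunk into the second branch, whether or not anyone is reading it.
  let consumed = 0;
  const reader = readBranch.getReader();
  while (!(await reader.read()).done) {
    consumed++;
  }

  // Only now look at the second branch: everything is still queued inside it.
  let buffered = 0;
  const lateReader = unreadBranch.getReader();
  while (!(await lateReader.read()).done) {
    buffered++;
  }

  return { consumed, buffered };
}
```

Under standard `tee()` semantics, `buffered` should come back equal to `frameCount`: the unread branch held every chunk until something finally read it. In the leaking config nothing ever does, which is consistent with the retainer chain above.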

### Related

- PR #570 introduced the current `tee()` architecture. It considered "realtime_llm requested but model lacks turn-detection capability" but not "realtime_llm valid + no VAD/STT consumers."
- PR #1201 (closed) attempted a broader `ReadableStream → Chan` migration, partly motivated by these backpressure issues.

### Separately: Promise.race reaction leak in `realtime_model.ts`

`RealtimeSession.forwardEvents()` calls `Promise.race([messageChannel.get(), abortFuture.await])` in a while loop. Each iteration attaches a new reaction record to the long-lived `abortFuture.await` promise, retaining the resolved event (including base64 audio strings). Adds ~17 MB per 5-min call. `Queue.get()` already accepts `{ signal }` — using that instead eliminates the leak.
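
To illustrate the signal-based alternative: `Promise.race` subscribes to every input promise on each call, so a long-lived, still-pending promise collects one reaction per loop iteration. The sketch below avoids the race entirely. `SignalQueue` is a hypothetical stand-in written for this example, not the library's `Queue`, and this `forwardEvents` is a simplified loop rather than the real method. The point is that the abort listener is removed as soon as an item arrives, so nothing accumulates on the signal.

```typescript
// Hypothetical minimal queue; the real Queue in the library may differ.
class SignalQueue<T> {
  private items: T[] = [];
  private waiters: Array<(item: T) => void> = [];

  put(item: T): void {
    const waiter = this.waiters.shift();
    if (waiter) waiter(item);
    else this.items.push(item);
  }

  get(opts: { signal?: AbortSignal } = {}): Promise<T> {
    const { signal } = opts;
    if (signal?.aborted) return Promise.reject(new Error("aborted"));
    if (this.items.length > 0) return Promise.resolve(this.items.shift() as T);

    return new Promise<T>((resolve, reject) => {
      const waiter = (item: T) => {
        // Detach from the signal right away so no per-call state lingers.
        signal?.removeEventListener("abort", onAbort);
        resolve(item);
      };
      const onAbort = () => {
        this.waiters = this.waiters.filter((w) => w !== waiter);
        reject(new Error("aborted"));
      };
      this.waiters.push(waiter);
      signal?.addEventListener("abort", onAbort, { once: true });
    });
  }
}

// Simplified forwarding loop: no Promise.race against a long-lived promise.
async function forwardEvents<T>(
  queue: SignalQueue<T>,
  signal: AbortSignal,
  handle: (event: T) => void,
): Promise<number> {
  let forwarded = 0;
  while (true) {
    let event: T;
    try {
      event = await queue.get({ signal });
    } catch {
      return forwarded; // in this sketch, any rejection is treated as shutdown
    }
    handle(event);
    forwarded++;
  }
}
```

Each `get()` call owns its own listener and cleans it up, so the amount of state hanging off the abort signal stays constant no matter how many events pass through.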

### Fix

We have a working fix at [`jakswa/agents-js:fix/drain-unused-audio-streams`](https://github.com/jakswa/agents-js/compare/fix/drain-unused-audio-streams) (2 files, +25/−9 lines): skip the `tee()` when no consumer exists for a branch, drain `primaryInputStream` when neither VAD nor STT is configured, and replace `Promise.race` with `Queue.get({ signal })`. Feel free to cherry-pick, rewrite, or ignore — just want the fix to land in some form.
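
For readers who don't want to open the branch, the stream-splitting idea can be sketched roughly as below. This is a simplified illustration under assumptions, not the actual diff: `splitForConsumers`, `drain`, and the `hasVad`/`hasStt` flags are names invented here, and the real code also has to deal with the interruption-detection merge path.

```typescript
import { ReadableStream } from "node:stream/web";

interface Branches<T> {
  vad?: ReadableStream<T>;
  stt?: ReadableStream<T>;
}

// Read and discard everything so the stream never buffers.
async function drain<T>(stream: ReadableStream<T>): Promise<void> {
  const reader = stream.getReader();
  try {
    while (!(await reader.read()).done) {
      // intentionally discard the chunk
    }
  } finally {
    reader.releaseLock();
  }
}

// Hypothetical helper: only tee() when both consumers exist.
function splitForConsumers<T>(
  source: ReadableStream<T>,
  hasVad: boolean,
  hasStt: boolean,
): Branches<T> {
  if (hasVad && hasStt) {
    const [vad, stt] = source.tee();
    return { vad, stt };
  }
  if (hasVad) return { vad: source }; // single consumer: no tee needed
  if (hasStt) return { stt: source };

  // No local consumer at all (e.g. realtime_llm): drain so nothing is retained.
  drain(source).catch(() => {
    /* ignoring stream errors is a choice made for this sketch only */
  });
  return {};
}
```

The design choice is that a branch only exists if something will read it. With neither consumer configured, the source is drained in the background instead of being left to queue frames.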
