AudioRecognition leaks all inbound audio frames when using realtime_llm turn detection without local VAD/STT #1462

@jakswa

Description

edit: dang usually claude labels itself -- this issue was fired up by Claude Code

AudioRecognition unconditionally tee()s the primary audio input stream into vadInputStream and sttInputStream branches, even when no VAD or STT consumer is configured. When using realtime_llm turn detection with a RealtimeModel (no local VAD or STT), both branches go unread and their internal buffers grow for the lifetime of the call.
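
The buffering behaviour described above can be reproduced with plain Web Streams, independent of the agents codebase. This is a minimal sketch, assuming Node 18+ where ReadableStream is a global; demoUnreadBranch and the 480-byte frame size are hypothetical stand-ins, not names or values from audio_recognition.ts.

```typescript
// Hypothetical demo: tee() a stream, read only branch A, then show that
// branch B silently queued every frame in the meantime.
async function demoUnreadBranch(frames: number): Promise<{
  readFromA: number;
  producedBeforeReadingB: number;
  bufferedInB: number;
}> {
  let produced = 0;
  const source = new ReadableStream<Uint8Array>({
    pull(controller) {
      if (produced < frames) {
        controller.enqueue(new Uint8Array(480)); // stand-in for one audio frame
        produced++;
      } else {
        controller.close();
      }
    },
  });

  const [a, b] = source.tee();

  // Consume branch A to completion; branch B is never touched.
  let readFromA = 0;
  const readerA = a.getReader();
  while (!(await readerA.read()).done) readFromA++;

  // The source is already exhausted, so anything B yields now
  // must have been held in B's internal queue.
  const producedBeforeReadingB = produced;
  let bufferedInB = 0;
  const readerB = b.getReader();
  while (!(await readerB.read()).done) bufferedInB++;

  return { readFromA, producedBeforeReadingB, bufferedInB };
}

demoUnreadBranch(1000).then((result) => console.log(result));
```

In a long-lived call the unread branch is never drained at all, so those queued frames stay reachable until the stream itself is released.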

Impact: ~300 KB/s RSS growth per active call. A 5-minute call retains ~90 MB; an 8-minute call ~150 MB. Memory drops cleanly on call end (not a cross-call leak), but under concurrent load this exhausts worker memory.
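
As a rough check on those figures, the projection is a simple multiplication. The sketch below assumes the ~300 KB/s rate holds steady and that retention scales linearly with concurrent calls; retainedMb and the 20-call example are hypothetical, not measurements.

```typescript
// Growth rate reported in this issue (approximate, decimal kilobytes).
const GROWTH_KB_PER_SEC = 300;

// Hypothetical helper: memory retained by in-flight calls, in MB.
function retainedMb(callSeconds: number, concurrentCalls = 1): number {
  return (GROWTH_KB_PER_SEC * callSeconds * concurrentCalls) / 1000;
}

console.log(retainedMb(5 * 60)); // 90
console.log(retainedMb(8 * 60)); // 144
console.log(retainedMb(5 * 60, 20)); // 1800
```

The 8-minute result of 144 MB is what the issue rounds to ~150 MB.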

Heap snapshot evidence

Comparison view, 30s baseline → 300s (single call, @livekit/agents 1.4.0):

| Constructor | # New | # Freed | # Delta | Size Delta |
|---|---|---|---|---|
| ArrayBuffer | 27,164 | 0 | +27,164 | +24.3 MB |
| AudioFrame | 27,164 | 0 | +27,164 | +1.7 MB |
| Int16Array | 27,164 | 0 | +27,164 | +1.3 MB |

27,164 frames / 270s = ~100 fps, matching rtc-node delivering 10ms chunks at 24kHz. Zero freed — every frame from the window is retained.

Retainer chain for a retained AudioFrame:

AudioFrame → {value, done: false} → Deque (auto-grown to 32,768 slots)
  → ReadableStreamDefaultController → ReadableStream
    → vadInputStream in AudioRecognition
      → audioRecognition in AgentActivity → AgentSession (GC root)

Config that triggers it

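import { voice } from '@livekit/agents';
import * as openai from '@livekit/agents-plugin-openai';

const session = new voice.AgentSession({
  llm: new openai.realtime.RealtimeModel({ /* ... */ }),
  // No vad: the server handles turn detection
  // No stt: the realtime model handles transcription
});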

With turnDetection: 'realtime_llm'. This is a valid, supported config — AgentActivity.resolveInterruptionDetector() accepts it without warning.

Root cause

audio_recognition.ts lines 309–328: both branches of the if (interruptionDetection) block create vadInputStream via tee() regardless of whether a VAD consumer exists. createVadTask() bails immediately with if (!vad) return;, so the branch is never drained. The same pattern applies to sttInputStream: startSttTasks() bails with if (!this.stt) return;, leaving the merged stream unread.

mergeReadableStreams eagerly reads from its inputs and enqueues into the output controller, so even though primaryInputStream is consumed by the merge, the output queue grows unbounded when nobody reads from sttInputStream.

Related

Separately: Promise.race reaction leak in realtime_model.ts

RealtimeSession.forwardEvents() calls Promise.race([messageChannel.get(), abortFuture.await]) in a while loop. Each iteration attaches a new reaction record to the long-lived abortFuture.await promise, retaining the resolved event (including base64 audio strings). Adds ~17 MB per 5-min call. Queue.get() already accepts { signal } — using that instead eliminates the leak.
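
To illustrate the signal-based alternative, here is a minimal sketch of a queue whose get() takes an AbortSignal and detaches its abort listener as soon as an item arrives, so nothing accumulates on a long-lived object across loop iterations. SimpleQueue and this forwardEvents are hypothetical stand-ins written for the example; they are not the library's Queue or RealtimeSession code, and the real Queue.get({ signal }) signature may differ.

```typescript
// Hypothetical minimal queue with a signal-aware get().
class SimpleQueue<T> {
  private items: T[] = [];
  private waiters: Array<(item: T) => void> = [];

  put(item: T): void {
    const waiter = this.waiters.shift();
    if (waiter) waiter(item);
    else this.items.push(item);
  }

  get(signal: AbortSignal): Promise<T> {
    if (signal.aborted) return Promise.reject(new Error('aborted'));
    if (this.items.length > 0) return Promise.resolve(this.items.shift() as T);
    return new Promise<T>((resolve, reject) => {
      const onAbort = () => {
        // Drop the pending waiter so an aborted get() retains nothing.
        this.waiters = this.waiters.filter((w) => w !== waiter);
        reject(new Error('aborted'));
      };
      const waiter = (item: T) => {
        // Detach from the long-lived signal once this get() is settled.
        signal.removeEventListener('abort', onAbort);
        resolve(item);
      };
      signal.addEventListener('abort', onAbort, { once: true });
      this.waiters.push(waiter);
    });
  }
}

// Hypothetical forwarding loop: no Promise.race against a shared promise.
async function forwardEvents<T>(
  queue: SimpleQueue<T>,
  signal: AbortSignal,
  handle: (event: T) => void,
): Promise<void> {
  while (!signal.aborted) {
    let event: T;
    try {
      event = await queue.get(signal);
    } catch {
      return; // aborted while waiting
    }
    handle(event);
  }
}
```

The design point is that each wait registers one listener and removes it on completion, instead of chaining a new reaction onto the same never-settling promise every iteration.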

Fix

We have a working fix at jakswa/agents-js:fix/drain-unused-audio-streams (2 files, +25/−9 lines): skip the tee() when no consumer exists for a branch, drain primaryInputStream when neither VAD nor STT is configured, and replace Promise.race with Queue.get({ signal }). Feel free to cherry-pick, rewrite, or ignore — just want the fix to land in some form.
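
The stream-handling part of that approach can be sketched roughly as follows. This is not the code on the branch: splitForConsumers, its boolean parameters, and the Branches shape are hypothetical, and the real AudioRecognition wiring (including the interruption-detection paths) is more involved. It assumes Node 18+ globals for ReadableStream and WritableStream.

```typescript
// Hypothetical result shape: a branch exists only if something will read it.
interface Branches<T> {
  vad?: ReadableStream<T>;
  stt?: ReadableStream<T>;
  drained?: Promise<void>;
}

// Hypothetical helper: tee() only when both consumers exist, pass the
// source straight through for a single consumer, and discard frames
// when there is no consumer at all.
function splitForConsumers<T>(
  source: ReadableStream<T>,
  needsVad: boolean,
  needsStt: boolean,
): Branches<T> {
  if (needsVad && needsStt) {
    const [vad, stt] = source.tee();
    return { vad, stt };
  }
  if (needsVad) return { vad: source };
  if (needsStt) return { stt: source };

  // No local consumer: read and drop every frame so nothing is queued.
  const drained = source.pipeTo(
    new WritableStream<T>({
      write() {
        /* intentionally discard the frame */
      },
    }),
  );
  return { drained };
}
```

With realtime_llm turn detection and no local VAD or STT, a helper like this would take the last path, so frames are released as soon as they are read.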
