edit: dang usually claude labels itself -- this issue was fired up by Claude Code
AudioRecognition unconditionally tee()s the primary audio input stream into vadInputStream and sttInputStream branches, even when no VAD or STT consumer is configured. When using realtime_llm turn detection with a RealtimeModel (no local VAD or STT), both branches go unread and their internal buffers grow for the lifetime of the call.
Impact: ~300 KB/s RSS growth per active call. A 5-minute call retains ~90 MB; an 8-minute call ~150 MB. Memory drops cleanly on call end (not a cross-call leak), but under concurrent load this exhausts worker memory.
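As a rough sanity check on those figures, the arithmetic can be sketched as below. The 300 KB/s default is the growth rate observed in this report; the 2048 MB worker budget is a made-up number for illustration only.

```typescript
// Back-of-the-envelope estimate of memory retained by a single call.
// growthKBPerSec defaults to the ~300 KB/s RSS growth observed in this report.
function retainedMB(callSeconds: number, growthKBPerSec = 300): number {
  return (callSeconds * growthKBPerSec) / 1000;
}

// How many equally long concurrent calls fit in a worker memory budget.
// The budget is a hypothetical input, not a measured value.
function maxConcurrentCalls(budgetMB: number, callSeconds: number): number {
  return Math.floor(budgetMB / retainedMB(callSeconds));
}

console.log(retainedMB(5 * 60)); // 90
console.log(retainedMB(8 * 60)); // 144
console.log(maxConcurrentCalls(2048, 5 * 60)); // 22
```

The 8-minute estimate of 144 MB is in line with the ~150 MB observed.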
Heap snapshot evidence
Comparison view, 30s baseline → 300s (single call, @livekit/agents 1.4.0):
| Constructor | # New | # Freed | # Delta | Size Delta |
|---|---|---|---|---|
| ArrayBuffer | 27,164 | 0 | +27,164 | +24.3 MB |
| AudioFrame | 27,164 | 0 | +27,164 | +1.7 MB |
| Int16Array | 27,164 | 0 | +27,164 | +1.3 MB |
27,164 frames / 270s = ~100 fps, matching rtc-node delivering 10ms chunks at 24kHz. Zero freed — every frame from the window is retained.
Retainer chain for a retained AudioFrame:
AudioFrame → {value, done: false} → Deque (auto-grown to 32,768 slots)
→ ReadableStreamDefaultController → ReadableStream
→ vadInputStream in AudioRecognition
→ audioRecognition in AgentActivity → AgentSession (GC root)
Config that triggers it
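// Imports assumed from the usual LiveKit Agents JS package layout.
import { voice } from '@livekit/agents';
import * as openai from '@livekit/agents-plugin-openai';

const session = new voice.AgentSession({
  llm: new openai.realtime.RealtimeModel({ /* ... */ }),
  // No vad: server handles turn detection
  // No stt: realtime model handles transcription
});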
The session runs with turnDetection: 'realtime_llm'. This is a valid, supported config: AgentActivity.resolveInterruptionDetector() accepts it without warning.
Root cause
audio_recognition.ts lines 309–328 — both branches of the if (interruptionDetection) block create vadInputStream via tee() regardless of whether a VAD consumer exists. createVadTask() bails immediately with if (!vad) return;, so the branch is never drained. Same pattern for sttInputStream — startSttTasks() bails with if (!this.stt) return;, leaving the merged stream unread.
mergeReadableStreams eagerly reads from its inputs and enqueues into the output controller, so even though primaryInputStream is consumed by the merge, the output queue grows unbounded when nobody reads from sttInputStream.
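The retention behaviour of an unread tee() branch can be reproduced in isolation with the standard Web Streams API (global in Node 18+). This is a minimal sketch, not the AudioRecognition code: makeFrameSource, countChunks and demo are hypothetical names, and it only exercises tee(), not mergeReadableStreams.

```typescript
// Source that produces `count` frames on demand, then closes.
function makeFrameSource(count: number): ReadableStream<Int16Array> {
  let sent = 0;
  return new ReadableStream<Int16Array>({
    pull(controller) {
      if (sent < count) {
        controller.enqueue(new Int16Array(240)); // 10 ms at 24 kHz, mono
        sent++;
      } else {
        controller.close();
      }
    },
  });
}

// Reads a stream to the end and returns how many chunks it delivered.
async function countChunks(stream: ReadableStream<Int16Array>): Promise<number> {
  const reader = stream.getReader();
  let n = 0;
  while (!(await reader.read()).done) n++;
  return n;
}

async function demo(frames: number): Promise<{ consumed: number; buffered: number }> {
  const [consumedBranch, idleBranch] = makeFrameSource(frames).tee();
  // Only one branch has a consumer, as when no VAD is configured.
  const consumed = await countChunks(consumedBranch);
  // Reading the idle branch afterwards still yields every chunk, which means
  // all of them were held in its internal queue in the meantime.
  const buffered = await countChunks(idleBranch);
  return { consumed, buffered };
}

demo(1000).then((r) => console.log(r)); // { consumed: 1000, buffered: 1000 }
```

In a live call the idle branch is never read at all, so that queue only shrinks when the stream is torn down at call end.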
Related
Separately: Promise.race reaction leak in realtime_model.ts
RealtimeSession.forwardEvents() calls Promise.race([messageChannel.get(), abortFuture.await]) in a while loop. Each iteration attaches a new reaction record to the long-lived abortFuture.await promise, retaining the resolved event (including base64 audio strings). Adds ~17 MB per 5-min call. Queue.get() already accepts { signal } — using that instead eliminates the leak.
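The difference between the two approaches can be sketched as follows. Promise.race registers a reaction on every input promise each time it is called, so a promise that never settles accumulates one per loop iteration; a signal-aware get() can instead remove its abort listener as soon as an item arrives. SimpleQueue below is a hypothetical stand-in written for this illustration, not the Queue class from @livekit/agents, whose get() is only known here to accept { signal }.

```typescript
// Hypothetical minimal queue with a signal-aware get().
class SimpleQueue<T> {
  private items: T[] = [];
  private waiters: Array<(item: T) => void> = [];

  put(item: T): void {
    const waiter = this.waiters.shift();
    if (waiter) waiter(item);
    else this.items.push(item);
  }

  get(opts: { signal?: AbortSignal } = {}): Promise<T> {
    const { signal } = opts;
    if (signal?.aborted) return Promise.reject(new Error('aborted'));
    if (this.items.length > 0) return Promise.resolve(this.items.shift() as T);
    return new Promise<T>((resolve, reject) => {
      const onAbort = () => {
        this.waiters = this.waiters.filter((w) => w !== waiter);
        reject(new Error('aborted'));
      };
      const waiter = (item: T) => {
        // Detach the listener so nothing accumulates on the long-lived signal.
        signal?.removeEventListener('abort', onAbort);
        resolve(item);
      };
      signal?.addEventListener('abort', onAbort, { once: true });
      this.waiters.push(waiter);
    });
  }
}

// Event loop that stops on abort without calling Promise.race.
async function forwardEvents<T>(
  queue: SimpleQueue<T>,
  signal: AbortSignal,
  onEvent: (event: T) => void,
): Promise<void> {
  while (true) {
    let event: T;
    try {
      event = await queue.get({ signal });
    } catch {
      return; // get() rejected because the signal was aborted
    }
    onEvent(event);
  }
}
```

Each iteration leaves at most one listener attached to the signal, and that listener is removed when the event is delivered, so resolved events are not pinned by the abort path.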
Fix
We have a working fix at jakswa/agents-js:fix/drain-unused-audio-streams (2 files, +25/−9 lines): skip the tee() when no consumer exists for a branch, drain primaryInputStream when neither VAD nor STT is configured, and replace Promise.race with Queue.get({ signal }). Feel free to cherry-pick, rewrite, or ignore — just want the fix to land in some form.
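The stream half of that approach can be sketched roughly as below. This is not the code from the linked branch: splitForConsumers and the Branches shape are hypothetical names, and the sketch assumes only the standard Web Streams API.

```typescript
// Hypothetical helper: route a source stream only to consumers that exist.
interface Branches<T> {
  vad?: ReadableStream<T>;
  stt?: ReadableStream<T>;
}

function splitForConsumers<T>(
  source: ReadableStream<T>,
  needs: { vad: boolean; stt: boolean },
): Branches<T> {
  if (needs.vad && needs.stt) {
    // Both consumers exist, so both branches will be read.
    const [vad, stt] = source.tee();
    return { vad, stt };
  }
  // A single consumer can read the source directly, with no tee().
  if (needs.vad) return { vad: source };
  if (needs.stt) return { stt: source };
  // No consumer: drain into a sink that discards chunks, so frames never pile up.
  void source.pipeTo(new WritableStream<T>()).catch(() => {});
  return {};
}
```

With neither VAD nor STT configured the source is still read continuously, which keeps upstream producers flowing while nothing is retained.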
Related
… tee() architecture. It considered "realtime_llm requested but model lacks turn-detection capability" but not "realtime_llm valid + no VAD/STT consumers."
… ReadableStream → Chan migration, partly motivated by these backpressure issues.