From 26d66c108c061db4df074b6dabaff30d441bb69f Mon Sep 17 00:00:00 2001 From: Mark Backman Date: Wed, 14 Jan 2026 10:15:53 -0500 Subject: [PATCH 1/2] Add example for on_user_turn_stop_timeout --- getting-started/quickstart.mdx | 2 +- server/utilities/turn-events.mdx | 8 ++++++-- 2 files changed, 7 insertions(+), 3 deletions(-) diff --git a/getting-started/quickstart.mdx b/getting-started/quickstart.mdx index 100a7dc3..0a6b3f4d 100644 --- a/getting-started/quickstart.mdx +++ b/getting-started/quickstart.mdx @@ -268,7 +268,7 @@ The context aggregator automatically collects user messages (after speech-to-tex When building web or mobile clients, you can use [Pipecat's client SDKs](/client/introduction) that communicate with your bot via the [RTVI (Real-Time Voice Interaction) protocol](/client/rtvi-standard). In our quickstart example, we initialize the RTVI processor to handle client-server messaging and events: ```python -rtvi = RTVIProcessor(config=RTVIConfig(config=[])) +rtvi = RTVIProcessor() ``` See below for how we incorporate the RTVI processor into the pipeline. diff --git a/server/utilities/turn-events.mdx b/server/utilities/turn-events.mdx index 0fd918b8..436a93c7 100644 --- a/server/utilities/turn-events.mdx +++ b/server/utilities/turn-events.mdx @@ -65,12 +65,16 @@ async def on_user_turn_stopped(aggregator, strategy, message: UserTurnStoppedMes ### on_user_turn_stop_timeout -Fired when a user turn times out without any stop strategy triggering. This is a fallback mechanism that ends the turn after a configurable timeout period (default: 5.0 seconds) when the user has stopped speaking according to VAD but no transcription-based stop has occurred. +Fired when a user turn times out without any stop strategy triggering. This is a fallback mechanism that ends the turn after a configurable timeout period (default: 5.0 seconds) when the user has stopped speaking according to VAD but no transcription-based stop has occurred. 
Commonly, this event is used to retrigger the LLM response after the user has stopped speaking. ```python @user_aggregator.event_handler("on_user_turn_stop_timeout") async def on_user_turn_stop_timeout(aggregator): - print("User turn ended due to timeout") + message = { + "role": "system", + "content": "Continue.", + } + await user_aggregator.queue_frame(LLMMessagesAppendFrame([message], run_llm=True)) ``` **Parameters:** From c44f6f114333003772cdf6e67871930197531b46 Mon Sep 17 00:00:00 2001 From: Mark Backman Date: Wed, 14 Jan 2026 10:21:41 -0500 Subject: [PATCH 2/2] Linting fixes --- .../fundamentals/custom-frame-processor.mdx | 2 + guides/fundamentals/user-input-muting.mdx | 11 +++- guides/learn/context-management.mdx | 5 +- guides/learn/pipeline.mdx | 4 +- guides/learn/speech-input.mdx | 10 +++- guides/learn/text-to-speech.mdx | 12 +++- guides/learn/transports.mdx | 4 +- server/pipeline/pipeline-params.mdx | 8 ++- server/pipeline/pipeline-task.mdx | 18 ++++-- server/utilities/external-turn-management.mdx | 17 ++++-- server/utilities/interruption-strategies.mdx | 5 +- .../utilities/smart-turn/fal-smart-turn.mdx | 7 ++- .../smart-turn/smart-turn-overview.mdx | 8 +-- server/utilities/transcript-processor.mdx | 7 ++- server/utilities/transcriptions.mdx | 1 + server/utilities/turn-events.mdx | 52 +++++++++--------- server/utilities/user-mute-strategies.mdx | 16 +++++- server/utilities/user-turn-strategies.mdx | 55 ++++++++++++++----- 18 files changed, 169 insertions(+), 73 deletions(-) diff --git a/guides/fundamentals/custom-frame-processor.mdx b/guides/fundamentals/custom-frame-processor.mdx index fdf82969..7d879769 100644 --- a/guides/fundamentals/custom-frame-processor.mdx +++ b/guides/fundamentals/custom-frame-processor.mdx @@ -34,6 +34,7 @@ class MetricsFrameLogger(FrameProcessor): This frame processor looks for `MetricsFrames`. When it sees one, it formats the data and logs it. 
It uses this `format_metrics` function: + ```python def format_metrics(metrics, indent=0): lines = [] @@ -78,6 +79,7 @@ pipeline = Pipeline( metrics_frame_processor, # Our custom FrameProcessor that pretty prints metrics frames ] ) +``` With this positioning, the `MetricsFrameLogger` FrameProcessor will receive every MetericsFrame in the pipeline. diff --git a/guides/fundamentals/user-input-muting.mdx b/guides/fundamentals/user-input-muting.mdx index d35143e2..98a067dc 100644 --- a/guides/fundamentals/user-input-muting.mdx +++ b/guides/fundamentals/user-input-muting.mdx @@ -38,7 +38,11 @@ This prevents user speech from being processed during muted periods. Pipecat provides several built-in strategies for determining when to mute user input: - + Mute only during the bot's first speech utterance. Useful for introductions when you want the bot to complete its greeting before the user can speak. @@ -61,8 +65,9 @@ Pipecat provides several built-in strategies for determining when to mute user i - The `FirstSpeechUserMuteStrategy` and `MuteUntilFirstBotCompleteUserMuteStrategy` strategies should not - be used together as they handle the first bot speech differently. + The `FirstSpeechUserMuteStrategy` and + `MuteUntilFirstBotCompleteUserMuteStrategy` strategies should not be used + together as they handle the first bot speech differently. ## Basic Implementation diff --git a/guides/learn/context-management.mdx b/guides/learn/context-management.mdx index 408c54af..f827551f 100644 --- a/guides/learn/context-management.mdx +++ b/guides/learn/context-management.mdx @@ -68,7 +68,10 @@ user_aggregator, assistant_aggregator = LLMContextAggregatorPair(context) ``` -The context aggregator also supports configuring [user turn strategies](/server/utilities/user-turn-strategies) and [user mute strategies](/server/utilities/user-mute-strategies) via `LLMUserAggregatorParams`. 
+ The context aggregator also supports configuring [user turn + strategies](/server/utilities/user-turn-strategies) and [user mute + strategies](/server/utilities/user-mute-strategies) via + `LLMUserAggregatorParams`. **About LLMContext:** diff --git a/guides/learn/pipeline.mdx b/guides/learn/pipeline.mdx index c0a3a9e9..186d2384 100644 --- a/guides/learn/pipeline.mdx +++ b/guides/learn/pipeline.mdx @@ -160,8 +160,8 @@ Understanding data flow is crucial for building effective pipelines: 4. `tts` converts text frames to `TTSAudioRawFrame`s, `AggregatedTextFrame`s, and `TTSTextFrame`s 5. `transport.output()` creates `OutputAudioRawFrame`s and sends audio back to user -* Note: An `LLMTextProcessor` can sit between the `llm` and `tts` to pre-aggregate `LLMTextFrame`s into `AggregatedTextFrame`s. This simply moves the aggregation step -out of the TTS. +- Note: An `LLMTextProcessor` can sit between the `llm` and `tts` to pre-aggregate `LLMTextFrame`s into `AggregatedTextFrame`s. This simply moves the aggregation step + out of the TTS. ### Frame Propagation diff --git a/guides/learn/speech-input.mdx b/guides/learn/speech-input.mdx index a6b678c5..cd07292b 100644 --- a/guides/learn/speech-input.mdx +++ b/guides/learn/speech-input.mdx @@ -10,11 +10,13 @@ A key to natural conversations is properly detecting when the user starts and st Pipecat uses [user turn strategies](/server/utilities/user-turn-strategies) to determine when user turns start and end. 
These strategies can use different techniques: **For detecting turn start:** + - Voice Activity Detection (VAD): triggers when speech is detected - Transcription-based (fallback): triggers when transcription is received but VAD didn't detect speech - Minimum words: waits for a minimum number of spoken words before triggering **For detecting turn end:** + - Transcription-based: analyzes transcription to determine when the user is done - Turn detection model: uses AI to understand if the user has finished their thought @@ -90,7 +92,9 @@ While VAD detects speech vs. silence, it can't understand linguistic context. A 2. **Turn End**: When the stop strategy determines the user is done, it emits `UserStoppedSpeakingFrame` -VAD also emits its own frames (`VADUserStartedSpeakingFrame`, `VADUserStoppedSpeakingFrame`) which indicate raw speech/silence detection. These are inputs to the turn strategies, not the final turn decisions. + VAD also emits its own frames (`VADUserStartedSpeakingFrame`, + `VADUserStoppedSpeakingFrame`) which indicate raw speech/silence detection. + These are inputs to the turn strategies, not the final turn decisions. ### Detecting Turn End @@ -141,6 +145,7 @@ When using Smart Turn, configure VAD with a low `stop_secs` (0.2) so the model c Interruptions stop the bot when the user starts speaking. This is controlled by the `enable_interruptions` parameter on start strategies (enabled by default). When a user turn starts with interruptions enabled: + 1. Bot immediately stops speaking 2. Pending audio and text is cleared 3. Pipeline ready for new user input @@ -154,7 +159,8 @@ start_strategy = VADUserTurnStartStrategy(enable_interruptions=False) ``` -Keep interruptions enabled (default) for natural conversations. This enables users to interrupt the bot mid-sentence, just like human conversations. + Keep interruptions enabled (default) for natural conversations. This enables + users to interrupt the bot mid-sentence, just like human conversations. 
## Best Practices diff --git a/guides/learn/text-to-speech.mdx b/guides/learn/text-to-speech.mdx index c8d2fcbc..ceb7e363 100644 --- a/guides/learn/text-to-speech.mdx +++ b/guides/learn/text-to-speech.mdx @@ -181,7 +181,11 @@ For TTS-specific text preprocessing, you can provide custom text transforms that Text transforms are registered directly on the TTS service instance via the `add_text_transformer()` method or during initialization using the `text_transforms` parameter. - The intentions of text transforms are meant to be TTS-specific modifications that do not affect the underlying LLM text or context. That said, since the context aggregator attempts to base its context on what was actually spoken, for services that support word timestamps, like Cartesia, ElevenLabs, and Rime,these transforms will modify the context as they modify what is spoken. + Text transforms are intended to be TTS-specific modifications that do not + affect the underlying LLM text or context. That said, since the context + aggregator attempts to base its context on what was actually spoken, for + services that support word timestamps, like Cartesia, ElevenLabs, and Rime, + these transforms will modify the context as they modify what is spoken. ```python @@ -227,7 +231,11 @@ tts.add_text_transformer(replace_acronyms, "*") # Apply to all text ### Text Filters -Text filters are no longer the preferred method for text preprocessing and will be deprecated in future releases. Instead, you should use one of the methods described above. + + Text filters are no longer the preferred method for text preprocessing and + will be deprecated in future releases. Instead, you should use one of the + methods described above. 
+ Apply preprocessing to text before synthesis: diff --git a/guides/learn/transports.mdx b/guides/learn/transports.mdx index 6378ffb8..f500c0df 100644 --- a/guides/learn/transports.mdx +++ b/guides/learn/transports.mdx @@ -131,7 +131,9 @@ params = TransportParams( - For advanced turn detection (like Smart Turn), configure [User Turn Strategies](/server/utilities/user-turn-strategies) on the context aggregator instead of using the transport's turn_analyzer parameter. + For advanced turn detection (like Smart Turn), configure [User Turn + Strategies](/server/utilities/user-turn-strategies) on the context aggregator + instead of using the transport's turn_analyzer parameter. - DEPRECATED: This parameter is deprecated. Configure interruption behavior via [User Turn Strategies](/server/utilities/user-turn-strategies) instead. See the `enable_interruptions` parameter on start strategies. + DEPRECATED: This parameter is deprecated. Configure interruption behavior + via [User Turn Strategies](/server/utilities/user-turn-strategies) instead. + See the `enable_interruptions` parameter on start strategies. - Whether to allow pipeline interruptions. When enabled, a user's speech will - immediately interrupt the bot's response. + Whether to allow pipeline interruptions. When enabled, a user's speech will immediately + interrupt the bot's response. diff --git a/server/pipeline/pipeline-task.mdx b/server/pipeline/pipeline-task.mdx index ad128d2f..e888f306 100644 --- a/server/pipeline/pipeline-task.mdx +++ b/server/pipeline/pipeline-task.mdx @@ -80,19 +80,27 @@ await runner.run(task) - Whether to enable OpenTelemetry tracing. See [The OpenTelemetry guide](/server/utilities/opentelemetry) for details. + Whether to enable OpenTelemetry tracing. See [The OpenTelemetry + guide](/server/utilities/opentelemetry) for details. - Whether to enable turn tracking. See [The OpenTelemetry guide](/server/utilities/opentelemetry) for details. + Whether to enable turn tracking. 
See [The OpenTelemetry + guide](/server/utilities/opentelemetry) for details. - Custom ID for the conversation. If not provided, a UUID will be generated. See [The OpenTelemetry guide](/server/utilities/opentelemetry) for details. + Custom ID for the conversation. If not provided, a UUID will be generated. See + [The OpenTelemetry guide](/server/utilities/opentelemetry) for details. - - Any additional attributes to add to top-level OpenTelemetry conversation span. See [The OpenTelemetry guide](/server/utilities/opentelemetry) for details. + + Any additional attributes to add to top-level OpenTelemetry conversation span. + See [The OpenTelemetry guide](/server/utilities/opentelemetry) for details. ## Methods diff --git a/server/utilities/external-turn-management.mdx b/server/utilities/external-turn-management.mdx index 6c0413de..f10adefd 100644 --- a/server/utilities/external-turn-management.mdx +++ b/server/utilities/external-turn-management.mdx @@ -8,6 +8,7 @@ description: "Handle turn detection externally using UserTurnProcessor or extern In some scenarios, turn detection happens externally, either through a dedicated processor or an external service. Pipecat provides `ExternalUserTurnStrategies`, a [user turn strategy](/server/utilities/user-turn-strategies) that defers turn handling to these external sources. External turn management might be needed when: + - **Multiple context aggregators**: Parallel pipelines with multiple LLMs need a single, shared source of turn events - **External services with turn detection**: Services like [Deepgram Flux](/server/services/stt/deepgram) or [Speechmatics](/server/services/stt/speechmatics) provide their own turn detection @@ -38,17 +39,25 @@ user_aggregator, assistant_aggregator = LLMContextAggregatorPair( `UserTurnProcessor` is a frame processor for managing user turn lifecycle when you need a single source of turn events shared across multiple context aggregators. 
It emits `UserStartedSpeakingFrame` and `UserStoppedSpeakingFrame` frames and handles interruptions. -`UserTurnProcessor` only manages user turn start and end events. It does not handle transcription aggregation, that remains the responsibility of the context aggregators. + `UserTurnProcessor` only manages user turn start and end events. It does not + handle transcription aggregation; that remains the responsibility of the + context aggregators. ### Constructor Parameters - - Configured strategies for starting and stopping user turns. See [User Turn Strategies](/server/utilities/user-turn-strategies) for available options. + + Configured strategies for starting and stopping user turns. See [User Turn + Strategies](/server/utilities/user-turn-strategies) for available options. - Timeout in seconds to automatically stop a user turn if no stop strategy triggers. + Timeout in seconds to automatically stop a user turn if no stop strategy + triggers. ### Event Handlers diff --git a/server/utilities/interruption-strategies.mdx b/server/utilities/interruption-strategies.mdx index 628cab47..e272d822 100644 --- a/server/utilities/interruption-strategies.mdx +++ b/server/utilities/interruption-strategies.mdx @@ -3,7 +3,10 @@ title: "Interruption Strategies" description: "Configure when users can interrupt the bot to prevent unwanted interruptions from brief affirmations" --- -DEPRECATED Interruption strategies have been deprecated in favor of [User Turn Strategies](/server/utilities/user-turn-strategies). + + DEPRECATED Interruption strategies have been deprecated in favor of [User Turn + Strategies](/server/utilities/user-turn-strategies). 
+ ## Overview diff --git a/server/utilities/smart-turn/fal-smart-turn.mdx b/server/utilities/smart-turn/fal-smart-turn.mdx index 08454ba9..73a11896 100644 --- a/server/utilities/smart-turn/fal-smart-turn.mdx +++ b/server/utilities/smart-turn/fal-smart-turn.mdx @@ -3,7 +3,12 @@ title: "Fal Smart Turn" description: "Cloud-hosted Smart Turn detection using Fal.ai" --- - DEPRECATED: `FalSmartTurnAnalyzer` is deprecated. Please use [LocalSmartTurnAnalyzerV3](/server/utilities/smart-turn/smart-turn-overview#local-smart-turn) instead, which provides fast CPU inference without requiring external API calls. + + DEPRECATED: `FalSmartTurnAnalyzer` is deprecated. Please use + [LocalSmartTurnAnalyzerV3](/server/utilities/smart-turn/smart-turn-overview#local-smart-turn) + instead, which provides fast CPU inference without requiring external API + calls. + ## Overview diff --git a/server/utilities/smart-turn/smart-turn-overview.mdx b/server/utilities/smart-turn/smart-turn-overview.mdx index b5b6dc3a..2f1de2a0 100644 --- a/server/utilities/smart-turn/smart-turn-overview.mdx +++ b/server/utilities/smart-turn/smart-turn-overview.mdx @@ -120,8 +120,9 @@ The `LocalSmartTurnAnalyzerV3` runs inference locally. Version 3 of the model su Path to the Smart Turn v3 ONNX file containing the model weights. Download this from https://huggingface.co/pipecat-ai/smart-turn-v3/tree/main - This parameter is optional, as Pipecat includes a copy of the model internally, and this - is used if the path is unset. +This parameter is optional, as Pipecat includes a copy of the model internally, and this +is used if the path is unset. + @@ -169,15 +170,12 @@ user_aggregator, assistant_aggregator = LLMContextAggregatorPair( ) ``` - ## How It Works Smart Turn Detection continuously analyzes audio streams to identify natural turn completion points: 1. **Audio Buffering**: The system continuously buffers audio with timestamps, maintaining a small buffer of pre-speech audio. - 2. 
**VAD Processing**: Voice Activity Detection (using the Silero model) detects when there is a pause in the user's speech. - 3. **Smart Turn Analysis**: When VAD detects a pause in speech, the Smart Turn model analyzes the audio from the most recent 8 seconds of the user's turn, and makes a decision about whether the turn is complete or incomplete. The system includes a fallback mechanism: if a turn is classified as incomplete but silence continues for longer than `stop_secs`, the turn is automatically marked as complete. diff --git a/server/utilities/transcript-processor.mdx b/server/utilities/transcript-processor.mdx index e641910d..56d7ce9b 100644 --- a/server/utilities/transcript-processor.mdx +++ b/server/utilities/transcript-processor.mdx @@ -3,7 +3,12 @@ title: "TranscriptProcessor" description: "Factory for creating and managing conversation transcript processors with shared event handling" --- - DEPRECATED: TranscriptProcessor has been deprecated. Use `on_user_turn_stopped` and `on_assistant_turn_stopped` events on the context aggregators to collect transcriptions, see [Transcriptions](/server/utilities/transcriptions) for details. + + DEPRECATED: TranscriptProcessor has been deprecated. Use + `on_user_turn_stopped` and `on_assistant_turn_stopped` events on the context + aggregators to collect transcriptions, see + [Transcriptions](/server/utilities/transcriptions) for details. + ## Overview diff --git a/server/utilities/transcriptions.mdx b/server/utilities/transcriptions.mdx index 3b9c94eb..cc8cfcd3 100644 --- a/server/utilities/transcriptions.mdx +++ b/server/utilities/transcriptions.mdx @@ -8,6 +8,7 @@ description: "Collect user and assistant conversation transcripts using turn eve Pipecat provides a straightforward way to collect conversation transcriptions using [turn events](/server/utilities/turn-events). When a user or assistant turn ends, the corresponding event includes the complete transcript for that turn. 
The key events for transcription collection are: + - **`on_user_turn_stopped`** - Provides the user's complete transcript via `UserTurnStoppedMessage` - **`on_assistant_turn_stopped`** - Provides the assistant's complete transcript via `AssistantTurnStoppedMessage` diff --git a/server/utilities/turn-events.mdx b/server/utilities/turn-events.mdx index 436a93c7..4dfe697f 100644 --- a/server/utilities/turn-events.mdx +++ b/server/utilities/turn-events.mdx @@ -13,12 +13,12 @@ Turn events provide hooks into the conversation turn lifecycle, allowing you to ## Events Summary -| Event | Emitter | Description | -|-------|---------|-------------| -| `on_user_turn_started` | `user_aggregator` | User begins speaking | -| `on_user_turn_stopped` | `user_aggregator` | User finishes speaking (includes transcript) | -| `on_user_turn_stop_timeout` | `user_aggregator` | User turn ended due to timeout | -| `on_assistant_turn_started` | `assistant_aggregator` | Assistant begins responding | +| Event | Emitter | Description | +| --------------------------- | ---------------------- | --------------------------------------------------- | +| `on_user_turn_started` | `user_aggregator` | User begins speaking | +| `on_user_turn_stopped` | `user_aggregator` | User finishes speaking (includes transcript) | +| `on_user_turn_stop_timeout` | `user_aggregator` | User turn ended due to timeout | +| `on_assistant_turn_started` | `assistant_aggregator` | Assistant begins responding | | `on_assistant_turn_stopped` | `assistant_aggregator` | Assistant finishes responding (includes transcript) | ## User Turn Events @@ -37,10 +37,10 @@ async def on_user_turn_started(aggregator, strategy): **Parameters:** -| Parameter | Type | Description | -|-----------|------|-------------| -| `aggregator` | `LLMUserAggregator` | The user aggregator instance | -| `strategy` | `BaseUserTurnStartStrategy` | The strategy that triggered the turn start | +| Parameter | Type | Description | +| ------------ | 
--------------------------- | ------------------------------------------ | +| `aggregator` | `LLMUserAggregator` | The user aggregator instance | +| `strategy` | `BaseUserTurnStartStrategy` | The strategy that triggered the turn start | ### on_user_turn_stopped @@ -57,11 +57,11 @@ async def on_user_turn_stopped(aggregator, strategy, message: UserTurnStoppedMes **Parameters:** -| Parameter | Type | Description | -|-----------|------|-------------| -| `aggregator` | `LLMUserAggregator` | The user aggregator instance | -| `strategy` | `BaseUserTurnStopStrategy` | The strategy that triggered the turn stop | -| `message` | `UserTurnStoppedMessage` | Contains the user's transcript and metadata | +| Parameter | Type | Description | +| ------------ | -------------------------- | ------------------------------------------- | +| `aggregator` | `LLMUserAggregator` | The user aggregator instance | +| `strategy` | `BaseUserTurnStopStrategy` | The strategy that triggered the turn stop | +| `message` | `UserTurnStoppedMessage` | Contains the user's transcript and metadata | ### on_user_turn_stop_timeout @@ -79,12 +79,13 @@ async def on_user_turn_stop_timeout(aggregator): **Parameters:** -| Parameter | Type | Description | -|-----------|------|-------------| +| Parameter | Type | Description | +| ------------ | ------------------- | ---------------------------- | | `aggregator` | `LLMUserAggregator` | The user aggregator instance | -After `on_user_turn_stop_timeout` fires, `on_user_turn_stopped` will also be called with the accumulated transcript. + After `on_user_turn_stop_timeout` fires, `on_user_turn_stopped` will also be + called with the accumulated transcript. 
## Assistant Turn Events @@ -103,8 +104,8 @@ async def on_assistant_turn_started(aggregator): **Parameters:** -| Parameter | Type | Description | -|-----------|------|-------------| +| Parameter | Type | Description | +| ------------ | ------------------------ | --------------------------------- | | `aggregator` | `LLMAssistantAggregator` | The assistant aggregator instance | ### on_assistant_turn_stopped @@ -120,13 +121,14 @@ async def on_assistant_turn_stopped(aggregator, message: AssistantTurnStoppedMes **Parameters:** -| Parameter | Type | Description | -|-----------|------|-------------| -| `aggregator` | `LLMAssistantAggregator` | The assistant aggregator instance | -| `message` | `AssistantTurnStoppedMessage` | Contains the assistant's transcript and metadata | +| Parameter | Type | Description | +| ------------ | ----------------------------- | ------------------------------------------------ | +| `aggregator` | `LLMAssistantAggregator` | The assistant aggregator instance | +| `message` | `AssistantTurnStoppedMessage` | Contains the assistant's transcript and metadata | -This event fires when the LLM response completes, when the user interrupts, or when a user image is appended to context. + This event fires when the LLM response completes, when the user interrupts, or + when a user image is appended to context. 
## Message Types diff --git a/server/utilities/user-mute-strategies.mdx b/server/utilities/user-mute-strategies.mdx index a46589c8..a899f9fd 100644 --- a/server/utilities/user-mute-strategies.mdx +++ b/server/utilities/user-mute-strategies.mdx @@ -47,6 +47,7 @@ strategy = AlwaysUserMuteStrategy() ``` **Behavior:** + - Mutes when `BotStartedSpeakingFrame` is received - Unmutes when `BotStoppedSpeakingFrame` is received @@ -61,12 +62,14 @@ strategy = FirstSpeechUserMuteStrategy() ``` **Behavior:** + - Allows user input before bot speaks - Mutes during the first bot speech only - Unmutes permanently after first speech completes -Use this strategy when you want to ensure the bot's greeting or initial response isn't interrupted, but allow normal interruptions afterward. + Use this strategy when you want to ensure the bot's greeting or initial + response isn't interrupted, but allow normal interruptions afterward. ### MuteUntilFirstBotCompleteUserMuteStrategy @@ -80,12 +83,15 @@ strategy = MuteUntilFirstBotCompleteUserMuteStrategy() ``` **Behavior:** + - Mutes immediately when the pipeline starts (before bot speaks) - Remains muted until first `BotStoppedSpeakingFrame` is received - Unmutes permanently after first speech completes -Unlike `FirstSpeechUserMuteStrategy`, this strategy mutes user input even before the bot starts speaking. Use this when you don't want to process any user input until the bot has delivered its initial message. + Unlike `FirstSpeechUserMuteStrategy`, this strategy mutes user input even + before the bot starts speaking. Use this when you don't want to process any + user input until the bot has delivered its initial message. 
### FunctionCallUserMuteStrategy @@ -99,12 +105,15 @@ strategy = FunctionCallUserMuteStrategy() ``` **Behavior:** + - Mutes when `FunctionCallsStartedFrame` is received - Tracks multiple concurrent function calls - Unmutes when all function calls complete (via `FunctionCallResultFrame` or `FunctionCallCancelFrame`) -This strategy is particularly useful when function calls trigger external API requests or database operations that may take several seconds to complete and you don't want to the user to interrupt the output. + This strategy is particularly useful when function calls trigger external API + requests or database operations that may take several seconds to complete and + you don't want the user to interrupt the output. ## Combining Multiple Strategies @@ -124,6 +133,7 @@ user_aggregator, assistant_aggregator = LLMContextAggregatorPair( ``` In this example, user input is muted: + - From pipeline start until the bot completes its first speech - Whenever function calls are executing (even after first speech) diff --git a/server/utilities/user-turn-strategies.mdx b/server/utilities/user-turn-strategies.mdx index ffa6b2e7..c7b50a24 100644 --- a/server/utilities/user-turn-strategies.mdx +++ b/server/utilities/user-turn-strategies.mdx @@ -8,6 +8,7 @@ description: "Configure how user turns are detected and managed in conversations User turn strategies provide fine-grained control over how user speaking turns are detected in conversations. They determine when a user's turn starts (user begins speaking) and when it stops (user finishes speaking and expects a response). By default, Pipecat uses a combination of VAD (Voice Activity Detection) and transcription-based detection: + - **Start**: VAD detection or transcription received - **Stop**: Transcription received after VAD indicates silence @@ -16,6 +17,7 @@ You can customize this behavior by providing your own strategies for more sophis ## How It Works 1.
**Turn Start Detection**: When any start strategy triggers, the user aggregator: + - Marks the start of a user turn - Optionally emits `UserStartedSpeakingFrame` - Optionally emits an interruption frame (if the bot is speaking) @@ -23,6 +25,7 @@ You can customize this behavior by providing your own strategies for more sophis 2. **During User Turn**: The aggregator collects transcriptions and audio frames. 3. **Turn Stop Detection**: When a stop strategy triggers, the user aggregator: + - Marks the end of the user turn - Emits `UserStoppedSpeakingFrame` - Pushes the aggregated user message to the LLM context @@ -62,11 +65,14 @@ Start strategies determine when a user's turn begins. Multiple strategies can be All start strategies inherit these parameters: - If True, the user aggregator will emit an interruption frame when the user turn starts, allowing the user to interrupt the bot. + If True, the user aggregator will emit an interruption frame when the user + turn starts, allowing the user to interrupt the bot. - If True, the user aggregator will emit frames indicating when the user starts speaking. Disable this if another component (e.g., an STT service) already generates these frames. + If True, the user aggregator will emit frames indicating when the user starts + speaking. Disable this if another component (e.g., an STT service) already + generates these frames. ### VADUserTurnStartStrategy @@ -84,7 +90,8 @@ strategy = VADUserTurnStartStrategy() Triggers a user turn start when a transcription is received. This serves as a fallback for scenarios where VAD-based detection fails (e.g., when the user speaks very softly) but the STT service still produces transcriptions. - Whether to trigger on interim (partial) transcription frames for earlier detection. + Whether to trigger on interim (partial) transcription frames for earlier + detection. 
```python @@ -113,7 +120,9 @@ strategy = MinWordsUserTurnStartStrategy(min_words=3) ``` - When the bot is not speaking, this strategy will trigger after just 1 word. The `min_words` threshold only applies when the bot is actively speaking, preventing short affirmations from interrupting the bot. + When the bot is not speaking, this strategy will trigger after just 1 word. + The `min_words` threshold only applies when the bot is actively speaking, + preventing short affirmations from interrupting the bot. ### ExternalUserTurnStartStrategy @@ -127,7 +136,9 @@ strategy = ExternalUserTurnStartStrategy() ``` - This strategy automatically sets `enable_interruptions=False` and `enable_user_speaking_frames=False` since these are expected to be handled by the external processor. + This strategy automatically sets `enable_interruptions=False` and + `enable_user_speaking_frames=False` since these are expected to be handled by + the external processor. ## Stop Strategies @@ -139,7 +150,8 @@ Stop strategies determine when a user's turn ends and the bot should respond. All stop strategies inherit these parameters: - If True, the aggregator will emit frames indicating when the user stops speaking. Disable this if another component already generates these frames. + If True, the aggregator will emit frames indicating when the user stops + speaking. Disable this if another component already generates these frames. ### TranscriptionUserTurnStopStrategy @@ -147,7 +159,8 @@ All stop strategies inherit these parameters: The default stop strategy that signals the end of a user turn when transcription is received and VAD indicates silence. - A short delay in seconds used to handle consecutive or slightly delayed transcriptions gracefully. + A short delay in seconds used to handle consecutive or slightly delayed + transcriptions gracefully. 
```python @@ -165,7 +178,8 @@ Uses an AI-powered turn detection model to determine when the user has finished - A short delay in seconds used to handle consecutive or slightly delayed transcriptions. + A short delay in seconds used to handle consecutive or slightly delayed + transcriptions. ```python @@ -178,7 +192,9 @@ strategy = TurnAnalyzerUserTurnStopStrategy( ``` - See the [Smart Turn Detection](/server/utilities/smart-turn/smart-turn-overview) documentation for more information on available turn analyzers. + See the [Smart Turn + Detection](/server/utilities/smart-turn/smart-turn-overview) documentation for + more information on available turn analyzers. ### ExternalUserTurnStopStrategy @@ -186,7 +202,8 @@ strategy = TurnAnalyzerUserTurnStopStrategy( Delegates turn stop detection to an external processor. This strategy listens for `UserStoppedSpeakingFrame` frames emitted by other components in the pipeline. - A short delay in seconds used to handle consecutive or slightly delayed transcriptions. + A short delay in seconds used to handle consecutive or slightly delayed + transcriptions. ```python @@ -199,12 +216,22 @@ strategy = ExternalUserTurnStopStrategy() Container for configuring user turn start and stop strategies. - - List of strategies used to detect when the user starts speaking. The first strategy to trigger will signal the start of the user's turn. + + List of strategies used to detect when the user starts speaking. The first + strategy to trigger will signal the start of the user's turn. - - List of strategies used to detect when the user stops speaking and expects a response. + + List of strategies used to detect when the user stops speaking and expects a + response. ## ExternalUserTurnStrategies