refactor

stevehuang52 · stevehuang52 · commit 5cc5e2668335 · 2025-09-02T15:15:24.000-04:00
Signed-off-by: stevehuang52 <heh@nvidia.com>
diff --git a/examples/voice_agent/README.md b/examples/voice_agent/README.md
@@ -15,11 +15,11 @@ A [Pipecat](https://github.com/pipecat-ai/pipecat) example demonstrating the sim
 
 
 ## 💡 Upcoming Next
-- More accurate and noise-robust streaming ASR and diarization models.
+- More accurate and noise-robust streaming ASR models.
 - Faster EOU detection and backchannel handling (e.g., bot will not stop speaking when user is saying something like "uhuh", "wow", "i see").
-- Better streaming ASR and diarization pipeline.
+- Better streaming ASR and speaker diarization pipeline.
 - Better TTS model with more natural voice.
-- Joint ASR and diarization model.
+- Joint ASR and speaker diarization model.
 - Function calling, RAG, etc.
 
 
@@ -61,7 +61,7 @@ Alternatively, you can install the dependencies manually in an existing environm
 ```bash
 pip install -r requirements.txt
 ```
-The incompatability errors from pip can be ignored, if any.
+The incompatibility errors from pip can be ignored.
 
 ### Configure the server
 
@@ -119,13 +119,13 @@ Please refer to the HuggingFace webpage of each model to configure the model par
 
 ### 🎤 ASR 
 
-We use [cache-aware streaming FastConformer](https://arxiv.org/abs/2312.17279) to transcribe the user's speech. While new models are to be released, we use the existing English models for now:
+We use [cache-aware streaming FastConformer](https://arxiv.org/abs/2312.17279) to transcribe the user's speech into text. While new models will be released soon, we use the existing English models for now:
 - [stt_en_fastconformer_hybrid_large_streaming_80ms](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_fastconformer_hybrid_large_streaming_80ms)  (default)
 - [nvidia/stt_en_fastconformer_hybrid_large_streaming_multi](https://huggingface.co/nvidia/stt_en_fastconformer_hybrid_large_streaming_multi)
 
-### 💬 Diarization
+### 💬 Speaker Diarization
 
-We use [streaming Sortformer](http://arxiv.org/abs/2507.18446) to detect the speaker for each user turn. As of now, we only support detecting 1 speaker for a single user turn, but different turns can be from different speakers, with a maximum of 4 speakers in the whole conversation. Currently supported models are:
+Speaker diarization aims to distinguish different speakers in the input speech audio. We use [streaming Sortformer](http://arxiv.org/abs/2507.18446) to detect the speaker for each user turn. As of now, we only support detecting 1 speaker for a single user turn, but different turns can be from different speakers, with a maximum of 4 speakers in the whole conversation. Currently supported models are:
  - [nvidia/diar_streaming_sortformer_4spk-v2](https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2) (default)
 
 
@@ -136,9 +136,16 @@ Please note that in some circumstances, the diarization model might not work wel
 We use [FastPitch-HiFiGAN](https://huggingface.co/nvidia/tts_en_fastpitch) to generate the speech for the LLM response, and it only supports English output. More TTS models will be supported in the future.
 
 
+### Turn-taking
+
+As the new turn-taking prediction model is not yet released, we use VAD-based turn-taking prediction for now. You can set `vad.stop_secs` in `server/server_config.yaml` to control how much silence is needed to indicate the end of a user's turn.
+
+Additionally, the voice agent supports ignoring backchannel phrases while the bot is talking, which means that phrases such as "uh-huh", "yeah", and "okay" will not interrupt the bot. To control which backchannel phrases are used, set `turn_taking.backchannel_phrases` in `server/server_config.yaml` to either a list of phrases or the path to a YAML file containing the list. Setting it to `null` disables backchannel detection, in which case the VAD will interrupt the bot immediately when the user starts speaking.
+
+
 ## 📝 Notes & FAQ
 - Only one connection to the server is supported at a time, a new connection will disconnect the previous one, but the context will be preserved.
-- If directly loading from HuggingFace and got I/O erros, you can set `llm.model=<local_path>`, where the model is downloaded via somehing like `huggingface-cli download Qwen/Qwen3-8B --local-dir <local_path>`. Same for TTS models.
+- If you get I/O errors when loading directly from HuggingFace, you can set `llm.model=<local_path>`, where the model is downloaded using a command like `huggingface-cli download Qwen/Qwen3-8B --local-dir <local_path>`. The same applies to TTS models.
 - The current ASR and diarization models are not noise-robust, you might need to use a noise-cancelling microphone or a quiet environment. But we will release better models soon.
 - The diarization model works best with speakers that have much more different voices from each other, while it might not work well on some accents due to the limited training data.
 - If you see errors like `SyntaxError: Unexpected reserved word` when running `npm run dev`, please update the Node.js version.
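
The backchannel handling described in the Turn-taking section can be illustrated with a minimal sketch. The helper names below (`normalize`, `build_phrase_set`, `is_backchannel`) are hypothetical and are not part of the NeMo or Pipecat APIs; the sketch assumes a transcript counts as a backchannel only when its normalized form exactly matches a configured phrase, which may differ from what `NeMoTurnTakingService.clean_text` actually does.

```python
import string
from typing import Iterable, Set


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace (assumed normalization)."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())


def build_phrase_set(phrases: Iterable[str]) -> Set[str]:
    """Normalize the configured phrases once so that lookups are cheap."""
    return {normalize(phrase) for phrase in phrases}


def is_backchannel(transcript: str, phrase_set: Set[str]) -> bool:
    """Return True when the whole transcript matches a configured phrase."""
    return normalize(transcript) in phrase_set


# Hypothetical usage with a few phrases from backchannel_phrases.yaml.
phrases = build_phrase_set(["uh-huh", "yeah", "okay", "i see"])
print(is_backchannel("Uh-huh.", phrases))
print(is_backchannel("Okay, stop please", phrases))
```

Normalizing both sides with the same function means that variants such as "Uh-huh." and "uh-huh" compare equal, while a longer utterance that merely starts with a backchannel word is not treated as one.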
diff --git a/examples/voice_agent/server/backchannel_phrases.yaml b/examples/voice_agent/server/backchannel_phrases.yaml
@@ -0,0 +1,79 @@
+- "absolutely"
+- "ah"
+- "all right"
+- "alright"
+- "but yeah"
+- "cool"
+- "definitely"
+- "exactly"
+- "go ahead"
+- "good"
+- "great"
+- "great thanks"
+- "ha ha"
+- "hi"
+- "hmm"
+- "humm"
+- "huh"
+- "i know"
+- "i know right"
+- "i see"
+- "indeed"
+- "interesting"
+- "mhmm"
+- "mhmm mhmm"
+- "mhmm right"
+- "mhmm yeah"
+- "mhmm yes"
+- "mm hmm"
+- "mmhmm"
+- "nice"
+- "of course"
+- "oh"
+- "oh dear"
+- "oh man"
+- "oh okay"
+- "oh wow"
+- "oh yes"
+- "ok"
+- "ok thanks"
+- "okay"
+- "okay okay"
+- "okay thanks"
+- "perfect"
+- "really"
+- "right"
+- "right exactly"
+- "right right"
+- "right yeah"
+- "so yeah"
+- "sounds good"
+- "sure"
+- "sure thing"
+- "thank you"
+- "thanks"
+- "that's awesome"
+- "thats right"
+- "thats true"
+- "true"
+- "uh huh"
+- "uh-huh"
+- "uh-huh yeah"
+- "uhhuh"
+- "uhhuh okay"
+- "um-humm"
+- "well"
+- "what"
+- "wow"
+- "yeah"
+- "yeah i know"
+- "yeah i see"
+- "yeah mhmm"
+- "yeah okay"
+- "yeah right"
+- "yeah uh-huh"
+- "yeah yeah"
+- "yep"
+- "yes"
+- "yes please"
+- "yes yes"
diff --git a/examples/voice_agent/server/bot_websocket_server.py b/examples/voice_agent/server/bot_websocket_server.py
@@ -107,6 +107,7 @@
 
 
 ### Turn taking
+TURN_TAKING_BACKCHANNEL_PHRASES = server_config.turn_taking.backchannel_phrases
 TURN_TAKING_MAX_BUFFER_SIZE = server_config.turn_taking.max_buffer_size
 TURN_TAKING_BOT_STOP_DELAY = server_config.turn_taking.bot_stop_delay
 
@@ -183,7 +184,8 @@ async def run_bot_websocket_server():
             vad_analyzer=vad_analyzer,
             session_timeout=None,  # Disable session timeout
             audio_in_sample_rate=SAMPLE_RATE,
-            can_create_user_frames=False,
+            # If backchannel phrases are disabled, let the VAD interrupt the bot immediately.
+            can_create_user_frames=TURN_TAKING_BACKCHANNEL_PHRASES is None,
             audio_out_10ms_chunks=TRANSPORT_AUDIO_OUT_10MS_CHUNKS,
         ),
         host="0.0.0.0",  # Bind to all interfaces
@@ -222,6 +224,7 @@ async def run_bot_websocket_server():
         use_diar=USE_DIAR,
         max_buffer_size=TURN_TAKING_MAX_BUFFER_SIZE,
         bot_stop_delay=TURN_TAKING_BOT_STOP_DELAY,
+        backchannel_phrases=TURN_TAKING_BACKCHANNEL_PHRASES,
     )
     logger.info("Turn taking service initialized")
 
diff --git a/examples/voice_agent/server/server_config.yaml b/examples/voice_agent/server/server_config.yaml
@@ -29,6 +29,7 @@ diar:
   frame_len_in_secs: 0.08  # default for FastConformer, do not change
 
 turn_taking:
+  backchannel_phrases: "./server/backchannel_phrases.yaml"  # set it to the actual path of the file, or specify a list of backchannel phrases here
   max_buffer_size: 2  # num of words more than this amount will interrupt the LLM immediately
   bot_stop_delay: 0.5  # in seconds, a delay between server and client audio output
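
As a rough illustration of how the `turn_taking` settings above might interact, the sketch below combines `backchannel_phrases` and `max_buffer_size` into one interruption decision. The function `should_interrupt_bot` is hypothetical, and the order of the checks is an assumption based only on the config comments and the README text, not on the actual service implementation.

```python
from typing import Optional, Set


def should_interrupt_bot(
    transcript: str,
    backchannel_phrases: Optional[Set[str]],
    max_buffer_size: int = 2,
) -> bool:
    """Decide whether user speech should interrupt the bot (assumed logic)."""
    words = transcript.lower().split()
    if not words:
        # Nothing has been transcribed yet, so there is nothing to act on.
        return False
    if backchannel_phrases is None:
        # Backchannel detection disabled: any user speech interrupts.
        return True
    if len(words) > max_buffer_size:
        # More words than the buffer allows: interrupt immediately.
        return True
    # Short utterance: interrupt only if it is not a known backchannel phrase.
    return " ".join(words) not in backchannel_phrases


# Hypothetical usage.
known = {"yeah", "i see", "okay"}
print(should_interrupt_bot("yeah", known))
print(should_interrupt_bot("stop", known))
```

Under these assumptions, a larger `max_buffer_size` lets longer phrases be checked against the list before the bot is interrupted, at the cost of reacting more slowly to a genuine interruption.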
 
diff --git a/nemo/agents/voice_agent/pipecat/services/nemo/legacy_asr.py b/nemo/agents/voice_agent/pipecat/services/nemo/legacy_asr.py
@@ -11,6 +11,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+# NOTE: This file will be deprecated in the future, as the new inference pipeline will replace it.
 
 import math
 from typing import List
diff --git a/nemo/agents/voice_agent/pipecat/services/nemo/legacy_diar.py b/nemo/agents/voice_agent/pipecat/services/nemo/legacy_diar.py
@@ -11,7 +11,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-
+# NOTE: This file will be deprecated in the future, as the new inference pipeline will replace it.
 from dataclasses import dataclass
 from typing import Optional, Tuple
 
diff --git a/nemo/agents/voice_agent/pipecat/services/nemo/turn_taking.py b/nemo/agents/voice_agent/pipecat/services/nemo/turn_taking.py
@@ -13,8 +13,10 @@
 # limitations under the License.
 
 import time
-from typing import List
+from pathlib import Path
+from typing import List, Optional, Union
 
+import yaml
 from loguru import logger
 from pipecat.frames.frames import (
     BotStartedSpeakingFrame,
@@ -35,102 +37,17 @@
 
 from nemo.agents.voice_agent.pipecat.frames.frames import DiarResultFrame
 
-DEFAULT_BACKCHANNEL_PHRASES = [
-    "cool",
-    "huh",
-    "okay okay",
-    "mhmm",
-    "mmhmm",
-    'uhhuh',
-    'uhhuh okay',
-    'sure thing',
-    'uh huh',
-    'mm hmm',
-    'hmm',
-    'humm',
-    'absolutely',
-    'ah',
-    'all right',
-    'alright',
-    'but yeah',
-    'definitely',
-    'exactly',
-    'go ahead',
-    'good',
-    'great',
-    'great thanks',
-    'ha ha',
-    'hi',
-    'i know',
-    'i know right',
-    'i see',
-    'indeed',
-    'interesting',
-    'mhmm',
-    'mhmm mhmm',
-    'mhmm right',
-    'mhmm yeah',
-    'mhmm yes',
-    'nice',
-    'of course',
-    'oh',
-    'oh dear',
-    'oh man',
-    'oh okay',
-    'oh wow',
-    'oh yes',
-    'ok',
-    'ok thanks',
-    'okay',
-    'okay okay',
-    'okay thanks',
-    'perfect',
-    'really',
-    'right',
-    'right exactly',
-    'right right',
-    'right yeah',
-    'so yeah',
-    'sounds good',
-    'sure',
-    'thank you',
-    'thanks',
-    "that's awesome",
-    'thats right',
-    'thats true',
-    'true',
-    'uh-huh',
-    'uh-huh yeah',
-    'uhhuh',
-    'um-humm',
-    'well',
-    'what',
-    'wow',
-    'yeah',
-    'yeah i know',
-    'yeah i see',
-    'yeah mhmm',
-    'yeah okay',
-    'yeah right',
-    'yeah uh-huh',
-    'yeah yeah',
-    'yep',
-    'yes',
-    'yes please',
-    'yes yes',
-]
-
 
 class NeMoTurnTakingService(FrameProcessor):
     def __init__(
         self,
+        backchannel_phrases: Optional[Union[str, List[str]]] = None,
         eou_string: str = "<EOU>",
         eob_string: str = "<EOB>",
         language: Language = Language.EN_US,
         use_vad: bool = True,
         use_diar: bool = False,
         max_buffer_size: int = 3,
-        backchannel_phrases: List[str] = DEFAULT_BACKCHANNEL_PHRASES,
         bot_stop_delay: float = 0.5,
         **kwargs,
     ):
@@ -141,7 +58,8 @@ def __init__(
         self.use_vad = use_vad
         self.use_diar = use_diar
         self.max_buffer_size = max_buffer_size
-        self.backchannel_phrases = backchannel_phrases
+
+        self.backchannel_phrases = self._load_backchannel_phrases(backchannel_phrases)
         self.backchannel_phrases_nopc = set([self.clean_text(phrase) for phrase in self.backchannel_phrases])
         self.bot_stop_delay = bot_stop_delay
         # internal data
@@ -156,6 +74,25 @@ def __init__(
             # if vad is not used, we assume the user is always speaking
             self._vad_user_speaking = True
 
+    def _load_backchannel_phrases(self, backchannel_phrases: Optional[Union[str, List[str]]] = None):
+        if not backchannel_phrases:
+            return []
+
+        if isinstance(backchannel_phrases, str):
+            # A string is treated as a path to a YAML file containing a list of phrases.
+            phrases_file = backchannel_phrases
+            if not Path(phrases_file).is_file():
+                raise FileNotFoundError(f"Backchannel phrases file not found: {phrases_file}")
+            with open(phrases_file, "r") as f:
+                backchannel_phrases = yaml.safe_load(f)
+            if not isinstance(backchannel_phrases, list):
+                raise ValueError(f"Backchannel phrases must be a list, got {type(backchannel_phrases)}")
+            logger.info(f"Loaded {len(backchannel_phrases)} backchannel phrases from file: {phrases_file}")
+        elif isinstance(backchannel_phrases, list):
+            logger.info(f"Using backchannel phrases from list: {backchannel_phrases}")
+        else:
+            raise ValueError(f"Invalid backchannel phrases: {backchannel_phrases}")
+        return backchannel_phrases
+
     def clean_text(self, text: str) -> str:
         """
         Clean the text so that it can be used for backchannel detection.
diff --git a/nemo/agents/voice_agent/pipecat/services/nemo/utils.py b/nemo/agents/voice_agent/pipecat/services/nemo/utils.py
@@ -11,6 +11,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+# NOTE: This file will be deprecated in the future, as the new inference pipeline will replace it.
 
 import math