Commit 5cc5e26

refactor
Signed-off-by: stevehuang52 <[email protected]>
1 parent f31bd13 commit 5cc5e26

8 files changed, with 127 additions and 98 deletions

examples/voice_agent/README.md

Lines changed: 15 additions & 8 deletions
Original file line number | Diff line number | Diff line change
@@ -15,11 +15,11 @@ A [Pipecat](https://github.com/pipecat-ai/pipecat) example demonstrating the sim
1515

1616

1717
## 💡 Upcoming Next
18-
- More accurate and noise-robust streaming ASR and diarization models.
18+
- More accurate and noise-robust streaming ASR models.
1919
- Faster EOU detection and backchannel handling (e.g., bot will not stop speaking when user is saying something like "uhuh", "wow", "i see").
20-
- Better streaming ASR and diarization pipeline.
20+
- Better streaming ASR and speaker diarization pipeline.
2121
- Better TTS model with more natural voice.
22-
- Joint ASR and diarization model.
22+
- Joint ASR and speaker diarization model.
2323
- Function calling, RAG, etc.
2424

2525

@@ -61,7 +61,7 @@ Alternatively, you can install the dependencies manually in an existing environm
6161
```bash
6262
pip install -r requirements.txt
6363
```
64-
The incompatability errors from pip can be ignored, if any.
64+
The incompatibility errors from pip can be ignored.
6565

6666
### Configure the server
6767

@@ -119,13 +119,13 @@ Please refer to the HuggingFace webpage of each model to configure the model par
119119

120120
### 🎤 ASR
121121

122-
We use [cache-aware streaming FastConformer](https://arxiv.org/abs/2312.17279) to transcribe the user's speech. While new models are to be released, we use the existing English models for now:
122+
We use [cache-aware streaming FastConformer](https://arxiv.org/abs/2312.17279) to transcribe the user's speech into text. While new models will be released soon, we use the existing English models for now:
123123
- [stt_en_fastconformer_hybrid_large_streaming_80ms](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_fastconformer_hybrid_large_streaming_80ms) (default)
124124
- [nvidia/stt_en_fastconformer_hybrid_large_streaming_multi](https://huggingface.co/nvidia/stt_en_fastconformer_hybrid_large_streaming_multi)
125125

126-
### 💬 Diarization
126+
### 💬 Speaker Diarization
127127

128-
We use [streaming Sortformer](http://arxiv.org/abs/2507.18446) to detect the speaker for each user turn. As of now, we only support detecting 1 speaker for a single user turn, but different turns can be from different speakers, with a maximum of 4 speakers in the whole conversation. Currently supported models are:
128+
Speaker diarization aims to distinguish different speakers in the input speech audio. We use [streaming Sortformer](http://arxiv.org/abs/2507.18446) to detect the speaker for each user turn. As of now, we only support detecting 1 speaker for a single user turn, but different turns can be from different speakers, with a maximum of 4 speakers in the whole conversation. Currently supported models are:
129129
- [nvidia/diar_streaming_sortformer_4spk-v2](https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2) (default)
130130

131131

@@ -136,9 +136,16 @@ Please note that in some circumstances, the diarization model might not work wel
136136
We use [FastPitch-HiFiGAN](https://huggingface.co/nvidia/tts_en_fastpitch) to generate the speech for the LLM response, and it only supports English output. More TTS models will be supported in the future.
137137

138138

139+
### Turn-taking
140+
141+
As the new turn-taking prediction model is not yet released, we use the VAD-based turn-taking prediction for now. You can set the `vad.stop_secs` to the desired value in `server/server_config.yaml` to control the amount of silence needed to indicate the end of a user's turn.
142+
143+
Additionally, the voice agent supports ignoring backchannel phrases while the bot is talking, which means that phrases such as "uh-huh", "yeah", "okay" will not interrupt the bot. To control which backchannel phrases are used, set `turn_taking.backchannel_phrases` in `server/server_config.yaml` to the desired list of phrases, or to the path of a YAML file containing the list. Setting it to `null` disables backchannel detection, so the VAD will interrupt the bot immediately when the user starts speaking.
144+
145+
139146
## 📝 Notes & FAQ
140147
- Only one connection to the server is supported at a time, a new connection will disconnect the previous one, but the context will be preserved.
141-
- If directly loading from HuggingFace and got I/O erros, you can set `llm.model=<local_path>`, where the model is downloaded via somehing like `huggingface-cli download Qwen/Qwen3-8B --local-dir <local_path>`. Same for TTS models.
148+
- If you load models directly from HuggingFace and get I/O errors, you can set `llm.model=<local_path>`, where the model is downloaded using a command like `huggingface-cli download Qwen/Qwen3-8B --local-dir <local_path>`. The same applies to TTS models.
142149
- The current ASR and diarization models are not noise-robust, you might need to use a noise-cancelling microphone or a quiet environment. But we will release better models soon.
143150
- The diarization model works best when the speakers' voices differ clearly from each other, and it might not work well on some accents due to the limited training data.
144151
- If you see errors like `SyntaxError: Unexpected reserved word` when running `npm run dev`, please update the Node.js version.
Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
1+
- "absolutely"
2+
- "ah"
3+
- "all right"
4+
- "alright"
5+
- "but yeah"
6+
- "cool"
7+
- "definitely"
8+
- "exactly"
9+
- "go ahead"
10+
- "good"
11+
- "great"
12+
- "great thanks"
13+
- "ha ha"
14+
- "hi"
15+
- "hmm"
16+
- "humm"
17+
- "huh"
18+
- "i know"
19+
- "i know right"
20+
- "i see"
21+
- "indeed"
22+
- "interesting"
23+
- "mhmm"
24+
- "mhmm mhmm"
25+
- "mhmm right"
26+
- "mhmm yeah"
27+
- "mhmm yes"
28+
- "mm hmm"
29+
- "mmhmm"
30+
- "nice"
31+
- "of course"
32+
- "oh"
33+
- "oh dear"
34+
- "oh man"
35+
- "oh okay"
36+
- "oh wow"
37+
- "oh yes"
38+
- "ok"
39+
- "ok thanks"
40+
- "okay"
41+
- "okay okay"
42+
- "okay thanks"
43+
- "perfect"
44+
- "really"
45+
- "right"
46+
- "right exactly"
47+
- "right right"
48+
- "right yeah"
49+
- "so yeah"
50+
- "sounds good"
51+
- "sure"
52+
- "sure thing"
53+
- "thank you"
54+
- "thanks"
55+
- "that's awesome"
56+
- "thats right"
57+
- "thats true"
58+
- "true"
59+
- "uh huh"
60+
- "uh-huh"
61+
- "uh-huh yeah"
62+
- "uhhuh"
63+
- "uhhuh okay"
64+
- "um-humm"
65+
- "well"
66+
- "what"
67+
- "wow"
68+
- "yeah"
69+
- "yeah i know"
70+
- "yeah i see"
71+
- "yeah mhmm"
72+
- "yeah okay"
73+
- "yeah right"
74+
- "yeah uh-huh"
75+
- "yeah yeah"
76+
- "yep"
77+
- "yes"
78+
- "yes please"
79+
- "yes yes"
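
The phrase list above is presumably compared against short user utterances after some text normalization; the commit builds a set through `clean_text`, whose body is not shown in this diff. The following is a minimal sketch of that kind of lookup. The `normalize` and `is_backchannel` helpers are hypothetical stand-ins, and the real `clean_text` may behave differently.

```python
import string
from typing import List

# Hypothetical stand-in for the service's clean_text(); the real
# implementation is not visible in this diff and may differ.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


def is_backchannel(utterance: str, phrases: List[str]) -> bool:
    """Return True if the whole utterance matches one listed phrase."""
    normalized_phrases = {normalize(p) for p in phrases}
    return normalize(utterance) in normalized_phrases


phrases = ["uh-huh", "that's awesome", "okay okay"]
print(is_backchannel("Uh-huh!", phrases))
print(is_backchannel("Okay, okay.", phrases))
print(is_backchannel("okay, but what about tomorrow?", phrases))
```

Under this particular normalization, spelling variants such as "uh-huh" and "uhhuh" collapse to the same key, which is one reason a set is a natural container for the cleaned phrases.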

examples/voice_agent/server/bot_websocket_server.py

Lines changed: 4 additions & 1 deletion
@@ -107,6 +107,7 @@
107107

108108

109109
### Turn taking
110+
TURN_TAKING_BACKCHANNEL_PHRASES = server_config.turn_taking.backchannel_phrases
110111
TURN_TAKING_MAX_BUFFER_SIZE = server_config.turn_taking.max_buffer_size
111112
TURN_TAKING_BOT_STOP_DELAY = server_config.turn_taking.bot_stop_delay
112113

@@ -183,7 +184,8 @@ async def run_bot_websocket_server():
183184
vad_analyzer=vad_analyzer,
184185
session_timeout=None, # Disable session timeout
185186
audio_in_sample_rate=SAMPLE_RATE,
186-
can_create_user_frames=False,
187+
can_create_user_frames=TURN_TAKING_BACKCHANNEL_PHRASES
188+
is None, # if backchannel phrases are disabled, we can use VAD to interrupt the bot immediately
187189
audio_out_10ms_chunks=TRANSPORT_AUDIO_OUT_10MS_CHUNKS,
188190
),
189191
host="0.0.0.0", # Bind to all interfaces
@@ -222,6 +224,7 @@ async def run_bot_websocket_server():
222224
use_diar=USE_DIAR,
223225
max_buffer_size=TURN_TAKING_MAX_BUFFER_SIZE,
224226
bot_stop_delay=TURN_TAKING_BOT_STOP_DELAY,
227+
backchannel_phrases=TURN_TAKING_BACKCHANNEL_PHRASES,
225228
)
226229
logger.info("Turn taking service initialized")
227230
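
The `can_create_user_frames` change in this file ties VAD-driven interruption to whether backchannel filtering is configured. A small sketch of that relationship follows; the helper name `vad_may_interrupt` is hypothetical, and only the `is None` test is taken from the diff.

```python
from typing import List, Optional, Union

# The setting may be None (null in YAML), a file path, or a list of phrases.
BackchannelSetting = Optional[Union[str, List[str]]]


def vad_may_interrupt(backchannel_phrases: BackchannelSetting) -> bool:
    """Mirror the `... is None` expression passed as can_create_user_frames."""
    return backchannel_phrases is None


print(vad_may_interrupt(None))
print(vad_may_interrupt("./server/backchannel_phrases.yaml"))
print(vad_may_interrupt([]))
```

One consequence worth checking, assuming the config value reaches Python as a list-like object: an empty list is not the same as `null` here. With `[]`, the expression is false and VAD interruption stays off, even though the loader in `turn_taking.py` returns an empty phrase list for any falsy value.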

examples/voice_agent/server/server_config.yaml

Lines changed: 1 addition & 0 deletions
@@ -29,6 +29,7 @@ diar:
2929
frame_len_in_secs: 0.08 # default for FastConformer, do not change
3030

3131
turn_taking:
32+
backchannel_phrases: "./server/backchannel_phrases.yaml" # set it to the actual path of the file, or specify a list of backchannel phrases here
3233
max_buffer_size: 2 # num of words more than this amount will interrupt the LLM immediately
3334
bot_stop_delay: 0.5 # in seconds, a delay between server and client audio output
3435

nemo/agents/voice_agent/pipecat/services/nemo/legacy_asr.py

Lines changed: 1 addition & 0 deletions
@@ -11,6 +11,7 @@
1111
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
1212
# See the License for the specific language governing permissions and
1313
# limitations under the License.
14+
# NOTE: This file will be deprecated in the future, as the new inference pipeline will replace it.
1415

1516
import math
1617
from typing import List

nemo/agents/voice_agent/pipecat/services/nemo/legacy_diar.py

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@
1111
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
1212
# See the License for the specific language governing permissions and
1313
# limitations under the License.
14-
14+
# NOTE: This file will be deprecated in the future, as the new inference pipeline will replace it.
1515
from dataclasses import dataclass
1616
from typing import Optional, Tuple
1717

nemo/agents/voice_agent/pipecat/services/nemo/turn_taking.py

Lines changed: 25 additions & 88 deletions
@@ -13,8 +13,10 @@
1313
# limitations under the License.
1414

1515
import time
16-
from typing import List
16+
from pathlib import Path
17+
from typing import List, Optional, Union
1718

19+
import yaml
1820
from loguru import logger
1921
from pipecat.frames.frames import (
2022
BotStartedSpeakingFrame,
@@ -35,102 +37,17 @@
3537

3638
from nemo.agents.voice_agent.pipecat.frames.frames import DiarResultFrame
3739

38-
DEFAULT_BACKCHANNEL_PHRASES = [
39-
"cool",
40-
"huh",
41-
"okay okay",
42-
"mhmm",
43-
"mmhmm",
44-
'uhhuh',
45-
'uhhuh okay',
46-
'sure thing',
47-
'uh huh',
48-
'mm hmm',
49-
'hmm',
50-
'humm',
51-
'absolutely',
52-
'ah',
53-
'all right',
54-
'alright',
55-
'but yeah',
56-
'definitely',
57-
'exactly',
58-
'go ahead',
59-
'good',
60-
'great',
61-
'great thanks',
62-
'ha ha',
63-
'hi',
64-
'i know',
65-
'i know right',
66-
'i see',
67-
'indeed',
68-
'interesting',
69-
'mhmm',
70-
'mhmm mhmm',
71-
'mhmm right',
72-
'mhmm yeah',
73-
'mhmm yes',
74-
'nice',
75-
'of course',
76-
'oh',
77-
'oh dear',
78-
'oh man',
79-
'oh okay',
80-
'oh wow',
81-
'oh yes',
82-
'ok',
83-
'ok thanks',
84-
'okay',
85-
'okay okay',
86-
'okay thanks',
87-
'perfect',
88-
'really',
89-
'right',
90-
'right exactly',
91-
'right right',
92-
'right yeah',
93-
'so yeah',
94-
'sounds good',
95-
'sure',
96-
'thank you',
97-
'thanks',
98-
"that's awesome",
99-
'thats right',
100-
'thats true',
101-
'true',
102-
'uh-huh',
103-
'uh-huh yeah',
104-
'uhhuh',
105-
'um-humm',
106-
'well',
107-
'what',
108-
'wow',
109-
'yeah',
110-
'yeah i know',
111-
'yeah i see',
112-
'yeah mhmm',
113-
'yeah okay',
114-
'yeah right',
115-
'yeah uh-huh',
116-
'yeah yeah',
117-
'yep',
118-
'yes',
119-
'yes please',
120-
'yes yes',
121-
]
122-
12340

12441
class NeMoTurnTakingService(FrameProcessor):
12542
def __init__(
12643
self,
44+
backchannel_phrases: Optional[Union[str, List[str]]] = None,
12745
eou_string: str = "<EOU>",
12846
eob_string: str = "<EOB>",
12947
language: Language = Language.EN_US,
13048
use_vad: bool = True,
13149
use_diar: bool = False,
13250
max_buffer_size: int = 3,
133-
backchannel_phrases: List[str] = DEFAULT_BACKCHANNEL_PHRASES,
13451
bot_stop_delay: float = 0.5,
13552
**kwargs,
13653
):
@@ -141,7 +58,8 @@ def __init__(
14158
self.use_vad = use_vad
14259
self.use_diar = use_diar
14360
self.max_buffer_size = max_buffer_size
144-
self.backchannel_phrases = backchannel_phrases
61+
62+
self.backchannel_phrases = self._load_backchannel_phrases(backchannel_phrases)
14563
self.backchannel_phrases_nopc = set([self.clean_text(phrase) for phrase in self.backchannel_phrases])
14664
self.bot_stop_delay = bot_stop_delay
14765
# internal data
@@ -156,6 +74,25 @@ def __init__(
15674
# if vad is not used, we assume the user is always speaking
15775
self._vad_user_speaking = True
15876

77+
def _load_backchannel_phrases(self, backchannel_phrases: Optional[Union[str, List[str]]] = None):
78+
if not backchannel_phrases:
79+
return []
80+
81+
if isinstance(backchannel_phrases, str):
82+
logger.info(f"Loading backchannel phrases from file: {backchannel_phrases}")
83+
if not Path(backchannel_phrases).exists():
84+
raise FileNotFoundError(f"Backchannel phrases file not found: {backchannel_phrases}")
85+
with open(backchannel_phrases, "r") as f:
86+
backchannel_phrases = yaml.safe_load(f)
87+
if not isinstance(backchannel_phrases, list):
88+
raise ValueError(f"Backchannel phrases must be a list, got {type(backchannel_phrases)}")
89+
logger.info(f"Loaded {len(backchannel_phrases)} backchannel phrases from file")
90+
elif isinstance(backchannel_phrases, list):
91+
logger.info(f"Using backchannel phrases from list: {backchannel_phrases}")
92+
else:
93+
raise ValueError(f"Invalid backchannel phrases: {backchannel_phrases}")
94+
return backchannel_phrases
95+
15996
def clean_text(self, text: str) -> str:
16097
"""
16198
Clean the text so that it can be used for backchannel detection.
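
The loader added in this hunk accepts `None`, a list, or a path to a YAML file. The following standalone sketch shows one way to resolve those three cases so that a missing file is reported as such. The function name `load_backchannel_phrases`, the lazy `yaml` import, and the per-item string check are assumptions made for illustration and are not part of the commit.

```python
from pathlib import Path
from typing import List, Optional, Union


def load_backchannel_phrases(setting: Optional[Union[str, List[str]]]) -> List[str]:
    """Resolve a config value (None, list, or YAML file path) to a list of phrases."""
    if not setting:
        # None, empty string, or empty list: no backchannel phrases.
        return []
    if isinstance(setting, str):
        path = Path(setting)
        if not path.is_file():
            raise FileNotFoundError(f"Backchannel phrases file not found: {path}")
        # PyYAML is imported lazily so that list input works without it.
        import yaml

        with path.open("r", encoding="utf-8") as f:
            phrases = yaml.safe_load(f)
    else:
        phrases = setting
    if not isinstance(phrases, list) or not all(isinstance(p, str) for p in phrases):
        raise ValueError(f"Backchannel phrases must be a list of strings, got {type(phrases)}")
    return list(phrases)


print(load_backchannel_phrases(None))
print(load_backchannel_phrases(["uh-huh", "yeah"]))
```

Keeping the path in its own variable, rather than reusing the argument name for the parsed result, also lets later log messages refer to the file that was read.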

nemo/agents/voice_agent/pipecat/services/nemo/utils.py

Lines changed: 1 addition & 0 deletions
@@ -11,6 +11,7 @@
1111
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
1212
# See the License for the specific language governing permissions and
1313
# limitations under the License.
14+
# NOTE: This file will be deprecated in the future, as the new inference pipeline will replace it.
1415

1516
import math
1617
