Local text-to-speech using Qwen3-TTS with Apple Silicon GPU acceleration.
- macOS with Apple Silicon (M1/M2/M3/M4)
- Python 3.12 (via pyenv)
- ~4GB disk space for models
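The requirements above can be checked programmatically. A minimal sketch using only the standard library (the helper name and expected values are mine, mirroring the list above):

```python
import platform
import sys

def check_requirements(system, machine, py_version):
    """Return a list of problems; an empty list means the host matches the list above."""
    problems = []
    if system != "Darwin":
        problems.append(f"expected macOS (Darwin), got {system}")
    if machine != "arm64":
        problems.append(f"expected Apple Silicon (arm64), got {machine}")
    if tuple(py_version[:2]) != (3, 12):
        problems.append(f"expected Python 3.12, got {py_version[0]}.{py_version[1]}")
    return problems

if __name__ == "__main__":
    issues = check_requirements(platform.system(), platform.machine(), sys.version_info)
    print("OK" if not issues else "; ".join(issues))
```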
```bash
cd /Users/keith/Projects/Qwen-TTS
source qwen3-tts-env/bin/activate
```

Train your voice once, then generate unlimited speech.

```bash
source qwen3-tts-env/bin/activate
python voice_tts_app.py
```

Open http://localhost:7860 in your browser.
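To confirm the web UI is actually listening before opening the browser, a small stdlib check like this works (port 7860 is the default used above; the helper is mine):

```python
import socket

def is_listening(host="127.0.0.1", port=7860, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    print("UI is up" if is_listening() else "UI not reachable on port 7860")
```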
Tab 1 - Train Voice (one time):
- Upload or record your voice (5-15 seconds)
- Enter the transcript of what you said (improves quality)
- Give your voice profile a name (e.g., "my_voice")
- Click "Train & Save Voice Profile"
Tab 2 - Generate Speech (unlimited):
- Select your saved voice profile
- Enter any text you want spoken
- Click "Generate Speech"
- Download the .wav file
Profiles are saved in voice_profiles/ and persist between sessions.
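The on-disk layout of the `.pkl` profiles isn't documented here, so treat this as a sketch: assuming each file is a pickle written by `create_voice_profile.py`, listing and loading saved profiles looks roughly like this (helper names are mine):

```python
import pickle
from pathlib import Path

PROFILE_DIR = Path("voice_profiles")

def list_profiles(profile_dir=PROFILE_DIR):
    """Return saved profile names (file stems), sorted alphabetically."""
    return sorted(p.stem for p in Path(profile_dir).glob("*.pkl"))

def load_profile(name, profile_dir=PROFILE_DIR):
    """Load one saved profile; the payload layout depends on create_voice_profile.py."""
    with open(Path(profile_dir) / f"{name}.pkl", "rb") as f:
        return pickle.load(f)
```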
```bash
pkill -f "voice_tts_app.py"
```

For quick one-time voice cloning without saving a profile:

```bash
python voice_clone_app.py
```

Open http://localhost:7860 - upload audio, enter text, generate speech.
The Base model uses voice cloning - it requires a reference audio file to clone the voice characteristics.
```python
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

# Load the model on the Apple Silicon GPU (MPS)
model = Qwen3TTSModel.from_pretrained(
    "./models/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="mps",
    dtype=torch.float32,
)

# Generate speech (x_vector_only_mode uses only the speaker embedding from the reference)
wavs, sr = model.generate_voice_clone(
    text="Hello, this is a test of Qwen text to speech.",
    language="english",
    ref_audio="reference.wav",  # Your reference audio file
    x_vector_only_mode=True,
    do_sample=True,
    temperature=0.8,
)

# Save the output
sf.write("output.wav", wavs[0], sr)
```

For better voice cloning, provide a transcript of the reference audio:
```python
wavs, sr = model.generate_voice_clone(
    text="Text you want to synthesize.",
    language="english",
    ref_audio="reference.wav",
    ref_text="Transcript of what is said in reference.wav",
    x_vector_only_mode=False,  # Enables ICL (in-context learning) mode
    do_sample=True,
    temperature=0.8,
)
```

Supported `language` values: `auto` (auto-detect), `english`, `chinese`, `french`, `german`, `italian`, `japanese`, `korean`, `portuguese`, `russian`, `spanish`.
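Passing an unsupported value to `language` is easy to do from scripts, so a small guard helps. A sketch using the supported set listed above (the helper name is mine, not part of the library):

```python
SUPPORTED_LANGUAGES = {
    "auto", "english", "chinese", "french", "german", "italian",
    "japanese", "korean", "portuguese", "russian", "spanish",
}

def validate_language(language):
    """Normalize a language value and check it before calling generate_voice_clone()."""
    lang = language.strip().lower()
    if lang not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"unsupported language {language!r}; choose one of: "
            + ", ".join(sorted(SUPPORTED_LANGUAGES))
        )
    return lang
```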
```bash
# With transcript (better quality)
python create_voice_profile.py my_recording.wav --text "What I said in the recording"

# Without transcript
python create_voice_profile.py my_recording.wav --x-vector-only

# Custom output path
python create_voice_profile.py my_recording.wav -t "transcript" -o voice_profiles/custom_name.pkl
```

```bash
python test_tts.py
afplay output.wav  # Play the result
```

```bash
source qwen3-tts-env/bin/activate
python ComfyUI/main.py --listen 0.0.0.0
```

Open http://localhost:8188 in your browser.
- Right-click the canvas → Add Node → Search for "Qwen3"
- Add a `Qwen3TTSModelLoader` node
- Add a `Qwen3TTSGenerate` node
- Connect them and configure:
  - Model path: `./models/Qwen3-TTS-12Hz-1.7B-Base`
  - Text: the text to synthesize
  - Language: `english` (or another supported language)
  - Reference audio: upload or connect an audio file
```bash
pkill -f "ComfyUI/main.py"
```

```
Qwen-TTS/
├── qwen3-tts-env/                    # Python virtual environment
├── models/
│   ├── Qwen3-TTS-12Hz-1.7B-Base/    # Main TTS model
│   └── Qwen3-TTS-Tokenizer-12Hz/    # Speech tokenizer
├── voice_profiles/                   # Saved voice profiles (.pkl files)
├── ComfyUI/
│   └── custom_nodes/
│       └── ComfyUI-Qwen3-TTS/       # TTS nodes for ComfyUI
├── voice_tts_app.py                  # Main web UI: train + generate (recommended)
├── voice_clone_app.py                # One-shot voice cloning web UI
├── create_voice_profile.py           # CLI tool to create voice profiles
├── tts_app.py                        # Simple TTS using saved profiles
├── test_tts.py                       # Command line test script
└── output.wav                        # Generated audio output
```
- Reference audio quality matters - Use clear, noise-free recordings for best results
- MPS acceleration - The model runs on Apple Silicon GPU automatically
- Temperature - Lower (0.6-0.8) for more consistent output, higher (0.9-1.0) for variation
- flash-attn warning - Safe to ignore; it's CUDA-only and doesn't affect Mac
```bash
brew install sox
```

Ensure models are downloaded:

```bash
huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-Base --local-dir ./models/Qwen3-TTS-12Hz-1.7B-Base
huggingface-cli download Qwen/Qwen3-TTS-Tokenizer-12Hz --local-dir ./models/Qwen3-TTS-Tokenizer-12Hz
```

Restart ComfyUI - it loads nodes on startup.
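Before restarting, it can save a round trip to verify that both model directories exist and are non-empty. A stdlib sketch using the paths from the layout above (the helper is mine):

```python
from pathlib import Path

MODEL_DIRS = [
    "models/Qwen3-TTS-12Hz-1.7B-Base",
    "models/Qwen3-TTS-Tokenizer-12Hz",
]

def missing_models(base=".", model_dirs=MODEL_DIRS):
    """Return the model directories that are absent or empty under base."""
    missing = []
    for d in model_dirs:
        path = Path(base) / d
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(d)
    return missing

if __name__ == "__main__":
    gaps = missing_models()
    print("all models present" if not gaps else f"missing: {gaps}")
```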