
Qwen3-TTS on Mac M4

Local text-to-speech using Qwen3-TTS with Apple Silicon GPU acceleration.

Requirements

  • macOS with Apple Silicon (M1/M2/M3/M4)
  • Python 3.12 (via pyenv)
  • ~4GB disk space for models
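If you are unsure whether your PyTorch build can use the Apple Silicon GPU, a quick sanity check (standard PyTorch API, not specific to this repo):

import torch

print(torch.backends.mps.is_available())  # True if the MPS device can be used right now
print(torch.backends.mps.is_built())      # True if this PyTorch build includes MPS support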

Quick Start

Activate the environment

cd /Users/keith/Projects/Qwen-TTS
source qwen3-tts-env/bin/activate

Voice TTS Web UI (Recommended)

Train your voice once, then generate unlimited speech.

Start the Web UI

source qwen3-tts-env/bin/activate
python voice_tts_app.py

Open http://localhost:7860 in your browser.

How to Use

Tab 1 - Train Voice (one time):

  1. Upload or record your voice (5-15 seconds; a quick pre-flight check is sketched after this list)
  2. Enter the transcript of what you said (improves quality)
  3. Give your voice profile a name (e.g., "my_voice")
  4. Click "Train & Save Voice Profile"

Tab 2 - Generate Speech (unlimited):

  1. Select your saved voice profile
  2. Enter any text you want spoken
  3. Click "Generate Speech"
  4. Download the .wav file

Voice Profiles

Profiles are saved in voice_profiles/ and persist between sessions.
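If you want to see what a saved profile contains, the .pkl files can be opened with the standard library. This is a minimal sketch, assuming plain pickle serialization; the exact structure is defined by create_voice_profile.py, and the profile name is a placeholder:

import pickle

with open("voice_profiles/my_voice.pkl", "rb") as f:
    profile = pickle.load(f)

print(type(profile))
if isinstance(profile, dict):
    print(sorted(profile.keys()))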

Stop the Web UI

pkill -f "voice_tts_app.py"

Alternative: One-Shot Voice Clone

For quick one-time voice cloning without saving a profile:

python voice_clone_app.py

Open http://localhost:7860 - upload audio, enter text, generate speech. Note that both web apps default to port 7860, so stop one before starting the other.


Command Line Usage

Basic Voice Cloning

The Base model uses voice cloning - it requires a reference audio file to clone the voice characteristics.

import torch
from qwen_tts import Qwen3TTSModel
import soundfile as sf

# Load model on the Apple Silicon GPU (MPS backend)
model = Qwen3TTSModel.from_pretrained(
    "./models/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="mps",     # run on the Apple Silicon GPU
    dtype=torch.float32,  # float64 is unsupported on MPS; float32 is the safe default
)

# Generate speech (x_vector_only_mode uses speaker embedding from reference)
wavs, sr = model.generate_voice_clone(
    text="Hello, this is a test of Qwen text to speech.",
    language="english",
    ref_audio="reference.wav",  # Your reference audio file
    x_vector_only_mode=True,
    do_sample=True,
    temperature=0.8,
)

# Save output
sf.write("output.wav", wavs[0], sr)

With Reference Text (ICL Mode)

For better voice cloning, provide a transcript of the reference audio:

wavs, sr = model.generate_voice_clone(
    text="Text you want to synthesize.",
    language="english",
    ref_audio="reference.wav",
    ref_text="Transcript of what is said in reference.wav",
    x_vector_only_mode=False,  # Enables ICL mode
    do_sample=True,
    temperature=0.8,
)
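ICL (in-context learning) mode conditions generation on the reference audio and its transcript together, which typically tracks the reference voice more closely than the speaker-embedding-only mode; the trade-off is that the transcript must accurately match the recording.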

Supported Languages

  • auto - Auto-detect
  • english, chinese, french, german, italian
  • japanese, korean, portuguese, russian, spanish
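With the API from the examples above, switching languages is just a different language argument; the text should be written in the target language. A minimal sketch, reusing the model and reference.wav from "Basic Voice Cloning" (the sample sentences are placeholders):

import soundfile as sf

samples = {
    "english": "Hello from Qwen3-TTS.",
    "french": "Bonjour de la part de Qwen3-TTS.",
    "german": "Hallo von Qwen3-TTS.",
}

for lang, text in samples.items():
    wavs, sr = model.generate_voice_clone(
        text=text,
        language=lang,
        ref_audio="reference.wav",
        x_vector_only_mode=True,
        do_sample=True,
        temperature=0.8,
    )
    sf.write(f"output_{lang}.wav", wavs[0], sr)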

Create Voice Profile (CLI)

# With transcript (better quality)
python create_voice_profile.py my_recording.wav --text "What I said in the recording"

# Without transcript
python create_voice_profile.py my_recording.wav --x-vector-only

# Custom output path
python create_voice_profile.py my_recording.wav -t "transcript" -o voice_profiles/custom_name.pkl

Run the Test Script

python test_tts.py
afplay output.wav  # Play the result

ComfyUI Web Interface

Start ComfyUI

source qwen3-tts-env/bin/activate
python ComfyUI/main.py --listen 0.0.0.0

Open http://localhost:8188 in your browser.

Using Qwen3-TTS Nodes

  1. Right-click canvas → Add Node → Search for "Qwen3"
  2. Add Qwen3TTSModelLoader node
  3. Add Qwen3TTSGenerate node
  4. Connect them and configure:
    • Model path: ./models/Qwen3-TTS-12Hz-1.7B-Base
    • Text: Your text to synthesize
    • Language: english (or other supported language)
    • Reference audio: Upload or connect an audio file

Stop ComfyUI

pkill -f "ComfyUI/main.py"

Project Structure

Qwen-TTS/
├── qwen3-tts-env/          # Python virtual environment
├── models/
│   ├── Qwen3-TTS-12Hz-1.7B-Base/      # Main TTS model
│   └── Qwen3-TTS-Tokenizer-12Hz/      # Speech tokenizer
├── voice_profiles/         # Saved voice profiles (.pkl files)
├── ComfyUI/
│   └── custom_nodes/
│       └── ComfyUI-Qwen3-TTS/         # TTS nodes for ComfyUI
├── voice_tts_app.py        # Main web UI: train + generate (recommended)
├── voice_clone_app.py      # One-shot voice cloning web UI
├── create_voice_profile.py # CLI tool to create voice profiles
├── tts_app.py              # Simple TTS using saved profiles
├── test_tts.py             # Command line test script
└── output.wav              # Generated audio output

Tips

  • Reference audio quality matters - Use clear, noise-free recordings for best results
  • MPS acceleration - The model runs on Apple Silicon GPU automatically
  • Temperature - Lower (0.6-0.8) for more consistent output, higher (0.9-1.0) for variation; see the sweep sketched after this list
  • flash-attn warning - Safe to ignore; it's CUDA-only and doesn't affect Mac
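To hear the effect of the temperature setting directly, you can sweep it with the same API used above. A minimal sketch, assuming the model from "Basic Voice Cloning" is already loaded:

import soundfile as sf

for temp in (0.6, 0.8, 1.0):
    wavs, sr = model.generate_voice_clone(
        text="The same sentence, rendered at three different temperatures.",
        language="english",
        ref_audio="reference.wav",
        x_vector_only_mode=True,
        do_sample=True,
        temperature=temp,
    )
    sf.write(f"output_temp_{temp}.wav", wavs[0], sr)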

Troubleshooting

"SoX could not be found"

brew install sox

Model loading errors

Ensure models are downloaded:

huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-Base --local-dir ./models/Qwen3-TTS-12Hz-1.7B-Base
huggingface-cli download Qwen/Qwen3-TTS-Tokenizer-12Hz --local-dir ./models/Qwen3-TTS-Tokenizer-12Hz
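The same downloads can also be scripted from Python via huggingface_hub (the package that provides huggingface-cli):

from huggingface_hub import snapshot_download

for repo in ("Qwen/Qwen3-TTS-12Hz-1.7B-Base", "Qwen/Qwen3-TTS-Tokenizer-12Hz"):
    # Mirror each repo into ./models/<repo-name>, matching the CLI commands above
    snapshot_download(repo_id=repo, local_dir=f"./models/{repo.split('/')[1]}")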

ComfyUI custom node not showing

Restart ComfyUI - custom nodes are only discovered at startup, so a restart is required after installing them.

