Speech diarization and transcript formatting toolkit. Extract speaker-labeled transcripts from audio and format them into readable scripts.
- Python 3.11+
- HuggingFace access token (for pyannote models)
- GPU recommended for faster processing
- Audio files in .wav format
This project uses uv for dependency management:
# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone the repository
git clone <repository-url>
cd speech-mine
# Install dependencies and create virtual environment
uv sync

# Show all available commands
uv run speech-mine --help
# Extract transcript from audio file
uv run speech-mine extract meeting.wav output.csv --hf-token YOUR_HUGGINGFACE_TOKEN
# Format CSV to readable movie-style script
uv run speech-mine format output.csv formatted_script.txt

# Direct audio extraction
uv run extract-audio meeting.wav output.csv --hf-token YOUR_TOKEN
# Direct script formatting
uv run format-script output.csv script.txt

# 1. Extract with known speaker count (best accuracy)
uv run speech-mine extract meeting.wav transcript.csv \
--hf-token YOUR_TOKEN \
--num-speakers 4 \
--model large-v3 \
--compute-type float32
# 2. Format to readable script
uv run speech-mine format transcript.csv meeting_script.txt
# 3. Create custom speaker names template
uv run speech-mine format transcript.csv script.txt --create-template
# 4. Format with custom speaker names
uv run speech-mine format transcript.csv final_script.txt \
--speakers transcript_speaker_names.json

# Perfect for interviews
uv run speech-mine extract interview.wav interview.csv \
--hf-token YOUR_TOKEN \
--num-speakers 2 \
--model medium \
--compute-type float32
uv run speech-mine format interview.csv interview_script.txt

# For systems without GPU
uv run speech-mine extract audio.wav output.csv \
--hf-token YOUR_TOKEN \
--model base \
--device cpu \
--compute-type float32 \
--num-speakers 2

# Use a specific Whisper model on GPU with a known number of speakers
uv run speech-mine extract interview.wav results.csv \
--hf-token YOUR_TOKEN \
--model large-v3 \
--device cuda \
--num-speakers 2 \
--compute-type float16 \
--verbose
# Use smaller model for faster CPU processing
uv run speech-mine extract podcast.wav transcript.csv \
--hf-token YOUR_TOKEN \
--model base \
--device cpu \
--compute-type float32 \
--min-speakers 2 \
--max-speakers 4
# Meeting with exact number of known speakers (best accuracy)
uv run speech-mine extract meeting.wav transcript.csv \
--hf-token YOUR_TOKEN \
--num-speakers 5 \
--model medium \
--compute-type float32
# Format with custom speaker names
echo '{"SPEAKER_00":"Alice","SPEAKER_01":"Bob"}' > speakers.json
uv run speech-mine format transcript.csv script.txt --speakers speakers.json

# See scripts/batch_process.sh and scripts/batch_format.sh for examples
# Example: batch_format.sh
./scripts/batch_format.sh input_dir output_dir
# Example: batch_process.sh
./scripts/batch_process.sh input_dir output_dir
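If you'd rather write your own loop, here is a minimal sketch of what a batch extraction script might look like (the bundled scripts may differ; it assumes HF_TOKEN is exported as shown at the end of this README):

```bash
#!/usr/bin/env bash
# Minimal batch extraction sketch: run speech-mine over every .wav in a
# directory. Assumes HF_TOKEN is exported; adjust flags to taste.
set -euo pipefail

input_dir="$1"
output_dir="$2"
mkdir -p "$output_dir"

for wav in "$input_dir"/*.wav; do
    name="$(basename "$wav" .wav)"
    uv run speech-mine extract "$wav" "$output_dir/$name.csv" \
        --hf-token "$HF_TOKEN" \
        --compute-type float32
done
```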
Available Whisper models (smaller = faster, larger = more accurate):
- `tiny`: Fastest, least accurate
- `base`: Good balance for quick processing
- `small`: Better accuracy, moderate speed
- `medium`: Good accuracy and speed
- `large-v3`: Best accuracy (default)
- `turbo`: Fast and accurate
Device Options:
- `auto`: Automatically detect best device (default)
- `cuda`: Force GPU usage (requires NVIDIA GPU)
- `cpu`: Force CPU usage
Compute Type Options:
- `float32`: CPU-compatible, slower but works everywhere (recommended for CPU)
- `float16`: GPU-optimized, faster (recommended for CUDA)
- `int8`: Fastest, slightly lower accuracy
Important: use `--compute-type float32` when running on CPU to avoid errors!
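Not sure whether CUDA is usable on your machine? One quick check (this assumes PyTorch is in the project's environment, which the pyannote dependency pulls in):

```bash
# Prints True if a CUDA-capable GPU is visible to PyTorch
uv run python -c "import torch; print(torch.cuda.is_available())"
```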
Best accuracy - exact number of speakers:
uv run speech-mine extract meeting.wav output.csv \
--hf-token $HF_TOKEN \
--num-speakers 3 \
--compute-type float32

Range-based speaker detection:
uv run speech-mine extract conference.wav output.csv \
--hf-token $HF_TOKEN \
--min-speakers 2 \
--max-speakers 8 \
--compute-type float32

| Parameter | Description | When to Use |
|---|---|---|
| `--num-speakers N` | Exact number of speakers | When you know exactly how many speakers (best accuracy) |
| `--min-speakers N` | Minimum speakers (default: 1) | Set to 2+ if you know multiple people speak |
| `--max-speakers N` | Maximum speakers | Limit false speaker detection in noisy audio |
💡 Pro tip: Specifying `--num-speakers` when you know the exact count can improve accuracy by 15-30%!
The tool generates multiple output files:
The transcript CSV contains both segment-level and word-level data:
| Column | Description |
|---|---|
| `type` | "segment" or "word" |
| `speaker` | Speaker identifier (SPEAKER_00, SPEAKER_01, etc.) |
| `start` | Start timestamp in seconds |
| `end` | End timestamp in seconds |
| `text` | Full segment text |
| `word` | Individual word (for word-type rows) |
| `word_position` | Position of word in segment |
| `confidence` | Confidence score (0-1) |
| `overlap_duration` | Speaker overlap duration |
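For quick inspections you can filter the CSV with standard tools. For example, a naive filter for segment-level rows (this assumes `type` is the first column; check your header row first):

```bash
# Keep the header plus all rows whose first column is "segment"
awk -F, 'NR==1 || $1 == "segment"' transcript.csv > segments.csv
```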
The formatted script is a human-readable, movie-style rendering of the transcript:
================================================================================
TRANSCRIPT
================================================================================
RECORDING DETAILS:
----------------------------------------
File: meeting.wav
Duration: 30:45
Language: ENGLISH (confidence: 95.2%)
Speakers: 3
Processed: 2025-09-08 16:35:00
CAST:
----------------------------------------
SPEAKER A
SPEAKER B
SPEAKER C
TRANSCRIPT:
----------------------------------------
[00:00 - 00:05] SPEAKER A:
Good morning everyone, let's start the meeting.
[00:06 - 00:12] SPEAKER B:
Thanks for organizing this. I have the quarterly
report ready to share.
[...3 second pause...]
[00:15 - 00:22] SPEAKER C:
Perfect, I'd like to hear about the sales numbers
first.
The metadata JSON contains processing information:
{
"audio_file": "meeting.wav",
"language": "en",
"language_probability": 0.95,
"duration": 1800.0,
"total_segments": 234,
"total_words": 3456,
"speakers": ["SPEAKER_00", "SPEAKER_01", "SPEAKER_02"],
"processing_timestamp": "2025-09-08 14:30:00"
}

To get a HuggingFace token:
- Create an account at HuggingFace
- Go to Settings → Access Tokens
- Create a new token with read permissions
- Accept the user agreement at pyannote/speaker-diarization-3.1
Audio file requirements:
- Format: .wav files only (convert other formats first; see the sketch below)
- Quality: 16kHz+ sample rate recommended
- Duration: No specific limits (longer files take more time)
- Channels: Mono or stereo supported
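If your recordings are in another format, a typical ffmpeg conversion might look like this (16 kHz mono matches the recommendation above but is an assumption, not a tool requirement):

```bash
# Convert an mp3 to a 16 kHz mono .wav before extraction
ffmpeg -i recording.mp3 -ar 16000 -ac 1 recording.wav
```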
| Model | Speed | Accuracy | Best For |
|---|---|---|---|
| `tiny` | ⚡⚡⚡⚡⚡ | ⭐⭐ | Quick tests, drafts |
| `base` | ⚡⚡⚡⚡ | ⭐⭐⭐ | Fast processing, good quality |
| `small` | ⚡⚡⚡ | ⭐⭐⭐⭐ | Balanced speed/accuracy |
| `medium` | ⚡⚡ | ⭐⭐⭐⭐ | Good quality, reasonable speed |
| `large-v3` | ⚡ | ⭐⭐⭐⭐⭐ | Best quality, slow |
# ❌ This fails on CPU:
uv run speech-mine extract audio.wav out.csv --hf-token TOKEN
# ✅ This works on CPU:
uv run speech-mine extract audio.wav out.csv --hf-token TOKEN --compute-type float32
# ✅ This works on any system:
uv run speech-mine extract audio.wav out.csv --hf-token TOKEN --device cpu --compute-type float32 --model base

| Use Case | Command | Notes |
|---|---|---|
| Basic extraction (CPU) | `uv run speech-mine extract audio.wav out.csv --hf-token TOKEN --compute-type float32` | Safe for all systems |
| 2-person interview | `uv run speech-mine extract interview.wav out.csv --hf-token TOKEN --num-speakers 2 --compute-type float32` | Exact count for best accuracy |
| Meeting (known attendees) | `uv run speech-mine extract meeting.wav out.csv --hf-token TOKEN --num-speakers 5 --compute-type float32` | Count participants beforehand |
| Fast processing | `uv run speech-mine extract audio.wav out.csv --hf-token TOKEN --model base --compute-type float32` | Trade accuracy for speed |
| Format transcript | `uv run speech-mine format transcript.csv script.txt` | Create readable script |
| GPU processing | `uv run speech-mine extract audio.wav out.csv --hf-token TOKEN --device cuda --compute-type float16` | Faster with GPU |
# Set token once (optional)
export HF_TOKEN="your_huggingface_token"
# Then you can omit --hf-token:
uv run speech-mine extract audio.wav out.csv --compute-type float32

TBD