
Commit bc52add

fixie (#56)
* fixie
* fix bug
* fix names
* quickstart
Parent: c90d288

92 files changed (+4707, -2212 lines)


.gitignore

Lines changed: 6 additions & 0 deletions
@@ -42,3 +42,9 @@ extras/speaker-omni-experimental/cache/*
 
 # AI Stuff
 .claude
+
+# SSL
+extras/speaker-recognition/ssl/*
+
+# nginx
+extras/speaker-recognition/nginx.conf

CLAUDE.md

Lines changed: 17 additions & 0 deletions
@@ -307,6 +307,11 @@ websocket.send(JSON.stringify(audioStop) + '\n');
 ### Code Style
 - **Python**: Black formatter with 100-character line length, isort for imports
 - **TypeScript**: Standard React Native conventions
+- **Import Guidelines**:
+  - NEVER import modules in the middle of functions or files
+  - ALL imports must be at the top of the file after the docstring
+  - Use lazy imports sparingly and only when absolutely necessary for circular import issues
+  - Group imports: standard library, third-party, local imports
 
 ### Health Monitoring
 The system includes comprehensive health checks:
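For reference, the layout the new import guidelines describe looks like the following minimal sketch (the module names are illustrative, not taken from this repository):

```python
"""Module docstring comes first; all imports follow immediately after it."""

# Standard library
import json
import logging

# Third-party
import requests

# Local (hypothetical module path, for illustration only)
from app.utils import chunk_audio

logger = logging.getLogger(__name__)
```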
@@ -405,6 +410,11 @@ Access via: `extras/speaker-recognition/webui` → Live Inference
 3. Adjust speaker identification settings (confidence threshold)
 4. Start live session to begin real-time transcription and speaker ID
 
+**Technical Details:**
+- **Audio Processing**: Uses the browser's native sample rate (typically 44.1kHz or 48kHz, not hardcoded 16kHz)
+- **Buffer Retention**: 120 seconds of audio for improved utterance capture
+- **Real-time Updates**: Live transcription with speaker identification results
+
 #### Using Speaker Analysis
 1. Go to Speakers page → Embedding Analysis tab
 2. Select analysis method (UMAP, t-SNE, PCA)
@@ -418,6 +428,13 @@ Access via: `extras/speaker-recognition/webui` → Live Inference
 - Live inference requires Deepgram API key for streaming transcription
 - Speaker identification uses existing enrolled speakers from database
 
+### Live Inference Troubleshooting
+- **"NaN:NaN" timestamps**: Fixed in recent updates; ensure you're using the latest version
+- **Poor speaker identification**: Try adjusting the confidence threshold or re-enrolling speakers
+- **Audio processing delays**: Check the browser console for sample-rate detection logs
+- **Buffer overflow issues**: Retention was extended to 120 seconds for better performance
+- **"extraction_failed" errors**: Usually indicates audio buffer timing issues; check console logs for buffer availability
+
 ## Notes for Claude
 Check if the src/ is volume mounted. If not, do compose build so that code changes are reflected. Do not simply run `docker compose restart` as it will not rebuild the image.
 Check backend/advanced-backend/Docs for up to date information on advanced backend.

backends/advanced/.env.template

Lines changed: 3 additions & 1 deletion
@@ -83,11 +83,13 @@ DEBUG_DIR=./data/debug_dir
 # ========================================
 # These settings control how the browser accesses the backend for audio playback
 
-# The IP address or hostname where your backend is publicly accessible
+# The IP address or hostname where your backend is publicly accessible from the browser
 # Examples:
 # - For local development: localhost or 127.0.0.1
 # - For LAN access: your machine's IP (e.g., 192.168.1.100)
+# - For VPN/Tailscale access: your VPN IP (e.g., 100.64.x.x for Tailscale)
 # - For internet access: your domain or public IP (e.g., friend.example.com)
+# Note: This must be accessible from your browser, not from the Docker container
 HOST_IP=localhost
 
 # Backend API port (where audio files are served)
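To make the new Tailscale guidance concrete, a deployment reached over a tailnet would put the machine's Tailscale address in `.env`; the address below is a placeholder, not a value from this commit:

```bash
# .env — backend reachable from the browser over Tailscale (placeholder IP)
HOST_IP=100.64.12.34
```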

backends/advanced/Docs/README_speaker_enrollment.md

Lines changed: 4 additions & 4 deletions
@@ -181,10 +181,10 @@ Edit `speaker_recognition/speaker_recognition.py` to adjust:
 
 ### Audio Settings
 
-The system is configured for:
-- Sample rate: 16kHz
-- Channels: Mono
-- Format: WAV files
+The system supports:
+- Sample rate: Dynamic detection (commonly 16kHz, 44.1kHz, or 48kHz)
+- Channels: Mono (stereo converted to mono automatically)
+- Format: WAV files (recommended), WebM, MP4
 
 ## Troubleshooting
 
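As a rough illustration of the mono conversion and dynamic rate detection described in the hunk above, a loader might look like this sketch. It assumes the `soundfile` library (which covers WAV; WebM/MP4 would need an ffmpeg-based decoder) and is not the project's actual loading code:

```python
import numpy as np
import soundfile as sf  # assumed dependency; reads WAV and other libsndfile formats

def load_mono(path: str) -> tuple[np.ndarray, int]:
    """Load an audio file, collapsing stereo to mono; returns (samples, sample_rate)."""
    samples, sample_rate = sf.read(path, dtype="float32")
    if samples.ndim == 2:  # stereo or multi-channel: average channels down to mono
        samples = samples.mean(axis=1)
    return samples, sample_rate  # rate is detected from the file, not assumed 16kHz
```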

backends/advanced/Docs/architecture.md

Lines changed: 58 additions & 1 deletion
@@ -241,7 +241,7 @@ Wyoming is a peer-to-peer protocol for voice assistants that combines JSONL (JSO
 - **Wyoming Protocol + Opus Decoding**: Combines Wyoming session management with OMI Opus decoding
 - **Continuous Streaming**: OMI devices stream continuously, audio-start/stop events are optional
 - **Timestamp Preservation**: Uses timestamps from Wyoming headers when provided
-- **OMI-Optimized**: Hardcoded 16kHz mono format for OMI device compatibility
+- **Dynamic Sample Rate**: Automatically detects and adapts to client sample rate (typically 16kHz for OMI devices, but supports other rates)
 
 **Simple Backend (`/ws`)**:
 - **Minimal Wyoming Support**: Parses audio-chunk events, silently ignores control events
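The dynamic-rate behaviour follows from the Wyoming JSONL framing: each event is a JSON header line (optionally followed by a binary payload), and audio-chunk headers advertise the stream format. A minimal header parse might look like this sketch; the field names follow the Wyoming convention, but this is not the backend's actual parser:

```python
import json

DEFAULT_RATE = 16000  # fallback when the client does not advertise a rate (OMI default)

def parse_wyoming_header(line: bytes) -> tuple[str, int, int]:
    """Parse one Wyoming JSONL header; return (event_type, sample_rate, payload_length)."""
    event = json.loads(line)
    data = event.get("data") or {}
    rate = data.get("rate", DEFAULT_RATE)  # dynamic sample rate with safe fallback
    return event["type"], rate, event.get("payload_length") or 0
```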
@@ -317,6 +317,24 @@ client_state = ClientState(
 - **Connection Tracking**: Real-time monitoring of active clients
 - **State Management**: Simplified client state for conversation tracking only
 - **Centralized Processing**: Application-level processors handle all background tasks
+- **Dynamic Sample Rate**: Client state tracks actual sample rate from audio chunks
+- **Audio Buffer Management**: Sophisticated buffer system with timing and collection management
+
+### Audio Buffer Management
+
+The system implements advanced audio buffer management for reliable processing:
+
+**Buffer Collection**:
+- **Retention**: Configurable buffer retention (default 120 seconds for speaker identification)
+- **Timeout**: 1.5-minute collection timeout to prevent indefinite buffering
+- **Isolation**: Each client maintains isolated buffer state
+- **Dynamic Sizing**: Adapts to actual sample rate and chunk sizes
+
+**Buffer State Tracking**:
+- Sample rate detection from incoming audio chunks
+- Automatic fallback to default rates when not specified
+- Buffer timing synchronization for accurate segment extraction
+- Memory-efficient circular buffer implementation
 
 ### Application-Level Processing Architecture
 
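As a rough illustration of the buffer behaviour described in the hunk above (retention window, dynamic sizing, per-client isolation), a deque-backed circular buffer might look like the following sketch; it is not the backend's implementation:

```python
from collections import deque

class AudioRingBuffer:
    """Per-client circular buffer retaining the last `retention_seconds` of PCM audio."""

    def __init__(self, retention_seconds: float = 120.0, default_rate: int = 16000):
        self.retention_seconds = retention_seconds
        self.sample_rate = default_rate       # updated when a chunk advertises its rate
        self._chunks: deque[bytes] = deque()  # deque gives O(1) eviction at the head
        self._buffered_samples = 0

    def append(self, pcm: bytes, rate: int | None = None, sample_width: int = 2) -> None:
        """Add a PCM chunk, tracking the advertised rate and evicting audio past the window."""
        if rate:                              # dynamic sample rate detection with fallback
            self.sample_rate = rate
        self._chunks.append(pcm)
        self._buffered_samples += len(pcm) // sample_width
        max_samples = int(self.retention_seconds * self.sample_rate)
        while self._buffered_samples > max_samples and self._chunks:
            dropped = self._chunks.popleft()  # drop oldest audio beyond the retention window
            self._buffered_samples -= len(dropped) // sample_width
```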
@@ -785,6 +803,45 @@ flowchart TB
 5. **Authorization**: Per-endpoint permission checking with simplified ownership validation
 6. **Data Isolation**: User-scoped data access via client ID mapping and ownership validation
 
+## Speaker Recognition Integration
+
+The advanced backend integrates with an external speaker recognition service for real-time speaker identification during conversations.
+
+### Integration Architecture
+
+**Service Communication**:
+- **HTTP API**: RESTful endpoints for speaker enrollment and management
+- **Real-time Processing**: Speaker identification during live transcription
+- **Asynchronous Pipeline**: Non-blocking speaker identification parallel to transcription
+
+**Key Features**:
+- **Dynamic Enrollment**: Add speakers through audio samples
+- **Live Identification**: Real-time speaker recognition during conversations
+- **Confidence Scoring**: Adjustable thresholds for identification accuracy
+- **Multi-speaker Support**: Handles conversations with multiple participants
+
+### Speaker Recognition Flow
+
+1. **Audio Collection**: Capture audio chunks with proper buffering
+2. **Feature Extraction**: Generate speaker embeddings from audio segments
+3. **Identity Matching**: Compare against enrolled speaker database
+4. **Result Integration**: Enhance transcripts with speaker identification
+
+### Configuration
+
+```yaml
+# Environment variables for speaker recognition
+SPEAKER_SERVICE_URL: "http://speaker-recognition:8001"
+SPEAKER_CONFIDENCE_THRESHOLD: 0.15  # Adjustable confidence level
+```
+
+### API Endpoints
+
+- `POST /api/speaker/enroll` - Enroll new speaker with audio samples
+- `GET /api/speaker/list` - List enrolled speakers
+- `POST /api/speaker/identify` - Identify speaker from audio segment
+- `DELETE /api/speaker/{speaker_id}` - Remove enrolled speaker
+
 ## Security Architecture
 
 ### Authentication Layers
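Based on the endpoints and environment variables documented in the hunk above, a backend-side call to the identification endpoint might look like the following sketch. The multipart field name and the response shape are assumptions, not confirmed by this commit:

```python
import os
import requests

SPEAKER_SERVICE_URL = os.getenv("SPEAKER_SERVICE_URL", "http://speaker-recognition:8001")
CONFIDENCE_THRESHOLD = float(os.getenv("SPEAKER_CONFIDENCE_THRESHOLD", "0.15"))

def identify_speaker(wav_path: str) -> str | None:
    """POST an audio segment to /api/speaker/identify; return speaker id if above threshold."""
    with open(wav_path, "rb") as f:
        resp = requests.post(
            f"{SPEAKER_SERVICE_URL}/api/speaker/identify",
            files={"file": f},  # field name is an assumption
            timeout=10,
        )
    resp.raise_for_status()
    result = resp.json()  # assumed shape: {"speaker_id": ..., "confidence": ...}
    if result.get("confidence", 0.0) >= CONFIDENCE_THRESHOLD:
        return result.get("speaker_id")
    return None  # below threshold: treat as unknown speaker
```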
Lines changed: 132 additions & 0 deletions
@@ -0,0 +1,132 @@
+# Memory Configuration Guide
+
+This guide helps you set up and configure the memory system for the Friend Advanced Backend.
+
+## Quick Start
+
+1. **Copy the template configuration**:
+   ```bash
+   cp memory_config.yaml.template memory_config.yaml
+   ```
+
+2. **Edit memory_config.yaml** with your preferred settings:
+   ```yaml
+   memory:
+     provider: "mem0"  # or "basic" for simpler setup
+
+     # Provider-specific configuration
+     mem0:
+       model_provider: "openai"  # or "ollama" for local
+       embedding_model: "text-embedding-3-small"
+       llm_model: "gpt-4o-mini"
+   ```
+
+3. **Set environment variables** in `.env`:
+   ```bash
+   # For OpenAI
+   OPENAI_API_KEY=your-api-key
+
+   # For Ollama (local)
+   OLLAMA_BASE_URL=http://ollama:11434
+   ```
+
+## Configuration Options
+
+### Memory Providers
+
+#### mem0 (Recommended)
+Advanced memory system with semantic search and context awareness.
+
+**Configuration**:
+```yaml
+memory:
+  provider: "mem0"
+  mem0:
+    model_provider: "openai"  # or "ollama"
+    embedding_model: "text-embedding-3-small"
+    llm_model: "gpt-4o-mini"
+    prompt_template: "custom_prompt_here"  # Optional
+```
+
+#### basic
+Simple memory storage without advanced features.
+
+**Configuration**:
+```yaml
+memory:
+  provider: "basic"
+  # No additional configuration needed
+```
+
+### Model Selection
+
+#### OpenAI Models
+- **LLM**: `gpt-4o-mini`, `gpt-4o`, `gpt-3.5-turbo`
+- **Embeddings**: `text-embedding-3-small`, `text-embedding-3-large`
+
+#### Ollama Models (Local)
+- **LLM**: `llama3`, `mistral`, `qwen2.5`
+- **Embeddings**: `nomic-embed-text`, `all-minilm`
+
+## Hot Reload
+
+The configuration supports hot reloading: changes are applied automatically without restarting the service.
+
+## Validation
+
+The system validates your configuration on startup and logs any issues:
+- Missing required fields
+- Invalid provider names
+- Incompatible model combinations
+
+## Troubleshooting
+
+### Common Issues
+
+1. **"Provider not found"**: Check spelling in `provider` field
+2. **"API key missing"**: Ensure environment variables are set
+3. **"Model not available"**: Verify model names match provider's available models
+4. **"Connection refused"**: Check Ollama is running if using local models
+
+### Debug Mode
+
+Enable debug logging by setting:
+```bash
+DEBUG=true
+```
+
+This provides detailed information about memory processing and configuration loading.
+
+## Examples
+
+### OpenAI Setup
+```yaml
+memory:
+  provider: "mem0"
+  mem0:
+    model_provider: "openai"
+    embedding_model: "text-embedding-3-small"
+    llm_model: "gpt-4o-mini"
+```
+
+### Local Ollama Setup
+```yaml
+memory:
+  provider: "mem0"
+  mem0:
+    model_provider: "ollama"
+    embedding_model: "nomic-embed-text"
+    llm_model: "llama3"
+```
+
+### Minimal Setup
+```yaml
+memory:
+  provider: "basic"
+```
+
+## Next Steps
+
+- Configure action items detection in `memory_config.yaml`
+- Set up custom prompt templates for your use case
+- Monitor memory processing in the debug dashboard
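To make the guide's validation and hot-reload behaviour concrete, here is a minimal sketch of a loader that checks the fields described above (PyYAML assumed; this is not the backend's actual loader, and a hot-reload wrapper would simply re-invoke it when the file's mtime changes):

```python
import os
import yaml  # PyYAML, assumed available

VALID_PROVIDERS = {"mem0", "basic"}

def load_memory_config(path: str = "memory_config.yaml") -> dict:
    """Load and validate memory_config.yaml, mirroring the startup checks above."""
    with open(path) as f:
        config = yaml.safe_load(f) or {}
    memory = config.get("memory", {})
    provider = memory.get("provider")
    if provider not in VALID_PROVIDERS:        # "Provider not found"
        raise ValueError(f"Unknown memory provider: {provider!r}")
    if provider == "mem0":
        mem0 = memory.get("mem0", {})
        if mem0.get("model_provider") == "openai" and not os.getenv("OPENAI_API_KEY"):
            raise ValueError("OPENAI_API_KEY is not set")  # "API key missing"
    return memory
```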

backends/advanced/docker-compose.yml

Lines changed: 1 addition & 31 deletions
@@ -104,36 +104,6 @@ services:
 #   - ./nginx.conf:/etc/nginx/nginx.conf:ro
 #   ports: ["80:80"]  # publish once; ngrok points here
 
-# speaker-recognition:
-#   build:
-#     context: ../../extras/speaker-recognition
-#     dockerfile: Dockerfile
-#   # image: speaker-recognition:latest
-#   ports:
-#     - "8001:8001"
-#   volumes:
-#     # Persist Hugging Face cache (models) between restarts
-#     - ./data/speaker_model_cache:/models
-#     - ./data/audio_chunks:/app/audio_chunks  # Share audio chunks with backend
-#     - ./data/speaker_debug:/app/debug
-#   deploy:
-#     resources:
-#       reservations:
-#         devices:
-#           - driver: nvidia
-#             count: all
-#             capabilities: [gpu]
-#   environment:
-#     - HF_HOME=/models
-#     - HF_TOKEN=${HF_TOKEN}
-#     - SIMILARITY_THRESHOLD=0.85
-#   restart: unless-stopped
-#   healthcheck:
-#     test: ["CMD", "curl", "-f", "http://localhost:8001/health"]
-#     interval: 30s
-#     timeout: 10s
-#     retries: 3
-
 # ollama:
 #   image: ollama/ollama:latest
 #   container_name: ollama

@@ -172,4 +142,4 @@
 # neo4j_data:
 #   driver: local
 # neo4j_logs:
-#   driver: local
+#   driver: local
