Increasing Transcription Delays Over Time #152

Open

GKT1 opened this issue Jan 21, 2025 · 4 comments

GKT1 commented Jan 21, 2025

  • When using Whisper with the Faster Whisper backend for real-time transcription, the transcription process becomes progressively slower over time.
  • Initially, the transcription is processed in near real time, but as the session continues, the processing delay increases. Eventually, the logger reports processing very large audio chunks, such as: Processing audio with duration 39:49.882 from faster whisper backend, i.e. a whole 39 minutes of audio in a single call.
  • This duration continues to grow as time progresses.
  • This problem seems to occur under the following conditions:
  1. A large audio chunk is received.
  2. There are silent parts in the audio.
  3. The audio contains multiple languages.
  4. The audio quality is poor.

Code to Reproduce:

from whisper_online import *  # provides FasterWhisperASR, OnlineASRProcessor
import numpy as np
import subprocess
import sys
import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[logging.StreamHandler(sys.stdout)]
)
logger = logging.getLogger(__name__)

# Decode the input file to raw 16 kHz mono float32 PCM on stdout.
cmd = [
    "ffmpeg",
    "-i", "file.mp4",
    "-vn",                    # drop the video stream
    "-acodec", "pcm_f32le",   # 32-bit float little-endian samples
    "-ar", "16000",           # 16 kHz, Whisper's expected sample rate
    "-ac", "1",               # mono
    "-f", "f32le",            # raw stream, no container
    "pipe:1"
]
proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL)

asr = FasterWhisperASR("ja", "large-v2")  # Japanese, large-v2 model
online = OnlineASRProcessor(asr)

while True:
    # Read 3 seconds of audio: 16000 samples/s * 4 bytes/sample * 3 s.
    chunk_size = 16000 * 4 * 3
    raw_pcm = proc.stdout.read(chunk_size)
    if not raw_pcm:
        break
    audio_array = np.frombuffer(raw_pcm, dtype=np.float32)
    online.insert_audio_chunk(audio_array)
    o = online.process_iter()  # (beg_timestamp, end_timestamp, text)
    print(o)

o = online.finish()  # flush whatever is still buffered at end of stream
print(o)

Gldkslfmsd (Collaborator) commented

There are silent parts in the audio.

Use the --vac option to enable the Voice Activity Controller; it is able to detect and skip the silent parts. The --vad option could also help.
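
For reference, the same thing can be done programmatically. A minimal sketch, assuming the VACOnlineASRProcessor wrapper and the use_vad() method from the current whisper_online.py (check the signatures against your version):

from whisper_online import FasterWhisperASR, VACOnlineASRProcessor

asr = FasterWhisperASR("ja", "large-v2")
asr.use_vad()  # like --vad: enables faster-whisper's built-in VAD filter

# Like --vac: wraps the processor with Silero VAD so that silent stretches
# are detected and skipped instead of accumulating in the audio buffer.
# The first argument is the online chunk size in seconds (assumed here).
online = VACOnlineASRProcessor(1.0, asr)

The rest of the loop (insert_audio_chunk / process_iter) stays the same as in the reproduction code above.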

The audio quality is poor.

In this case, poor quality and high latency are expected. You can check whether the Whisper model you're using is able to transcribe/translate your audio correctly in offline mode. If not, you need a better model. If yes, you can try a bigger min-chunk-size.
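
As a sketch of that offline check, assuming the load_audio helper and the FasterWhisperASR.transcribe method from whisper_online.py (the file name here is hypothetical):

from whisper_online import FasterWhisperASR, load_audio

# Offline sanity check: transcribe the whole file in one pass. If the model
# fails here, streaming will not do better and a bigger model is needed;
# if it succeeds, a larger min-chunk-size is worth trying.
asr = FasterWhisperASR("ja", "large-v2")
audio = load_audio("file.wav")  # hypothetical 16 kHz mono recording
for segment in asr.transcribe(audio):
    print(segment.start, segment.end, segment.text)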

The audio contains multiple languages.

This is challenging for Whisper, especially for ASR. The language is auto-detected from every incoming chunk; if the audio is unclear or ambiguous, the detection can be wrong, and then the transcript is wrong. A bigger min-chunk-size could help. Another option is updating the whisper-streaming code to make the language detection more robust: select only from a preselected set of languages, or detect the language on a bigger chunk and then keep it until the next silence or speaker change. Diarization could be added for that.
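
One possible shape of the "detect once on a bigger chunk, then pin the language" idea, using faster-whisper's language detection directly (a sketch, not an existing whisper-streaming option; the file name and model settings are illustrative):

from faster_whisper import WhisperModel
from whisper_online import FasterWhisperASR, load_audio

audio = load_audio("file.wav")  # hypothetical input

# Detect the language once, on the first 30 seconds, instead of per chunk.
detector = WhisperModel("large-v2", device="cuda", compute_type="float16")
_, info = detector.transcribe(audio[:16000 * 30])
print(info.language, info.language_probability)

# Pin the detected language for the whole streaming session.
asr = FasterWhisperASR(info.language, "large-v2")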

Good luck! Reopen if you need to continue the discussion.

GKT1 (Author) commented Jan 22, 2025

Thank you for the response! The issue seems to be that it is continuously sending 40 minutes of audio to Faster Whisper every second, which doesn't help the transcription either and makes it pretty much unusable. What's worse, the length of the audio keeps growing over time.

Maybe there could be an option to reset the process when this happens, ignore the prefix, or just something to force the audio buffer under a limit? That might help keep things running smoothly. What do you think?
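
A rough sketch of such a watchdog, assuming OnlineASRProcessor exposes audio_buffer, SAMPLING_RATE, init() and finish() as in the current whisper_online.py (attribute names may differ across versions):

MAX_BUFFER_SECONDS = 60  # hypothetical cap

def enforce_buffer_limit(online):
    # If the internal buffer has grown past the cap, flush what is there
    # and reset the processor so the latency cannot grow without bound.
    buffered = len(online.audio_buffer) / online.SAMPLING_RATE
    if buffered > MAX_BUFFER_SECONDS:
        flushed = online.finish()  # commit whatever is still unconfirmed
        online.init()              # restart with an empty buffer
        return flushed
    return None

This could be called once per iteration, right after process_iter() in the loop from the original post.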

Gldkslfmsd reopened this Jan 22, 2025
Gldkslfmsd (Collaborator) commented

Maybe there could be an option to reset the process when this happens, ignore prefix, or just something to force the audio buffer under a limit? That might help keep things running smoothly. What do you think?

Yes, I have also noticed it, but it's rare and I never found a replicable input to debug it. Can you share the options, model, and audio input with which the bug can be replicated?

GKT1 (Author) commented Jan 26, 2025

  • I use the code from the original post, which splits the audio into 3-second chunks and processes each with the large-v2 model, default settings, on a T4 GPU.

  • Audio source: https://www.youtube.com/watch?v=yjimwXeHw44 https://drive.google.com/file/d/1Ndw9_GTaeEOm1dFcGoGM_FpeGRnyDLmx

  • Log file after 12 hours of transcription. output (2).log

  • Sometimes, the audio buffer grows to 5 or 6 minutes. I also notice that processing time occasionally becomes much longer, and it doesn't seem to be related to the buffer length.

  • If I use vad or vac, the problem seems to go away, but then a lot of unclear speech that Whisper could still process gets skipped. Because of this, I prefer not to use vad or vac (a possible middle ground is sketched below).
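
If the default VAD is simply too aggressive, one option is to keep vad but relax its Silero parameters. A sketch, assuming FasterWhisperASR forwards extra arguments to faster-whisper through its transcribe_kargs dict (verify against your version; the values are illustrative, not tuned):

asr = FasterWhisperASR("ja", "large-v2")
asr.use_vad()  # enables faster-whisper's vad_filter
# A lower threshold keeps more borderline speech; a longer minimum silence
# avoids cutting inside hesitations.
asr.transcribe_kargs["vad_parameters"] = {
    "threshold": 0.2,
    "min_silence_duration_ms": 1000,
}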
