Increasing Transcription Delays Over Time #152

Open

GKT1 opened this issue Jan 21, 2025 · 4 comments

GKT1 commented Jan 21, 2025

  • When using Whisper with the Faster Whisper backend for real-time transcription, the transcription process becomes progressively slower over time.
  • Initially, the transcription is processed in near real time, but as the session continues, the processing delay increases. Eventually, the logger reports processing very large audio chunks, such as: Processing audio with duration 39:49.882 from faster whisper backend, i.e. a whole 39 minutes of audio in a single call.
  • This duration continues to grow as time progresses.
  • This problem seems to occur under the following conditions:
  1. A large audio chunk is received.
  2. There are silent parts in the audio.
  3. The audio contains multiple languages.
  4. The audio quality is poor.

Code to Reproduce:

from whisper_online import *  # provides FasterWhisperASR, OnlineASRProcessor
import numpy as np
import subprocess
import sys
import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[logging.StreamHandler(sys.stdout)]
)
logger = logging.getLogger(__name__)

# Decode the input file to raw 16 kHz mono float32 PCM on stdout.
cmd = [
    "ffmpeg",
    "-i", "file.mp4",
    "-vn",                    # drop the video stream
    "-acodec", "pcm_f32le",   # 32-bit float little-endian samples
    "-ar", "16000",           # 16 kHz, Whisper's expected sample rate
    "-ac", "1",               # mono
    "-f", "f32le",            # raw stream, no container
    "pipe:1"
]
proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL)

asr = FasterWhisperASR("ja", "large-v2")  # Japanese, large-v2 model
online = OnlineASRProcessor(asr)

while True:
    # Read 3 seconds of audio: 16000 samples/s * 4 bytes/sample * 3 s.
    chunk_size = 16000 * 4 * 3
    raw_pcm = proc.stdout.read(chunk_size)
    if not raw_pcm:
        break
    audio_array = np.frombuffer(raw_pcm, dtype=np.float32)
    online.insert_audio_chunk(audio_array)
    o = online.process_iter()  # (beg_timestamp, end_timestamp, text)
    print(o)

o = online.finish()  # flush whatever is still buffered at end of stream
print(o)

Gldkslfmsd (Collaborator) commented

There are silent parts in the audio.

Use the --vac option to enable the Voice Activity Controller; it is able to detect and skip the silent parts. The --vad option could also help.
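
For reference, the same thing can be done programmatically. A minimal sketch, assuming the VACOnlineASRProcessor wrapper and the use_vad() method from the current whisper_online.py (check the signatures against your version):

from whisper_online import FasterWhisperASR, VACOnlineASRProcessor

asr = FasterWhisperASR("ja", "large-v2")
asr.use_vad()  # like --vad: enables faster-whisper's built-in VAD filter

# Like --vac: wraps the processor with Silero VAD so that silent stretches
# are detected and skipped instead of accumulating in the audio buffer.
# The first argument is the online chunk size in seconds (assumed here).
online = VACOnlineASRProcessor(1.0, asr)

The rest of the loop (insert_audio_chunk / process_iter) stays the same as in the reproduction code above.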

The audio quality is poor.

In this case, poor quality and high latency are expected. You can check whether the Whisper model you're using is able to transcribe/translate your audio correctly in offline mode. If not, you need a better model. If yes, you can try a bigger min-chunk-size.
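
As a sketch of that offline check, assuming the load_audio helper and the FasterWhisperASR.transcribe method from whisper_online.py (the file name here is hypothetical):

from whisper_online import FasterWhisperASR, load_audio

# Offline sanity check: transcribe the whole file in one pass. If the model
# fails here, streaming will not do better and a bigger model is needed;
# if it succeeds, a larger min-chunk-size is worth trying.
asr = FasterWhisperASR("ja", "large-v2")
audio = load_audio("file.wav")  # hypothetical 16 kHz mono recording
for segment in asr.transcribe(audio):
    print(segment.start, segment.end, segment.text)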

The audio contains multiple languages.

This is challenging for Whisper, especially for ASR. The language is auto-detected from every incoming chunk; if the audio is unclear or ambiguous, the detection can be wrong, and then the transcript is wrong. A bigger min-chunk-size could help. Another option is updating the whisper-streaming code to make the language detection more robust: select only from a preselected set of languages, or detect the language on a bigger chunk and then keep it until the next silence or speaker change. Diarization could be added for that.
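
One possible shape of the "detect once on a bigger chunk, then pin the language" idea, using faster-whisper's language detection directly (a sketch, not an existing whisper-streaming option; the file name and model settings are illustrative):

from faster_whisper import WhisperModel
from whisper_online import FasterWhisperASR, load_audio

audio = load_audio("file.wav")  # hypothetical input

# Detect the language once, on the first 30 seconds, instead of per chunk.
detector = WhisperModel("large-v2", device="cuda", compute_type="float16")
_, info = detector.transcribe(audio[:16000 * 30])
print(info.language, info.language_probability)

# Pin the detected language for the whole streaming session.
asr = FasterWhisperASR(info.language, "large-v2")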

Good luck! Reopen if you need to continue the discussion.

GKT1 (Author) commented Jan 22, 2025

Thank you for the response! The issue seems to be that it is continuously sending 40 minutes of audio to Faster Whisper every second, which doesn't help the transcription either and makes it pretty much unusable. What's worse, the length of the audio keeps growing over time.

Maybe there could be an option to reset the process when this happens, ignore the prefix, or just something to force the audio buffer under a limit? That might help keep things running smoothly. What do you think?
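
A rough sketch of such a watchdog, assuming OnlineASRProcessor exposes audio_buffer, SAMPLING_RATE, init() and finish() as in the current whisper_online.py (attribute names may differ across versions):

MAX_BUFFER_SECONDS = 60  # hypothetical cap

def enforce_buffer_limit(online):
    # If the internal buffer has grown past the cap, flush what is there
    # and reset the processor so the latency cannot grow without bound.
    buffered = len(online.audio_buffer) / online.SAMPLING_RATE
    if buffered > MAX_BUFFER_SECONDS:
        flushed = online.finish()  # commit whatever is still unconfirmed
        online.init()              # restart with an empty buffer
        return flushed
    return None

This could be called once per iteration, right after process_iter() in the loop from the original post.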

Gldkslfmsd reopened this Jan 22, 2025
Gldkslfmsd (Collaborator) commented

Maybe there could be an option to reset the process when this happens, ignore prefix, or just something to force the audio buffer under a limit? That might help keep things running smoothly. What do you think?

Yes, I have also noticed it, but it's rare and I never found a replicable input to debug it. Can you share the options, model, and audio input with which the bug can be replicated?

GKT1 (Author) commented Jan 26, 2025

  • I use the code from the original post, which splits the audio into 3-second chunks and processes each with the large-v2 model, default settings, on a T4 GPU.

  • Audio source: https://www.youtube.com/watch?v=yjimwXeHw44 https://drive.google.com/file/d/1Ndw9_GTaeEOm1dFcGoGM_FpeGRnyDLmx

  • Log file after 12 hours of transcription. output (2).log

  • Sometimes, the audio buffer grows to 5 or 6 minutes. I also notice that processing time occasionally becomes much longer, and it doesn't seem to be related to the buffer length.

  • If I use vad or vac, the problem seems to go away, but then a lot of unclear speech that Whisper could still process gets skipped. Because of this, I prefer not to use vad or vac (a possible middle ground is sketched below).
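
If the default VAD is simply too aggressive, one option is to keep vad but relax its Silero parameters. A sketch, assuming FasterWhisperASR forwards extra arguments to faster-whisper through its transcribe_kargs dict (verify against your version; the values are illustrative, not tuned):

asr = FasterWhisperASR("ja", "large-v2")
asr.use_vad()  # enables faster-whisper's vad_filter
# A lower threshold keeps more borderline speech; a longer minimum silence
# avoids cutting inside hesitations.
asr.transcribe_kargs["vad_parameters"] = {
    "threshold": 0.2,
    "min_silence_duration_ms": 1000,
}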
