Fix runpod errors #258
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
ggmarts04 wants to merge 34 commits into bytedance:main from ggmarts04:fix-runpod-errors
Conversation
feat: Adapt model for RunPod serverless deployment

This commit introduces changes to allow the model to be deployed on RunPod Serverless. Key changes include:
- **`runpod_handler.py`**: A new handler script that serves as the entry point for RunPod. It processes video and audio URL inputs, downloads the files, and then uses the existing `Predictor` class to perform inference.
- **`Dockerfile`**: A new Dockerfile to build the container image for RunPod. It installs system dependencies (ffmpeg, libgl1), Python packages from `requirements.txt`, and the downloader, and sets the appropriate CMD for the RunPod Python environment.
- **`requirements.txt`**: Added the `runpod` package, which is necessary for the RunPod serverless environment.
- **`RUNPOD_DEPLOYMENT.md`**: New documentation detailing how to build the Docker image, deploy it to RunPod, and make requests to the serverless endpoint.

The existing core inference logic in `predict.py` and `scripts/inference.py` remains largely unchanged, with the new handler acting as an adapter for the RunPod environment and URL-based inputs.
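The handler described above could look roughly like the sketch below. This is an assumption-laden outline, not the PR's actual code: `parse_event` and `download_file` are hypothetical helper names, and the `Predictor` call is shown only as a comment since its signature is not given in the commit message.

```python
# Minimal sketch of a RunPod-style serverless handler (assumed structure).
import os
import tempfile
import urllib.request


def parse_event(event):
    """Validate the job payload and return (video_url, audio_url)."""
    inp = event.get("input", {})
    video_url = inp.get("video_url")
    audio_url = inp.get("audio_url")
    if not video_url or not audio_url:
        raise ValueError("Both 'video_url' and 'audio_url' are required")
    return video_url, audio_url


def download_file(url, suffix):
    """Download a remote file into a temporary path and return that path."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)
    urllib.request.urlretrieve(url, path)
    return path


def handler(event):
    """Entry point RunPod invokes for each job."""
    video_url, audio_url = parse_event(event)
    video_path = download_file(video_url, ".mp4")
    audio_path = download_file(audio_url, ".wav")
    # The existing Predictor class would run inference here, e.g.:
    # output = Predictor().predict(video=video_path, audio=audio_path)
    return {"video": video_path, "audio": audio_path}
```

Keeping URL validation separate from downloading makes the handler easy to exercise locally with a mock event, which is exactly what a later commit in this PR does.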
fix: Add curl to Dockerfile system dependencies

The previous Dockerfile was missing `curl`, which is required for downloading the `pget` tool during the image build. This resulted in a "curl: not found" error. This commit adds `curl` to the `apt-get install` command in the Dockerfile so it is available.
This commit addresses two main issues:
1. **Fixes `pget` 404 error during local testing:**
The `runpod_handler.py` script had a local testing block
that attempted to download a `CHANGELOG.md` file from a
URL that resulted in a 404 error. This was causing
RunPod's local test run to fail. The problematic URL in the
`mock_event` for `audio_url` has been changed to a valid one
(uses the README.md URL, same as `video_url` for test purposes).
2. **Prevents errors from existing symlinks:**
The `predict.py` script's `setup()` method attempted to create
symbolic links without checking if they already existed.
This could lead to errors if the script was run multiple
times or if the links were already present. The script has
been updated to check for the existence of these links
before attempting to create them.
These changes should improve the reliability of deploying and running this model on RunPod Serverless.
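The symlink guard in point 2 can be sketched as follows; `ensure_symlink` is a hypothetical helper name, since the commit message does not show the actual code in `setup()`:

```python
import os


def ensure_symlink(src, dst):
    """Create the symlink dst -> src only if nothing is already there.

    An unconditional os.symlink() raises FileExistsError when setup()
    runs more than once; checking first makes the call idempotent.
    """
    if os.path.islink(dst) or os.path.exists(dst):
        return  # link (or file) already present from a previous run
    os.symlink(src, dst)
```

Checking `os.path.islink` in addition to `os.path.exists` also covers a broken symlink, for which `exists()` returns False.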
This commit addresses the following issues:
1. **Fixes Whisper model loading:**
The `Audio2Feature` class was attempting to load the Whisper
model from a hardcoded local path (`checkpoints/whisper/tiny.pt`)
which does not exist. This caused a `RuntimeError`.
The `model_path` in `latentsync/whisper/audio2feature.py` has been
changed from `"checkpoints/whisper/tiny.pt"` to `"tiny"`. This allows
the Whisper library's `load_model` function to correctly identify
the model by name and handle its download and caching automatically.
2. **Improves error handling in inference:**
The `predict.py` script used `os.system()` to call the main
inference script (`scripts/inference.py`). This did not check the
exit code of the subprocess, potentially masking errors if the
inference script failed. `os.system()` has been replaced with
`subprocess.check_call()`, which will raise a `CalledProcessError`
if the inference script returns a non-zero exit code. This ensures
that failures during the actual model inference are properly
propagated and reported in the RunPod logs.
These changes should resolve the `RuntimeError: Model checkpoints/whisper/tiny.pt not found` and provide more robust error reporting.
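The `os.system()` to `subprocess.check_call()` swap in point 2 amounts to the following (a minimal sketch; `run_inference` is a hypothetical wrapper name):

```python
import subprocess


def run_inference(cmd):
    """Run the inference command and fail loudly on a non-zero exit code.

    os.system(cmd) returns the exit status, but nothing inspected it, so
    a crashed inference script looked like success. check_call raises
    CalledProcessError instead, which surfaces in the RunPod logs.
    """
    subprocess.check_call(cmd)
```

Passing the command as an argument list (rather than a shell string, as `os.system` requires) also avoids shell-quoting issues with paths.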
This commit addresses issues related to checkpoint and model loading:
1. **Ensure `setup_env.sh` Execution:**
The `Dockerfile` was modified to execute `setup_env.sh` during
the image build. This script downloads specific critical checkpoints
like `whisper/tiny.pt` and `latentsync_unet.pt` from Hugging Face
into the `checkpoints/` directory.
2. **Correct Whisper Model Path:**
The `latentsync/whisper/audio2feature.py` was updated to use
`model_path="checkpoints/whisper/tiny.pt"`, aligning with the file
downloaded by `setup_env.sh`.
3. **Ensure Main Model Archive (`model.tar`) is Always Downloaded:**
Modified `predict.py`'s `setup()` method to always call
`download_weights(MODEL_URL, MODEL_CACHE)`. This removes a
previous condition that skipped downloading `model.tar` if the
`checkpoints/` directory already existed (e.g., due to
`setup_env.sh`). This is crucial because `model.tar` contains
auxiliary models (e.g., for face detection, VGG) that are needed
for symbolic linking and are not covered by `setup_env.sh`.
The extraction of `model.tar` will occur after `setup_env.sh`
has placed its files, potentially overwriting some, which is
acceptable as `model.tar` is considered more comprehensive for
the auxiliary files.
4. **Path Confirmation and Error Handling Kept:**
- Confirmed `predict.py` uses the correct path for `latentsync_unet.pt`.
- Retained the improved error handling in `predict.py` that uses
`subprocess.check_call()` for more robust error reporting.
These changes aim to create a more reliable setup process where all
necessary model components and checkpoints are correctly downloaded and
placed for the application to run on RunPod Serverless.
This commit addresses an error during the Docker build process where `setup_env.sh` failed because `hf_transfer` was not available. The `huggingface-cli` attempts to use `hf_transfer` for faster downloads when `HF_HUB_ENABLE_HF_TRANSFER=1` is set (which it is in this Dockerfile). However, the `hf_transfer` package was not installed. This commit modifies the `Dockerfile` to include `hf_transfer` in the `pip install` command, ensuring it's available in the environment. This should allow `setup_env.sh` to execute successfully and download the necessary checkpoints using the faster transfer method.
This commit addresses a `RuntimeError: stack expects a non-empty TensorList` which was preceded by ffmpeg-related errors: "Unrecognized option 'crf'" and "Error: Could not open video." The errors indicated a problem with processing the input video.

The `latentsync/utils/util.py` file contained an `ffmpeg` command within its `read_video` function, specifically for changing the video's FPS to 25. This command included `-crf 18` as an output option. While `-crf` is a valid ffmpeg option for encoding, it is suspected that in the specific execution environment or ffmpeg version within the Docker container, this option (or its interaction with other options in that command) was causing ffmpeg to fail and not produce a valid temporary video file. This failure then led to `cv2.VideoCapture` not being able to open the video, resulting in no frames for face detection, and ultimately the `torch.stack` error.

This commit removes the `-crf 18` option from this specific intermediate ffmpeg command. FFmpeg will use its default quality settings for this temporary transcoding. The final video encoding in `lipsync_pipeline.py` still uses `-crf 18` appropriately. This change aims to allow the intermediate video processing to complete successfully, enabling proper video frame loading and subsequent face detection.
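The intermediate transcode command, without `-crf`, could be built roughly as shown below. The exact flags are an approximation, not the literal command in `latentsync/utils/util.py`; the point is the absence of `-crf 18`:

```python
def fps_convert_cmd(input_path, output_path, fps=25):
    """Build the intermediate 25-fps transcode command used by read_video.

    Note the absence of "-crf 18": the container's ffmpeg rejected that
    option, leaving no temporary file for cv2.VideoCapture to open.
    Default quality settings are used for this throwaway transcode.
    """
    return [
        "ffmpeg", "-y", "-loglevel", "error",
        "-i", input_path,
        "-r", str(fps),   # resample to 25 fps
        output_path,
    ]
```

A list like this would then be handed to `subprocess.check_call`, consistent with the error-handling change elsewhere in this PR.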
This commit addresses the "Unrecognized option 'crf'" error that occurred during the final ffmpeg command in the lipsync pipeline. The issue was likely caused by the Python script not running within the 'latentsync' conda environment, where a fully-featured ffmpeg was installed by `setup_env.sh`; instead, it was likely using the system's default ffmpeg, which may have been older or lacked certain features.

The `Dockerfile`'s `CMD` instruction has been modified to use `conda run -n latentsync --no-capture-output python runpod_handler.py`. This ensures that the `runpod_handler.py` script, and consequently any ffmpeg processes it invokes via `subprocess`, will use the ffmpeg version from the 'latentsync' conda environment, resolving the issues with unrecognized options like '-crf'.
This commit modifies `setup_env.sh` to address issues with `conda activate` not working as expected when the script is executed by Docker's `RUN` instruction. The `conda activate latentsync` line has been removed; instead, all commands that need to operate within the `latentsync` conda environment (such as `conda install`, `pip install`, and `huggingface-cli download`) are now prefixed with `conda run -n latentsync --no-capture-output`.

This change ensures that these commands correctly use the specified conda environment, leading to reliable installation of dependencies (including `cog` and the correct `ffmpeg` version from `conda-forge`) during the Docker build process. It should resolve previous errors related to `conda activate` failing and subsequent `ModuleNotFoundError` or incorrect `ffmpeg` versions being used.
This commit updates `setup_env.sh` to improve the reliability of
commands executed within the Conda environment during Docker builds.
Changes include:
1. Corrected the `conda install` syntax to use the `-n <envname>`
flag directly, instead of wrapping with `conda run`.
2. Modified `pip install` to be invoked via `python -m pip install ...`
within `conda run`. This ensures the `pip` associated with the
Conda environment's Python is used.
3. Modified `huggingface-cli download` to be invoked via
`python -m huggingface_hub.commands.cli download ...` within
`conda run`. This ensures the CLI commands from the `huggingface-hub`
package are correctly found and executed using the environment's Python.
These changes are intended to prevent "command not found" (exit code 127)
errors that can occur if shell activation or PATH issues prevent
executables like `pip` or `huggingface-cli` from being found directly
during scripted Conda operations in Docker.