Fix runpod errors #258
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
ggmarts04 wants to merge 34 commits into bytedance:main from ggmarts04:fix-runpod-errors
Conversation
feat: Adapt model for RunPod serverless deployment

This commit introduces changes to allow the model to be deployed on RunPod Serverless. Key changes include:
- **`runpod_handler.py`**: A new handler script that serves as the entry point for RunPod. It processes video and audio URL inputs, downloads the files, and then uses the existing `Predictor` class to perform inference.
- **`Dockerfile`**: A new Dockerfile to build the container image for RunPod. It installs system dependencies (ffmpeg, libgl1), Python packages from `requirements.txt`, and the downloader, and sets the appropriate CMD for the RunPod Python environment.
- **`requirements.txt`**: Added the `runpod` package, which is necessary for the RunPod serverless environment.
- **`RUNPOD_DEPLOYMENT.md`**: New documentation detailing how to build the Docker image, deploy it to RunPod, and make requests to the serverless endpoint.

The existing core inference logic in `predict.py` and `scripts/inference.py` remains largely unchanged, with the new handler acting as an adapter for the RunPod environment and URL-based inputs.
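The handler described above could look roughly like the sketch below. This is an assumption-laden outline, not the PR's actual code: `parse_event` and `download_file` are hypothetical helper names, and the `Predictor` call is shown only as a comment since its signature is not given in the commit message.

```python
# Minimal sketch of a RunPod-style serverless handler (assumed structure).
import os
import tempfile
import urllib.request


def parse_event(event):
    """Validate the job payload and return (video_url, audio_url)."""
    inp = event.get("input", {})
    video_url = inp.get("video_url")
    audio_url = inp.get("audio_url")
    if not video_url or not audio_url:
        raise ValueError("Both 'video_url' and 'audio_url' are required")
    return video_url, audio_url


def download_file(url, suffix):
    """Download a remote file into a temporary path and return that path."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)
    urllib.request.urlretrieve(url, path)
    return path


def handler(event):
    """Entry point RunPod invokes for each job."""
    video_url, audio_url = parse_event(event)
    video_path = download_file(video_url, ".mp4")
    audio_path = download_file(audio_url, ".wav")
    # The existing Predictor class would run inference here, e.g.:
    # output = Predictor().predict(video=video_path, audio=audio_path)
    return {"video": video_path, "audio": audio_path}
```

Keeping URL validation separate from downloading makes the handler easy to exercise locally with a mock event, which is exactly what a later commit in this PR does.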
fix: Add curl to Dockerfile system dependencies

The previous Dockerfile was missing `curl`, which is required for downloading the `pget` tool during the image build. This resulted in a "curl: not found" error. This commit adds `curl` to the `apt-get install` command in the Dockerfile so it is available.
This commit addresses two main issues:
1. **Fixes `pget` 404 error during local testing:**
The `runpod_handler.py` script had a local testing block
that attempted to download a `CHANGELOG.md` file from a
URL that resulted in a 404 error. This was causing
RunPod's local test run to fail. The problematic URL in the
`mock_event` for `audio_url` has been changed to a valid one
(uses the README.md URL, same as `video_url` for test purposes).
2. **Prevents errors from existing symlinks:**
The `predict.py` script's `setup()` method attempted to create
symbolic links without checking if they already existed.
This could lead to errors if the script was run multiple
times or if the links were already present. The script has
been updated to check for the existence of these links
before attempting to create them.
These changes should improve the reliability of deploying and running this model on RunPod Serverless.
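The symlink guard in point 2 can be sketched as follows; `ensure_symlink` is a hypothetical helper name, since the commit message does not show the actual code in `setup()`:

```python
import os


def ensure_symlink(src, dst):
    """Create the symlink dst -> src only if nothing is already there.

    An unconditional os.symlink() raises FileExistsError when setup()
    runs more than once; checking first makes the call idempotent.
    """
    if os.path.islink(dst) or os.path.exists(dst):
        return  # link (or file) already present from a previous run
    os.symlink(src, dst)
```

Checking `os.path.islink` in addition to `os.path.exists` also covers a broken symlink, for which `exists()` returns False.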
This commit addresses the following issues:
1. **Fixes Whisper model loading:**
The `Audio2Feature` class was attempting to load the Whisper
model from a hardcoded local path (`checkpoints/whisper/tiny.pt`)
which does not exist. This caused a `RuntimeError`.
The `model_path` in `latentsync/whisper/audio2feature.py` has been
changed from `"checkpoints/whisper/tiny.pt"` to `"tiny"`. This allows
the Whisper library's `load_model` function to correctly identify
the model by name and handle its download and caching automatically.
2. **Improves error handling in inference:**
The `predict.py` script used `os.system()` to call the main
inference script (`scripts/inference.py`). This did not check the
exit code of the subprocess, potentially masking errors if the
inference script failed. `os.system()` has been replaced with
`subprocess.check_call()`, which will raise a `CalledProcessError`
if the inference script returns a non-zero exit code. This ensures
that failures during the actual model inference are properly
propagated and reported in the RunPod logs.
These changes should resolve the `RuntimeError: Model checkpoints/whisper/tiny.pt not found` and provide more robust error reporting.
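The `os.system()` to `subprocess.check_call()` swap in point 2 amounts to the following (a minimal sketch; `run_inference` is a hypothetical wrapper name):

```python
import subprocess


def run_inference(cmd):
    """Run the inference command and fail loudly on a non-zero exit code.

    os.system(cmd) returns the exit status, but nothing inspected it, so
    a crashed inference script looked like success. check_call raises
    CalledProcessError instead, which surfaces in the RunPod logs.
    """
    subprocess.check_call(cmd)
```

Passing the command as an argument list (rather than a shell string, as `os.system` requires) also avoids shell-quoting issues with paths.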
This commit addresses issues related to checkpoint and model loading:
1. **Ensure `setup_env.sh` Execution:**
The `Dockerfile` was modified to execute `setup_env.sh` during
the image build. This script downloads specific critical checkpoints
like `whisper/tiny.pt` and `latentsync_unet.pt` from Hugging Face
into the `checkpoints/` directory.
2. **Correct Whisper Model Path:**
The `latentsync/whisper/audio2feature.py` was updated to use
`model_path="checkpoints/whisper/tiny.pt"`, aligning with the file
downloaded by `setup_env.sh`.
3. **Ensure Main Model Archive (`model.tar`) is Always Downloaded:**
Modified `predict.py`'s `setup()` method to always call
`download_weights(MODEL_URL, MODEL_CACHE)`. This removes a
previous condition that skipped downloading `model.tar` if the
`checkpoints/` directory already existed (e.g., due to
`setup_env.sh`). This is crucial because `model.tar` contains
auxiliary models (e.g., for face detection, VGG) that are needed
for symbolic linking and are not covered by `setup_env.sh`.
The extraction of `model.tar` will occur after `setup_env.sh`
has placed its files, potentially overwriting some, which is
acceptable as `model.tar` is considered more comprehensive for
the auxiliary files.
4. **Path Confirmation and Error Handling Kept:**
- Confirmed `predict.py` uses the correct path for `latentsync_unet.pt`.
- Retained the improved error handling in `predict.py` that uses
`subprocess.check_call()` for more robust error reporting.
These changes aim to create a more reliable setup process where all
necessary model components and checkpoints are correctly downloaded and
placed for the application to run on RunPod Serverless.
This commit addresses an error during the Docker build process where `setup_env.sh` failed because `hf_transfer` was not available. The `huggingface-cli` attempts to use `hf_transfer` for faster downloads when `HF_HUB_ENABLE_HF_TRANSFER=1` is set (which it is in this Dockerfile). However, the `hf_transfer` package was not installed. This commit modifies the `Dockerfile` to include `hf_transfer` in the `pip install` command, ensuring it's available in the environment. This should allow `setup_env.sh` to execute successfully and download the necessary checkpoints using the faster transfer method.
This commit addresses a `RuntimeError: stack expects a non-empty TensorList` which was preceded by ffmpeg-related errors: "Unrecognized option 'crf'" and "Error: Could not open video." The errors indicated a problem with processing the input video.

The `latentsync/utils/util.py` file contained an `ffmpeg` command within its `read_video` function, specifically for changing the video's FPS to 25. This command included `-crf 18` as an output option. While `-crf` is a valid ffmpeg option for encoding, it is suspected that in the specific execution environment or ffmpeg version within the Docker container, this option (or its interaction with other options in that command) was causing ffmpeg to fail and not produce a valid temporary video file. This failure then led to `cv2.VideoCapture` not being able to open the video, resulting in no frames for face detection, and ultimately the `torch.stack` error.

This commit removes the `-crf 18` option from this specific intermediate ffmpeg command. FFmpeg will use its default quality settings for this temporary transcoding. The final video encoding in `lipsync_pipeline.py` still uses `-crf 18` appropriately. This change aims to allow the intermediate video processing to complete successfully, enabling proper video frame loading and subsequent face detection.
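The intermediate transcode command, without `-crf`, could be built roughly as shown below. The exact flags are an approximation, not the literal command in `latentsync/utils/util.py`; the point is the absence of `-crf 18`:

```python
def fps_convert_cmd(input_path, output_path, fps=25):
    """Build the intermediate 25-fps transcode command used by read_video.

    Note the absence of "-crf 18": the container's ffmpeg rejected that
    option, leaving no temporary file for cv2.VideoCapture to open.
    Default quality settings are used for this throwaway transcode.
    """
    return [
        "ffmpeg", "-y", "-loglevel", "error",
        "-i", input_path,
        "-r", str(fps),   # resample to 25 fps
        output_path,
    ]
```

A list like this would then be handed to `subprocess.check_call`, consistent with the error-handling change elsewhere in this PR.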
This commit addresses the "Unrecognized option 'crf'" error that occurred during the final ffmpeg command in the lipsync pipeline. The issue was likely caused by the Python script not running within the 'latentsync' conda environment, where a fully-featured ffmpeg was installed by `setup_env.sh`; instead, it was likely using the system's default ffmpeg, which may have been older or lacked certain features.

The `Dockerfile`'s `CMD` instruction has been modified to use `conda run -n latentsync --no-capture-output python runpod_handler.py`. This ensures that the `runpod_handler.py` script, and consequently any ffmpeg processes it invokes via `subprocess`, will use the ffmpeg version from the 'latentsync' conda environment, resolving the issues with unrecognized options like '-crf'.
This commit modifies `setup_env.sh` to address issues with `conda activate` not working as expected when the script is executed by Docker's `RUN` instruction. The `conda activate latentsync` line has been removed; instead, all commands that need to operate within the `latentsync` conda environment (such as `conda install`, `pip install`, and `huggingface-cli download`) are now prefixed with `conda run -n latentsync --no-capture-output`.

This change ensures that these commands correctly use the specified conda environment, leading to reliable installation of dependencies (including `cog` and the correct `ffmpeg` version from `conda-forge`) during the Docker build process. It should resolve previous errors related to `conda activate` failing and subsequent `ModuleNotFoundError` or incorrect `ffmpeg` versions being used.
This commit updates `setup_env.sh` to improve the reliability of
commands executed within the Conda environment during Docker builds.
Changes include:
1. Corrected the `conda install` syntax to use the `-n <envname>`
flag directly, instead of wrapping with `conda run`.
2. Modified `pip install` to be invoked via `python -m pip install ...`
within `conda run`. This ensures the `pip` associated with the
Conda environment's Python is used.
3. Modified `huggingface-cli download` to be invoked via
`python -m huggingface_hub.commands.cli download ...` within
`conda run`. This ensures the CLI commands from the `huggingface-hub`
package are correctly found and executed using the environment's Python.
These changes are intended to prevent "command not found" (exit code 127)
errors that can occur if shell activation or PATH issues prevent
executables like `pip` or `huggingface-cli` from being found directly
during scripted Conda operations in Docker.