inference: add CLI overrides for fps/total_pixels and vLLM memory knobs to prevent CUDA OOM in video runs #83
Conversation
Thanks! This is excellent information. This script was meant to be a starting example that you can copy and modify. We are looking into adding full-fledged scripts (e.g. offline batch inference, online server); for those, we will expose all config settings in the CLI.
Thanks @spectralflight for the review and the clear guidance! I've updated the PR accordingly. These overrides should still make it easier for users running on smaller GPUs to tune video workloads without modifying repo configs, while keeping Cosmos-Reason1's defaults intact. Happy to adjust anything else if needed! Best Regards
Awesome, this will be very useful! Looks like the PR is failing linting. Could you please run the lint recipe (requires just)?
@spectralflight Thanks for your suggestion. The pre-commit checks were failing; that's been fixed.
@spectralflight @jingyijin2 Do you have any other models? I'm now working on the Cosmos-Cookbook recipe. It would be great to merge this into the main repo to provide clear instructions for developers. Thanks and Best Regards.
Context
Running Cosmos-Reason1 VL on 24 GiB GPUs with video inputs can OOM during vLLM's profile run
inside Qwen2.5-VL's visual tower (see the trace in qwen2_5_vl.py → _process_video_input → self.visual).
This happens before generation, when allocations scale with #frames × frame area × hidden size.
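As a back-of-envelope sketch of that scaling (the frame count, pixel budget, patch size, and hidden width below are illustrative assumptions, not values measured in this repo):

```python
# Rough estimate of one fp16 activation tensor in the vision tower.
# Assumptions: Qwen2.5-VL-style ViT with 14x14 patches and ~1280 hidden width;
# real usage is higher once attention buffers and multiple layers are counted.
frames = 32                  # sampled frames (fps x clip length)
total_pixels = 4_000_000     # per-frame pixel budget before any reduction
patch = 14
hidden = 1280
bytes_fp16 = 2

tokens_per_frame = total_pixels // (patch * patch)
one_tensor = frames * tokens_per_frame * hidden * bytes_fp16
print(f"~{one_tensor / 2**30:.1f} GiB for a single activation tensor")  # ~1.6 GiB
```

Because the estimate is linear in both frames and total_pixels, halving either roughly halves the allocation, which is why the overrides below are effective against profile-run OOMs.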
What this PR changes (code)
These changes let you adjust video sampling and leave headroom for the vision tower without touching repo configs.
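A minimal sketch of the override pattern, assuming argparse on top of the script's existing defaults (the flag names and the config placeholder are illustrative, not the exact code in this PR):

```python
import argparse

# Placeholder defaults standing in for the repo's config values; the real
# script would read these from its existing configuration instead.
config = {
    "fps": 2.0,
    "total_pixels": 8_000_000,
    "gpu_memory_utilization": 0.9,
    "max_model_len": 32768,
}

parser = argparse.ArgumentParser()
parser.add_argument("--fps", type=float, default=None,
                    help="override video sampling fps")
parser.add_argument("--total-pixels", type=int, default=None,
                    help="override per-frame pixel budget")
parser.add_argument("--gpu-memory-utilization", type=float, default=None,
                    help="vLLM memory fraction; lower it to leave vision-tower headroom")
parser.add_argument("--max-model-len", type=int, default=None,
                    help="vLLM max context length")
args = parser.parse_args()

# Apply only the flags the user actually set, keeping the defaults intact.
config.update({k: v for k, v in vars(args).items() if v is not None})
```

Making every flag default to None keeps Cosmos-Reason1's shipped values authoritative unless the user explicitly asks for a change.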
Not included (operational guidance only; no code/config committed)
fps: 1
total_pixels: 2_000_000
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
pass env=os.environ when spawning subprocesses, so the allocator setting propagates
How to test
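For example (the entry-point name and exact flag spellings are assumptions based on the description above; substitute the repo's actual script):

```bash
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
python inference.py \
    --fps 1 \
    --total-pixels 2000000 \
    --gpu-memory-utilization 0.8 \
    --max-model-len 8192
```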
Expected:
Model loads without OOM.
vLLM’s profile run succeeds.
Generation returns an answer and optional reasoning.
Rationale
The OOM occurs when the vision tower processes the sampled video frames. Providing lightweight knobs
(fps/total_pixels) plus vLLM headroom (gpu_memory_utilization/max_model_len) makes the pipeline usable
on common 24 GiB GPUs. Sometimes it is also necessary to reduce fps and total_pixels in the
configuration files, as sketched below.
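A sketch of what such a reduction could look like (the file layout and nesting are assumptions; only the fps and total_pixels keys come from the guidance above):

```yaml
# Hypothetical video-inference config; the actual file in the repo may differ.
video:
  fps: 1                 # lower sampling rate -> fewer frames
  total_pixels: 2000000  # per-frame pixel budget (2_000_000)
```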