Conversation

@paularamo paularamo commented Nov 14, 2025

Context

Running Cosmos-Reason1 VL on 24 GiB GPUs with video inputs can OOM during vLLM's profile run
inside Qwen2.5-VL's visual tower (see trace in qwen2_5_vl.py → _process_video_input → self.visual).
This happens before generation when allocations scale with #frames × frame area × hidden size.
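A rough back-of-envelope sketch of why this scales badly (illustrative numbers only; the hidden size and patch size below are assumptions, and the real profile-run footprint depends on vLLM internals and layer count):

```python
# Estimate the size of ONE vision-tower activation tensor of shape
# (frames * patches_per_frame, hidden_size) held in fp16/bf16.
def estimate_vision_activation_gib(num_frames: int, pixels_per_frame: int,
                                   hidden_size: int = 1280,     # assumed
                                   patch_pixels: int = 14 * 14,  # assumed 14x14 patches
                                   bytes_per_elem: int = 2) -> float:
    patches = num_frames * pixels_per_frame // patch_pixels
    return patches * hidden_size * bytes_per_elem / 1024**3

# 60 s of video sampled at 2 fps, ~2M pixels per frame:
mem_gib = estimate_vision_activation_gib(num_frames=120, pixels_per_frame=2_000_000)
```

Even a single such tensor lands in the multi-GiB range, and the profile run allocates several of them at once, which is why lowering fps and total_pixels helps directly.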

What this PR changes (code)

  • Add CLI overrides to avoid editing YAMLs:
    • --fps
    • --vision-total-pixels
  • Add vLLM memory-safety knobs:
    • --gpu-memory-utilization (default 0.70)
    • --max-model-len (default 2048)
  • Apply overrides on top of loaded vision_config, with schema re-validation
  • Pass the memory knobs through to vllm.LLM

These changes let you adjust video sampling and leave headroom for the vision tower without touching repo configs.
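The override flow can be sketched like this (a minimal illustration, not the repo's actual API: `VisionConfig`, `apply_overrides`, and the default values are hypothetical):

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class VisionConfig:
    fps: float = 2.0                 # placeholder default
    total_pixels: int = 6_000_000    # placeholder default

    def validate(self) -> "VisionConfig":
        # Schema re-validation after overrides are applied.
        if self.fps <= 0 or self.total_pixels <= 0:
            raise ValueError("fps and total_pixels must be positive")
        return self

def apply_overrides(cfg: VisionConfig, fps=None, total_pixels=None) -> VisionConfig:
    # Override only the fields the user actually passed, then re-validate.
    updates = {k: v for k, v in {"fps": fps, "total_pixels": total_pixels}.items()
               if v is not None}
    return replace(cfg, **updates).validate()

cfg = apply_overrides(VisionConfig(), fps=1.0, total_pixels=2_000_000)
# The memory knobs are then forwarded to vLLM, e.g.:
# llm = vllm.LLM(model=..., gpu_memory_utilization=0.70, max_model_len=2048)
```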

Not included (operational guidance only; no code/config committed)

  • Vision config values used in our runs:
    • fps: 1
    • total_pixels: 2_000_000
  • Environment variable to reduce fragmentation:
    • PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
  • Notebook/runtime fixes:
    • Ensure the env var is present in the subprocess (env=os.environ)
    • Avoid duplicated/glued CLI flags when building command strings
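The two notebook-side fixes above can be sketched as follows (paths and flag values are placeholders; the key points are passing the parent environment through and building argv as a list so flags never get glued together):

```python
import os
import subprocess

# Start from the parent environment so PYTORCH_CUDA_ALLOC_CONF survives
# into the subprocess, then set it explicitly for good measure.
env = dict(os.environ)
env["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# Build the command as a list, not by string concatenation, to avoid
# duplicated or glued flags like "--fps1--max-model-len1536".
cmd = [
    "python", "scripts/inference.py",
    "--videos", "assets/sample.mp4",
    "--gpu-memory-utilization", "0.70",
    "--max-model-len", "1536",
]
# subprocess.run(cmd, env=env, check=True)  # uncomment to actually launch
```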

How to test

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
python scripts/inference.py \
  --prompt prompts/question.yaml \
  --question "What are the potential safety hazards?" \
  --reasoning \
  --videos assets/sample.mp4 \
  --vision-config configs/vision_config.yaml \
  --gpu-memory-utilization 0.70 \
  --max-model-len 1536 \
  -v

Expected:

Model loads without OOM.
vLLM’s profile run succeeds.
Generation returns an answer and optional reasoning.

Rationale

The OOM occurs when the vision tower processes the sampled video frames. Providing lightweight knobs
(fps/total_pixels) plus vLLM headroom (gpu_memory_utilization/max_model_len) makes the pipeline usable
on common 24 GiB GPUs. In some cases, fps and total_pixels still need to be reduced in the configuration files as well.

@pjannaty pjannaty requested a review from jingyijin2 November 18, 2025 00:07
@spectralflight spectralflight self-requested a review November 18, 2025 03:47
@spectralflight
Collaborator

Thanks! This is excellent information.

This script was meant to be a starting example that you can copy and modify. We are looking into adding full-fledged scripts (e.g. offline batch inference, online server). For those, we will expose all config settings in the CLI via tyro.

@paularamo
Author

Thanks @spectralflight for the review and the clear guidance!

I’ve updated the PR accordingly:

  • Removed the temporary debug printouts, and
  • Set the memory-related CLI parameters to default to None so we preserve the existing behavior, as suggested.
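The "default to None" pattern looks roughly like this (flag names follow the PR; the surrounding parser setup is illustrative, not the repo's exact code):

```python
import argparse

parser = argparse.ArgumentParser()
# None means "not set by the user"; vLLM's own defaults apply in that case.
parser.add_argument("--gpu-memory-utilization", type=float, default=None)
parser.add_argument("--max-model-len", type=int, default=None)
args = parser.parse_args(["--max-model-len", "1536"])

# Forward only the knobs the user explicitly set, preserving the
# existing behavior for everything else.
llm_kwargs = {k: v for k, v in {
    "gpu_memory_utilization": args.gpu_memory_utilization,
    "max_model_len": args.max_model_len,
}.items() if v is not None}
# llm = vllm.LLM(model=..., **llm_kwargs)
```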

These overrides should still make it easier for users running on smaller GPUs to tune video workloads without modifying repo configs, while keeping Cosmos-Reason1’s defaults intact. Happy to adjust anything else if needed!

Best Regards

@spectralflight
Collaborator

Awesome, this will be very useful! Looks like the PR is failing linting. Could you please run (requires just):

just lint
just test

@paularamo
Author

@spectralflight Thanks for the suggestion. The pre-commit checks were failing because reasoning.txt was missing its end-of-file newline and scripts/inference.py had a shebang but was not marked executable. I added the missing newline and made the script executable; after these changes, the pre-commit hooks pass as expected. Thanks for your support on this PR.

@paularamo
Author

@spectralflight @jingyijin2 Do you have any other models? I am now working on the Cosmos-Cookbook recipe. It would be great to merge this into the main repo to provide clear instructions for developers. Thanks and best regards.

