Llava-OV video inference #1330
Hi, thanks for open-sourcing this excellent work. It looks like multi-frame inference is forced (without pooling the visual features) whenever a video has fewer than 16 frames: see sglang/python/sglang/srt/models/llava.py, line 293 at commit 12cb115. As a result, the input context length is several times larger than it should be for video. Judging by the descriptions the model produces, it also seems to interpret the frames as separate images rather than as a video.
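To make the context-length concern concrete, here is a rough sketch of the token arithmetic. It is not taken from the sglang code: the helper `visual_token_count` is hypothetical, and the 27x27 patch grid per frame and the pooling stride of 2 are assumptions about a typical LLaVA-OneVision configuration, so the exact numbers may differ for a given checkpoint.

```python
import math


def visual_token_count(num_frames, grid=27, pool_stride=2, pooled=True):
    """Estimate visual tokens for a clip (hypothetical helper).

    grid and pool_stride are assumed values, not read from any model config.
    Rounding up after pooling is also an assumption; some implementations
    round down instead.
    """
    side = math.ceil(grid / pool_stride) if pooled else grid
    return num_frames * side * side


# An 8-frame clip, which falls under the 16-frame threshold described above.
video_path_tokens = visual_token_count(8, pooled=True)
multi_image_path_tokens = visual_token_count(8, pooled=False)
ratio = multi_image_path_tokens / video_path_tokens
print(video_path_tokens, multi_image_path_tokens, round(ratio, 2))
```

Under these assumptions the unpooled multi-image path uses several times as many visual tokens as the pooled video path, which is consistent with the inflated context length described above.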
Replies: 1 comment
Fixed by #1346.