Llava-OV video inference #1330

ehayeshaiper · 2024-09-04T13:41:29Z


Hi, thanks for open sourcing this excellent work.
I have a concern about video inference on short videos (16 frames or fewer):

It looks like we force multi-frame (multi-image) inference, without pooling the visual features, whenever the number of video frames is 16 or fewer:

sglang/python/sglang/srt/models/llava.py, line 293 in 12cb115:

    if image_feature.shape[0] > 16: # video

This makes the input context several times longer than it should be for video. Judging from the model's descriptions, it also seems to interpret the frames as separate images rather than as a video.
Can we confirm what the correct behaviour is for short videos?
Thanks
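
To make the concern concrete, here is a minimal sketch of how the visual token count changes on either side of the threshold. Only the `> 16` check comes from the line quoted above. The 27x27 token grid per frame, the stride-2 pooling on the video path, and the function names are assumptions for illustration, not taken from the sglang code.

```python
import math

# From the quoted check: image_feature.shape[0] > 16 is treated as video.
VIDEO_FRAME_THRESHOLD = 16


def is_treated_as_video(num_frames: int, threshold: int = VIDEO_FRAME_THRESHOLD) -> bool:
    """Mirror the quoted condition: strictly more than `threshold` frames."""
    return num_frames > threshold


def visual_tokens(
    num_frames: int,
    tokens_per_side: int = 27,  # assumption: 27x27 tokens per frame
    pool_stride: int = 2,       # assumption: stride-2 spatial pooling for video
    threshold: int = VIDEO_FRAME_THRESHOLD,
) -> int:
    """Estimate visual tokens for a clip (hypothetical helper)."""
    if is_treated_as_video(num_frames, threshold):
        # Video path: each frame's token grid is pooled before being flattened.
        side = math.ceil(tokens_per_side / pool_stride)
    else:
        # Multi-image path: every frame keeps its full, unpooled token grid.
        side = tokens_per_side
    return num_frames * side * side


if __name__ == "__main__":
    for frames in (8, 16, 17, 32):
        kind = "video" if is_treated_as_video(frames) else "multi-image"
        print(f"{frames:>2} frames -> {kind:<11} {visual_tokens(frames)} visual tokens")
```

Under these assumptions an 8-frame clip costs more visual tokens than a 17-frame clip, because the short clip skips pooling entirely.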


merrymercy · 2024-09-10T07:33:39Z

Maintainer

fixed by #1346
