@Yangsenqiao First of all, this is an excellent piece of work, and I really appreciate you making the code publicly available.
I noticed that in your demonstration of how VisionZip improves video understanding speed, you used LLAVA_OneVision. However, I encountered some compatibility issues when deploying it locally. For example:
- The prepare_inputs_labels_for_multimodal_visionzip function does not accept the **modalities** parameter as input, whereas the corresponding function in LLAVA includes this parameter.
- LLAVA_OneVision uses the SigLIP vision encoder (corresponding to **SigLipVisionTower**), but the code you provided is only adapted for **CLIPVisionTower**.
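To illustrate the first point in isolation, here is a minimal sketch of the signature mismatch and a stopgap for it. Everything in it is a hypothetical stand-in: `prepare_inputs_stub` and `drop_unsupported_kwargs` are names I made up for illustration, and the stub's parameters are assumptions, not the actual VisionZip or LLaVA signatures.

```python
import inspect
from functools import wraps


def drop_unsupported_kwargs(func):
    """Wrap func so keyword arguments it does not declare are dropped."""
    params = inspect.signature(func).parameters
    # If func already takes **kwargs, nothing needs filtering.
    accepts_var_kw = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )

    @wraps(func)
    def wrapper(*args, **kwargs):
        if not accepts_var_kw:
            kwargs = {k: v for k, v in kwargs.items() if k in params}
        return func(*args, **kwargs)

    return wrapper


# Hypothetical stand-in whose signature lacks `modalities`
# (NOT the real prepare_inputs_labels_for_multimodal_visionzip).
def prepare_inputs_stub(input_ids, images, image_sizes=None):
    return {"input_ids": input_ids, "images": images, "image_sizes": image_sizes}


patched = drop_unsupported_kwargs(prepare_inputs_stub)

# A caller that passes `modalities` no longer raises TypeError.
result = patched([1, 2, 3], ["frame0"], image_sizes=None, modalities=["video"])
```

This only gets past the `TypeError`; the wrapped function still ignores the modality, so video inputs would not be handled differently from images, and it does nothing for the SigLIP issue in the second point.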
Could you please share the code related to LLAVA_OneVision that you used in the demonstration?