
support llava_onevision #7

Open
defaultak01 opened this issue Dec 13, 2024 · 2 comments

Comments

@defaultak01

No description provided.

@defaultak01 changed the title from "support llava" to "support llava_onevision" on Dec 13, 2024

defaultak01 commented Dec 13, 2024

@Yangsenqiao First of all, this is an excellent piece of work, and I really appreciate you making the code publicly available.

I noticed that in your demonstration of how VisionZip speeds up video understanding, you used LLaVA-OneVision. However, I ran into some compatibility issues when deploying it locally. For example:

  1. The prepare_inputs_labels_for_multimodal_visionzip function does not accept a **modalities** parameter, whereas the corresponding function in LLaVA-OneVision does.
  2. LLaVA-OneVision uses the SigLIP vision encoder (**SigLipVisionTower**), but the released code is only adapted for **CLIPVisionTower**.

Could you please share the code related to LLAVA_OneVision that you used in the demonstration?
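For point 1, a minimal sketch of one possible workaround, assuming the patched function can simply accept and forward the extra keyword. All names here mirror the issue's description and are illustrative only, not the actual VisionZip or LLaVA-OneVision code:

```python
# Illustrative sketch only: shows one way a patched
# prepare_inputs_labels_for_multimodal function could tolerate the extra
# `modalities` keyword that LLaVA-OneVision passes in. The function body
# is a stand-in, not the real VisionZip implementation.

def prepare_inputs_labels_for_multimodal_visionzip(input_ids, images, **kwargs):
    # LLaVA-OneVision passes modalities such as ["image"] or ["video"] per
    # sample; accept it via **kwargs and default to "image" when absent.
    modalities = kwargs.pop("modalities", ["image"] * len(images))
    # ... VisionZip token pruning would happen here ...
    return {"input_ids": input_ids, "images": images, "modalities": modalities}

# Call with the extra keyword, as LLaVA-OneVision would:
out = prepare_inputs_labels_for_multimodal_visionzip(
    [101, 102], ["frame_0"], modalities=["video"]
)
print(out["modalities"])  # → ['video']
```

This only papers over the signature mismatch in point 1; point 2 (adapting the pruning logic to SigLipVisionTower) would still need changes in the vision-tower code itself.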

@effortprogrammer

+1
