
Extracting Image-Text Fusion Features from Fine-Tuned LLaMA 3.2-Vision Architecture #1464

Linn0910 opened this issue Dec 22, 2024 · 1 comment


@Linn0910

Hello!
I am currently working with a fine-tuned LLaMA 3.2-Vision model and am interested in extracting image-text fusion features for downstream tasks. Specifically, I would like to know whether these fusion features can be extracted from the current architecture, or whether additional modifications would be required.

Here are some details about my setup:

  • I have already fine-tuned the LLaMA 3.2-Vision model with Unsloth for specific tasks such as image captioning in my project.

  • I aim to extract features that represent both the image and its corresponding textual description, as this would be useful for further multimodal processing.

Could you provide any guidance on:

  1. How to access or extract the image-text fusion features from the existing model?

  2. If modifications to the current architecture are necessary, what would you recommend?

  3. Any examples or references to relevant code that could assist in this process? (A rough sketch of the direction I was considering is included below.)
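
To make the question more concrete, here is a rough, untested sketch of what I had in mind, using the plain transformers API on a merged checkpoint rather than anything Unsloth-specific. The checkpoint path, image, prompt, and mean-pooling choice are all placeholders:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

# Placeholder path to the fine-tuned checkpoint (LoRA weights merged and saved).
model_id = "path/to/finetuned-llama-3.2-11b-vision"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # placeholder image
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states[-1] has shape (batch, seq_len, hidden_dim) and comes from the last
# decoder layer, i.e. after the cross-attention blocks have mixed in the image features.
last_hidden = outputs.hidden_states[-1]
fusion_feature = last_hidden.mean(dim=1)  # naive mean pooling over the sequence
print(fusion_feature.shape)
```

Would the decoder hidden states after the cross-attention layers count as usable image-text fusion features, or is there a better place in the architecture to tap into?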

Thank you for your time and help!

Best regards!

@shimmyshimmer
Collaborator

I would recommend asking in our Discord server if you'd like more customized help: https://discord.com/invite/unsloth
