Description
Hello!
I am currently working with the fine-tuned LLaMA 3.2-Vision model in my project and I'm interested in extracting image-text fusion features for downstream tasks. Specifically, I would like to know if it's possible to extract these fusion features from the current architecture or if additional modifications would be required.
Here are some details about my setup:
- I have already fine-tuned the LLaMA 3.2-Vision model with Unsloth for specific tasks like image captioning in my project (a rough sketch of my setup is included below).
- I aim to extract features that represent both the image and its corresponding textual description, as this would be useful for further multimodal processing.
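For context, this is roughly how I load the fine-tuned checkpoint with Unsloth; the checkpoint path is a placeholder and the exact kwargs may differ slightly from my actual script:

```python
from unsloth import FastVisionModel

# Load my fine-tuned Llama 3.2-Vision checkpoint (path is a placeholder).
model, tokenizer = FastVisionModel.from_pretrained(
    "my-finetuned-llama-3.2-vision",
    load_in_4bit=True,
)
FastVisionModel.for_inference(model)  # switch the adapters/model to inference mode
```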
Could you provide any guidance on:
- How to access or extract the image-text fusion features from the existing model? (A minimal sketch of what I have in mind is included below.)
- If modifications to the current architecture are necessary, what would you recommend?
- Any examples or references to relevant code that could assist in this process?
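To make the question concrete, here is a minimal sketch of what I currently have in mind: running a plain forward pass with `output_hidden_states=True` and mean-pooling the last hidden state of the language model after it has attended to the image through cross-attention. The model ID, prompt, and pooling choice are just my assumptions rather than anything from the docs, so I'd appreciate guidance on whether this is the right place to tap the fused representation.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

# Base model used here for illustration; in practice I would point this at my
# fine-tuned checkpoint instead.
model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("example.jpg")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Mean-pool the final-layer hidden states over the sequence as a crude
# image-text "fusion" embedding: shape [batch, hidden_size].
fusion_features = outputs.hidden_states[-1].mean(dim=1)
print(fusion_features.shape)
```

Is pooling the last hidden state a reasonable way to get a joint image-text feature here, or is there a better layer (e.g. a specific cross-attention block) to extract from?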
Thank you for your time and help!
Best regards!