Hello!
I am currently working with the fine-tuned LLaMA 3.2-Vision model in my project and I'm interested in extracting image-text fusion features for downstream tasks. Specifically, I would like to know if it's possible to extract these fusion features from the current architecture or if additional modifications would be required.
Here are some details about my setup:
I have already fine-tuned the LLaMA 3.2-Vision model with Unsloth for specific tasks such as image captioning in my project.
I aim to extract features that represent both the image and its corresponding textual description, as this would be useful for further multimodal processing.
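To make the goal concrete, here is a rough sketch of the kind of feature I have in mind. It assumes the model can return per-token hidden states for an image-plus-text input (for example via `output_hidden_states=True` in Hugging Face Transformers, which I have not verified for this model), and that those states already mix in image information through the cross-attention layers, as I understand the architecture. The function name `masked_mean_pool` and the toy values are hypothetical, and plain Python lists stand in for tensors so the snippet runs on its own.

```python
from typing import List, Sequence


def masked_mean_pool(hidden_states: Sequence[Sequence[float]],
                     attention_mask: Sequence[int]) -> List[float]:
    """Average the per-token vectors whose mask entry is non-zero.

    hidden_states: one vector per token (hypothetical stand-in for the
                   last-layer hidden states of a single sequence).
    attention_mask: 1 for real tokens, 0 for padding.
    """
    if len(hidden_states) != len(attention_mask):
        raise ValueError("hidden_states and attention_mask differ in length")
    # Keep only the vectors of non-padding tokens.
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    # Component-wise mean over the kept tokens.
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy example: 3 tokens with 2-dimensional states; the last token is padding.
toy_hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]
fused = masked_mean_pool(toy_hidden, toy_mask)
```

Mean pooling is only one possible choice here. Taking the last non-padding token's state would be another, and I do not know which one suits this model better.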
Could you provide any guidance on:
How to access or extract the image-text fusion features from the existing model?
If modifications to the current architecture are necessary, what would you recommend?
Any examples or references to relevant code that could assist in this process?
Thank you for your time and help!
Best regards!