Description
Hello!
I am currently working with the fine-tuned LLaMA 3.2-Vision model in my project and I'm interested in extracting image-text fusion features for downstream tasks. Specifically, I would like to know if it's possible to extract these fusion features from the current architecture or if additional modifications would be required.
Here are some details about my setup:
- I have already fine-tuned the LLaMA 3.2-Vision model with Unsloth for specific tasks like image captioning in my project (a rough sketch of my setup is included below).
- I aim to extract features that represent both the image and its corresponding textual description, as this would be useful for further multimodal processing.
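For context, this is roughly how I load the fine-tuned checkpoint with Unsloth; the checkpoint path is a placeholder and the exact kwargs may differ slightly from my actual script:

```python
from unsloth import FastVisionModel

# Load my fine-tuned Llama 3.2-Vision checkpoint (path is a placeholder).
model, tokenizer = FastVisionModel.from_pretrained(
    "my-finetuned-llama-3.2-vision",
    load_in_4bit=True,
)
FastVisionModel.for_inference(model)  # switch the adapters/model to inference mode
```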
Could you provide any guidance on:
- How to access or extract the image-text fusion features from the existing model? (A minimal sketch of what I have in mind is included below.)
- If modifications to the current architecture are necessary, what would you recommend?
- Any examples or references to relevant code that could assist in this process?
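To make the question concrete, here is a minimal sketch of what I currently have in mind: running a plain forward pass with `output_hidden_states=True` and mean-pooling the last hidden state of the language model after it has attended to the image through cross-attention. The model ID, prompt, and pooling choice are just my assumptions rather than anything from the docs, so I'd appreciate guidance on whether this is the right place to tap the fused representation.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

# Base model used here for illustration; in practice I would point this at my
# fine-tuned checkpoint instead.
model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("example.jpg")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Mean-pool the final-layer hidden states over the sequence as a crude
# image-text "fusion" embedding: shape [batch, hidden_size].
fusion_features = outputs.hidden_states[-1].mean(dim=1)
print(fusion_features.shape)
```

Is pooling the last hidden state a reasonable way to get a joint image-text feature here, or is there a better layer (e.g. a specific cross-attention block) to extract from?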
Thank you for your time and help!
Best regards!