
Extracting Image-Text Fusion Features from Fine-Tuned LLaMA 3.2-Vision Architecture #1464


Description

@Linn0910

Hello!
I am currently working with a LLaMA 3.2-Vision model that I fine-tuned for my project, and I'm interested in extracting image-text fusion features for downstream tasks. Specifically, I would like to know whether these fusion features can be extracted from the current architecture as-is, or whether additional modifications would be required.

Here are some details about my setup:

  • I have already fine-tuned the LLaMA 3.2-Vision model with Unsloth for specific tasks such as image captioning (how I reload the checkpoint is sketched after this list).

  • I aim to extract features that represent both the image and its corresponding textual description, as this would be useful for further multimodal processing.
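For reference, this is roughly how I reload the fine-tuned checkpoint at the moment. The path is just a placeholder for my local save directory, and `load_in_4bit` matches how I trained:

```python
# Reload the Unsloth fine-tune and switch it to inference mode.
# The checkpoint path below is a placeholder for my local save directory.
from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    "outputs/llama-3.2-vision-finetuned",  # hypothetical local path
    load_in_4bit=True,
)
FastVisionModel.for_inference(model)  # disable training-only code paths
```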

Could you provide any guidance on:

  1. How to access or extract the image-text fusion features from the existing model? (My current attempt is sketched after this list.)

  2. If modifications to the current architecture are necessary, what would you recommend?

  3. Any examples or references to relevant code that could assist in this process?
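For context, here is my current best guess, based on the Hugging Face `transformers` Mllama implementation: since the image features enter the language model through cross-attention layers, the decoder hidden states after a joint image-plus-text forward pass should already be fused representations. The checkpoint path and the mean-pooling step below are my own placeholders, so please correct me if this is the wrong way to read the architecture:

```python
# Sketch: one joint forward pass, then treat the decoder hidden states as
# image-text fusion features (the image is injected via cross-attention).
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "my-org/llama-3.2-11b-vision-finetuned"  # hypothetical checkpoint
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image."},
]}]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, text, add_special_tokens=False,
                   return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# outputs.hidden_states holds one (batch, seq_len, hidden_dim) tensor per
# decoder layer; later layers have already cross-attended to the image.
# I take the last layer and mean-pool over the sequence as one fusion vector.
fusion = outputs.hidden_states[-1].mean(dim=1)  # (batch, hidden_dim)
print(fusion.shape)
```

If mean-pooling the last layer is a poor choice here, or if there is a more direct hook for the cross-attention outputs, a pointer would be very welcome.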

Thank you for your time and help!

Best regards!
