BEIT-3-Large - Layer fusion #38

Open
MarcoForte opened this issue Nov 15, 2024 · 2 comments
@MarcoForte

MarcoForte commented Nov 15, 2024

Hi, thanks for your great work exploring BEIT as an alternative to CLIP.

I find it very well motivated in the paper, but I'm struggling to reproduce the BEIT3 results in my independent training codebase.
So far I can match or surpass the CLIP results, and the addition of CLIP_Image in Late Concat is beneficial.

However, so far BEIT3 underperforms CLIP, so I'm wondering if I'm missing something.

For your BEIT experiments, what do you mean by Late Concat, Early(L1-L12), and Early(L1-L24)? I can't find any reference to these in the code, nor in the beit or torchscale repos. If you could share a code sample, it would really help articulate your point.

Thank you for your time.

@CoderZhangYx
Collaborator

Hi, thank you for reproducing our work!
Our BEIT experiments are meant to show the effectiveness of "early fusion". "Late" means we use beit3 to extract separate single-modal features and concatenate them. "Early(L1-L12)" means we enable cross-modal attention only in layers 1-12 of beit3. "Early(L1-L24)" means we enable cross-modal attention in all layers of beit3, which is the original beit3. We implement this by manually adding a masked attention map in the beit3 source code.
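To make the masking concrete, here is a minimal sketch of the idea (illustrative only, not the actual patch; it assumes image and text tokens are concatenated into a single sequence as in torchscale's MultiwayTransformer, and the token counts and layer split are placeholders):

```python
# Illustrative sketch: restrict BEIT-3 cross-modal attention to a subset of layers
# by adding an additive attention mask. "Late Concat" corresponds to allow_cross=False
# in every layer (each modality is encoded separately, then the features are concatenated).
import torch

def build_cross_modal_mask(num_img_tokens, num_txt_tokens, allow_cross, device=None):
    """Additive attention mask of shape (L, L), where L = num_img_tokens + num_txt_tokens.

    0.0   -> attention allowed
    -inf  -> attention blocked
    With allow_cross=False, image tokens only attend to image tokens and
    text tokens only attend to text tokens (no cross-modal fusion in that layer).
    """
    total = num_img_tokens + num_txt_tokens
    if allow_cross:
        return torch.zeros(total, total, device=device)
    mask = torch.full((total, total), float("-inf"), device=device)
    mask[:num_img_tokens, :num_img_tokens] = 0.0   # image -> image
    mask[num_img_tokens:, num_img_tokens:] = 0.0   # text  -> text
    return mask

# "Early(L1-L12)" on a 24-layer model: cross-modal attention only in the first
# 12 layers, single-modal attention in the remaining layers.
num_layers, cross_modal_layers = 24, 12
layer_masks = [
    build_cross_modal_mask(num_img_tokens=197, num_txt_tokens=64,
                           allow_cross=(i < cross_modal_layers))
    for i in range(num_layers)
]
# Each layer's mask is then added to that layer's attention logits (e.g. passed as
# the attention mask argument of the layer's self-attention) in the modified
# beit3 forward pass.
```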
If you can provide part of your training codebase (either in this issue or by email), I can help you look for problems and fix bugs. However, due to company policy, I cannot directly upload our training codebase.

@MarcoForte
Author

@CoderZhangYx Thank you for the swift reply!
You've cleared up a good deal of confusion for me.
I'm not sure I'll be able to share code, but it's great to have that option.

For your experiments with CLIP, did you also unfreeze the model?
