Hi, thanks for your great work exploring BEiT as an alternative to CLIP.
I find it very well motivated in the paper, but I'm struggling to reproduce the BEiT-3 results in my independent training codebase.
So far I can match or surpass the CLIP results, and the addition of CLIP_Image in Late Concat is beneficial.
However, so far BEiT-3 underperforms CLIP, so I'm wondering if I'm missing something.
For your BEiT experiments, what do you mean by Late Concat, Early(L1-L12), and Early(L1-L24)? I can't find a reference to these in the code, nor in the beit or torchscale repos. If you could share a code sample, it would really help to articulate your point.
Thank you for your time.
Hi, thank you for your interest in reproducing our work!
Our BEiT experiments are meant to demonstrate the effectiveness of early fusion. "Late Concat" means we use BEiT-3 to extract separate single-modal features and then concatenate them. "Early(L1-L12)" means we enable cross-modal attention only in layers 1-12 of BEiT-3, while "Early(L1-L24)" means we enable cross-modal attention in all 24 layers, which is the original BEiT-3. We implement this by manually adding a masked attention map in the BEiT-3 source code.
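For reference, here is a minimal sketch of that masking idea. This is not the actual BEiT-3 patch; the function name and token counts below are illustrative assumptions. The additive mask blocks cross-modal attention by keeping only the within-modality blocks at zero:

```python
import torch

def build_attn_mask(num_img_tokens: int, num_txt_tokens: int,
                    cross_modal: bool) -> torch.Tensor:
    """Additive attention mask for one layer, for inputs laid out as
    [image tokens; text tokens]. With cross_modal=False, image tokens
    attend only to image tokens and text tokens only to text tokens
    (block-diagonal mask); with cross_modal=True, nothing is masked."""
    n = num_img_tokens + num_txt_tokens
    if cross_modal:
        return torch.zeros(n, n)                   # all positions visible
    mask = torch.full((n, n), float("-inf"))       # block everything ...
    mask[:num_img_tokens, :num_img_tokens] = 0.0   # ... except image -> image
    mask[num_img_tokens:, num_img_tokens:] = 0.0   # ... and text -> text
    return mask

# "Early(L1-L12)" on a 24-layer model: cross-modal attention enabled in
# the first 12 layers, per-modality attention in the remaining 12.
masks = [build_attn_mask(197, 64, cross_modal=(layer < 12))
         for layer in range(24)]
```

The mask for each layer is added to the attention logits before the softmax, so "Late Concat" corresponds to using the block-diagonal mask in every layer and concatenating the resulting single-modal features at the end.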
If you can share part of your training codebase (either in this issue or by email), I can help you look for problems and fix bugs. Due to company policy, however, I cannot directly upload our training codebase.
@CoderZhangYx Thank you for the swift reply!
You've cleared up a good deal of confusion for me.
Not sure I'll be able to share code, but great to have that option.
For your experiments with CLIP, did you also unfreeze the model?