The performance of adopting the Video Swin Transformer as offline-extracted video encoder? #24

Open
3DMM-ICME2022 opened this issue Jul 6, 2022 · 2 comments

3DMM-ICME2022 commented Jul 6, 2022

Hi, thanks for your nice work, @kevinlin311tw!

However, could you please report the performance of adopting the Video Swin Transformer as an offline-extracted video encoder?

In Table 2, the other methods use C3D or I3D, while yours uses the Video Swin Transformer. That is not a fair comparison, right?

@JoseponLee

I have the same question about the performance of adopting the Video Swin Transformer as an offline-extracted video encoder. Maybe the authors can provide the 3D features extracted with the Video Swin Transformer.

kevinlin311tw commented Sep 27, 2022

image
In the appendix, we show a comparison in which both approaches use the same SlowFast backbone for feature extraction. Our method achieves better performance than VALUE.

We also compare with the full version of VALUE, which uses both CLIP-ViT and SlowFast as backbones. Although our video backbone uses less pre-training data than VALUE, we achieve better caption performance.
