I have the same question about the performance when the Video Swin Transformer is used as an offline video encoder for feature extraction. Maybe the authors could provide the 3D features extracted with the Video Swin Transformer?
In the appendix, we show a comparison where both approaches use the same SlowFast as the backbone for feature extraction. Our method achieves better performance than VALUE.
We also compare with the full version of VALUE, which uses both CLIP-ViT and SlowFast as backbones. Although our video backbone uses less pre-training data than VALUE, we achieve better caption performance.
Hi, thanks for your nice work! @kevinlin311tw
However, could you please report the performance of adopting the Video Swin Transformer as an offline-extracted video encoder?
In Table 2, the other methods adopt C3D or I3D while yours uses the Video Swin Transformer, so it is not a fair comparison, right?