Hello, thank you for sharing your code. Can you help me with this scenario?
For a video captioning model, I have sampled 16 frames from each video. I've used a Video Swin Transformer to extract video features, giving a tensor of shape (batch_size, 768, 4, 7, 7), and a 2D-CNN to extract frame-level features, giving a tensor of shape (batch_size, 16, 768). I now need to concatenate these two sets of features into a combined representation of shape (batch_size, seq_len, feat_dim) or similar, so that I can feed them to a traditional transformer encoder-decoder (self-attention and multi-head attention). How can I perform this concatenation so the features integrate cleanly into the model? A minimal sketch of what I'm attempting is below.
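For concreteness, here is a small sketch of the reshaping I have in mind (tensor values and the batch size are illustrative placeholders, not from your code): flatten the Swin feature map's spatio-temporal grid (4 × 7 × 7 = 196 positions) into a token sequence, then concatenate it with the 16 frame tokens along the sequence dimension.

```python
import torch

batch_size = 2
swin_feats = torch.randn(batch_size, 768, 4, 7, 7)  # (B, C, T, H, W) from Video Swin
frame_feats = torch.randn(batch_size, 16, 768)      # (B, 16, 768) from the 2D-CNN

# Flatten the spatio-temporal grid into tokens:
# (B, 768, 4, 7, 7) -> (B, 768, 196) -> (B, 196, 768)
swin_tokens = swin_feats.flatten(2).permute(0, 2, 1)

# Concatenate along the sequence dimension:
# (B, 196, 768) + (B, 16, 768) -> (B, 212, 768)
combined = torch.cat([swin_tokens, frame_feats], dim=1)
print(combined.shape)  # torch.Size([2, 212, 768])
```

Is this the right way to build the (batch_size, seq_len, feat_dim) sequence, or should the two feature types be fused differently?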
Or, put more simply: how are the word-embedding tensor (batch_size, seq_len, word_dim) and the Swin tensor (batch_size, 768, 4, 7, 7) preprocessed before being passed to the self-attention and multi-head attention blocks of a traditional transformer? A sketch of my current understanding follows.
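Here is how I currently imagine the two streams meeting a standard encoder-decoder, assuming the word embeddings are linearly projected to the visual feature dimension (the vocabulary size, word dimension, and layer counts below are my own placeholders; positional encodings are omitted for brevity):

```python
import torch
import torch.nn as nn

d_model = 768
vocab_size = 10000  # hypothetical vocabulary size
word_dim = 300      # hypothetical word-embedding size

# Visual tokens from the concatenation sketch above: (B, 212, 768).
visual_tokens = torch.randn(2, 212, d_model)

# Word embeddings (B, seq_len, word_dim), projected to d_model so they
# match the visual features inside the attention blocks.
captions = torch.randint(0, vocab_size, (2, 20))
embed = nn.Embedding(vocab_size, word_dim)
proj = nn.Linear(word_dim, d_model)
word_tokens = proj(embed(captions))  # (B, 20, 768)

# Standard encoder-decoder: self-attention over the visual tokens in the
# encoder; masked self-attention plus cross-attention in the decoder.
transformer = nn.Transformer(d_model=d_model, nhead=8,
                             num_encoder_layers=2, num_decoder_layers=2,
                             batch_first=True)
tgt_mask = transformer.generate_square_subsequent_mask(word_tokens.size(1))
out = transformer(src=visual_tokens, tgt=word_tokens, tgt_mask=tgt_mask)
print(out.shape)  # torch.Size([2, 20, 768])
```

Does this match the preprocessing your model expects, or is there an additional projection or normalization step before the attention blocks?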