
Combining Features of Swin Transformer and other Features #53

Open
adeljalalyousif opened this issue Nov 9, 2023 · 0 comments

adeljalalyousif commented Nov 9, 2023

Hello, thank you for sharing your code. Could you help me with the following scenario?
For a video captioning model, I sample 16 frames from each video. I use a Video Swin Transformer to extract video features, giving a tensor of shape (batch_size, 768, 4, 7, 7), and a 2D-CNN to extract frame-level features, giving a tensor of shape (batch_size, 16, 768). I now need to concatenate these two sets of features into a combined representation of shape (batch_size, seq_len, feat_dim) or similar, so that I can feed it to a traditional transformer encoder-decoder (self-attention and multi-head attention). How can I perform this concatenation so the features integrate seamlessly into the model?
Put more simply: how should the word-embedding tensor (batch_size, seq_len, word_dim) and the Swin tensor (batch_size, 768, 4, 7, 7) be preprocessed before being passed to the self-attention or multi-head attention blocks of a traditional transformer?
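For reference, here is a minimal PyTorch sketch of one common approach: flatten the spatio-temporal grid of the Swin tensor into a token sequence and concatenate it with the frame-level tokens along the sequence dimension. The tensor layout (B, C, T, H, W) for the Swin output and the projection dimension word_dim = 512 are assumptions for illustration, not taken from the repository.

```python
import torch
import torch.nn as nn

batch_size = 2

# Video Swin Transformer features: (B, C, T, H, W) = (B, 768, 4, 7, 7)
swin_feats = torch.randn(batch_size, 768, 4, 7, 7)

# 2D-CNN frame-level features: (B, num_frames, C) = (B, 16, 768)
cnn_feats = torch.randn(batch_size, 16, 768)

# Flatten the spatio-temporal grid into a token sequence:
# (B, C, T, H, W) -> (B, C, T*H*W) -> (B, T*H*W, C) = (B, 196, 768)
swin_tokens = swin_feats.flatten(2).transpose(1, 2)

# Concatenate along the sequence dimension: (B, 196 + 16, 768) = (B, 212, 768)
video_tokens = torch.cat([swin_tokens, cnn_feats], dim=1)
print(video_tokens.shape)  # torch.Size([2, 212, 768])

# If the decoder's word-embedding dim differs from 768, project one side first
# with a learned linear layer (word_dim = 512 is a hypothetical value):
proj = nn.Linear(768, 512)
encoder_input = proj(video_tokens)  # (B, 212, 512)
```

With this layout, positional embeddings would typically be added to video_tokens before the encoder, while the word embeddings (batch_size, seq_len, word_dim) feed the decoder unchanged; cross-attention then connects the two.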
