
Combining Features of Swin Transformer and other Features #53

Open
adeljalalyousif opened this issue Nov 9, 2023 · 0 comments

adeljalalyousif commented Nov 9, 2023

Hello, thank you for sharing your code. Could you help me with the following scenario?
For a video captioning model, I sample 16 frames from each video. I use a Video Swin Transformer to extract video features, giving a tensor of shape (batch_size, 768, 4, 7, 7), and a 2D-CNN to extract frame-level features, giving a tensor of shape (batch_size, 16, 768). I now need to concatenate these two sets of features into a combined representation of shape (batch_size, seq_len, feat_dim) or similar, so that I can feed it to a traditional transformer encoder-decoder (self-attention and multi-head attention). How can I perform this concatenation so the features integrate seamlessly into the model?
Put more simply: how should the word-embedding tensor (batch_size, seq_len, word_dim) and the Swin tensor (batch_size, 768, 4, 7, 7) be preprocessed before being passed to the self-attention or multi-head attention blocks of a traditional transformer?
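For reference, here is a minimal PyTorch sketch of one common approach: flatten the spatio-temporal grid of the Swin tensor into a token sequence and concatenate it with the frame-level tokens along the sequence dimension. The tensor layout (B, C, T, H, W) for the Swin output and the projection dimension word_dim = 512 are assumptions for illustration, not taken from the repository.

```python
import torch
import torch.nn as nn

batch_size = 2

# Video Swin Transformer features: (B, C, T, H, W) = (B, 768, 4, 7, 7)
swin_feats = torch.randn(batch_size, 768, 4, 7, 7)

# 2D-CNN frame-level features: (B, num_frames, C) = (B, 16, 768)
cnn_feats = torch.randn(batch_size, 16, 768)

# Flatten the spatio-temporal grid into a token sequence:
# (B, C, T, H, W) -> (B, C, T*H*W) -> (B, T*H*W, C) = (B, 196, 768)
swin_tokens = swin_feats.flatten(2).transpose(1, 2)

# Concatenate along the sequence dimension: (B, 196 + 16, 768) = (B, 212, 768)
video_tokens = torch.cat([swin_tokens, cnn_feats], dim=1)
print(video_tokens.shape)  # torch.Size([2, 212, 768])

# If the decoder's word-embedding dim differs from 768, project one side first
# with a learned linear layer (word_dim = 512 is a hypothetical value):
proj = nn.Linear(768, 512)
encoder_input = proj(video_tokens)  # (B, 212, 512)
```

With this layout, positional embeddings would typically be added to video_tokens before the encoder, while the word embeddings (batch_size, seq_len, word_dim) feed the decoder unchanged; cross-attention then connects the two.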
