Ask detailed questions about the permutation strategy in the RAR #48

Open
eanson023 opened this issue Nov 6, 2024 · 4 comments

@eanson023

Hello author, thank you for your excellent work and for generously open-sourcing it. After reading your source code, I have a small question: why doesn't the permutation strategy also shuffle the initial condition? After annealing, the order returns to the original permutation anyway. Here is the part of the source code that confuses me:

# cls_token, condition, the permute does not impact these prefix tokens.
prefix = 2
pos_embed_prefix = pos_embed[:, :prefix]
pos_embed_postfix = self.shuffle(pos_embed[:, prefix:prefix+self.image_seq_len], orders)

Is it because the target-aware positional embedding makes it unnecessary to shuffle the condition?

@pansanity666

pansanity666 commented Nov 6, 2024

I think it is because the prefix is not part of the fine-grained image content.

@cornettoyu
Collaborator

cornettoyu commented Nov 7, 2024

Thanks. As @pansanity666 said, the prefix refers to the class token and the condition token; we permute only the image tokens (referred to as the postfix in the code).
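
For readers following along, the prefix/postfix split described above can be illustrated with a minimal sketch. This is not the repository's implementation: plain Python lists stand in for the positional-embedding tensor, and the helper name `shuffle_postfix` and the toy slot labels are hypothetical, chosen only for illustration. The only details taken from the thread are that the first `prefix = 2` slots (cls_token and condition) stay in place while the image-token slots are reordered.

```python
import random


def shuffle_postfix(pos_embed, order, prefix=2):
    """Reorder only the image-token slots; keep the first `prefix` slots fixed.

    Hypothetical helper: a list stands in for the positional-embedding tensor.
    """
    head = pos_embed[:prefix]
    tail = pos_embed[prefix:prefix + len(order)]
    # `order` must be a permutation of the image-token indices 0..len(tail)-1.
    if sorted(order) != list(range(len(tail))):
        raise ValueError("order must be a permutation of the image-token indices")
    return head + [tail[i] for i in order]


# Toy sequence: two prefix slots followed by four image-token slots.
pos_embed = ["cls", "cond", "img0", "img1", "img2", "img3"]

rng = random.Random(0)
order = list(range(4))
rng.shuffle(order)  # a random generation order for the image tokens

shuffled = shuffle_postfix(pos_embed, order)

# With the identity order, the sequence is unchanged, which is why the
# prefix never needs to move for the original order to be recovered.
identity = shuffle_postfix(pos_embed, list(range(4)))
```

In this sketch the prefix slots are identical in `shuffled` and `identity`; only the image-token slots differ, which matches the explanation that the prefix is not part of the permuted image content.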

@Jiawei-Yang

Thanks for your great work! @cornettoyu I’m wondering about the difference between cls_token and condition_token. Shouldn't the condition_token just be some class_id tokens with a randomly augmented none_class token? Is there any particular reason to use a separate cls_token?

Thanks!

@cornettoyu
Collaborator

Hi,

I see that the terms could be confusing and misleading. To clarify, here condition_token denotes the external condition input that guides generation (e.g., an ImageNet class), while cls_token refers to the learnable placeholder token used in the original ViT paper. Since we try to minimize architecture changes, we keep the cls_token as in the original ViT, but removing it should be fine.
