Block diagram representation of the model #290
-
Hi, thanks a lot for this awesome project. I am a beginner to ML and, obviously, to transformers. I saw the text below but didn't understand what this backbone looks like. Do you have a pictorial representation? I checked the code and couldn't understand much. Is it like either of the image links below? Maybe you can explain the data flow? For example, is the image first fed to ResNetV2 and then its output fed to the ViT? Or do ResNetV2 and the ViT process the image in parallel, with their outputs fed to the decoder at the end? If they are connected in parallel, are there any connections between the models (ResNetV2 and ViT) in between, or are they connected only at the end, i.e. at their outputs? https://www.researchgate.net/figure/The-illustration-of-components-in-Encoder-which-use-ResNet-18-as-the-backbone-network_fig2_365374515
-
Hi, sorry for the late reply, I didn't see the question.
The backbone is just a rather simple ResNet used as a feature extractor (basically a couple of conv layers with residual connections). The input image is fed into this CNN; the output is a smaller feature map, which is then split into patches and fed into the ViT. So the two run sequentially, not in parallel. The architecture is described in the original ViT paper (https://arxiv.org/pdf/2010.11929.pdf), Section 3.1.
The second image (https://production-media.paperswithcode.com/methods/Screen_Shot_2020-07-20_at_9.17.39_PM_ZHS2kmV.png) looks very similar to the architecture.
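The sequential data flow above (image → CNN → feature map → patch tokens → ViT) can be sketched as a shape walk-through. Note the concrete sizes (224×224 input, stride-16 backbone, 1024 channels) are illustrative assumptions in the spirit of the ViT paper's hybrid variant, not values taken from this repo's code:

```python
def hybrid_vit_shapes(img_hw=224, cnn_stride=16, cnn_channels=1024):
    """Trace tensor shapes through a hypothetical hybrid ViT backbone."""
    # 1. Input image: channels x height x width
    image = (3, img_hw, img_hw)
    # 2. The ResNet feature extractor downsamples spatially by `cnn_stride`,
    #    producing a smaller feature map
    feat_hw = img_hw // cnn_stride
    feature_map = (cnn_channels, feat_hw, feat_hw)
    # 3. The feature map is split into patches (in the hybrid setup each
    #    1x1 spatial position can serve as one patch), so the ViT receives
    #    feat_hw * feat_hw tokens, each of dimension cnn_channels
    tokens = (feat_hw * feat_hw, cnn_channels)
    return image, feature_map, tokens

image, fmap, tokens = hybrid_vit_shapes()
print(image)   # (3, 224, 224)
print(fmap)    # (1024, 14, 14)
print(tokens)  # (196, 1024)
```

The key point is that the ViT never sees raw pixels here: its "patch embedding" operates on the CNN's output feature map, so there is a single hand-off point between the two models rather than parallel branches.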