Block diagram representation of the model #290
-
Hi, thanks a lot for this awesome project. I am a beginner to ML and, obviously, to transformers. I saw the text below but didn't understand what this backbone looks like. Do you have a pictorial representation? I checked the code and couldn't understand much. Is it like either of the image links below? Maybe you can explain the data flow? For example, is the image first fed to ResNetV2 and then its output fed to the ViT? Or do ResNetV2 and the ViT process the image in parallel, with their outputs fed to the decoder at the end? If they are connected in parallel, are there any connections between the models (ResNetV2 and ViT) in between, or are they connected only at the end, i.e. at their outputs? https://www.researchgate.net/figure/The-illustration-of-components-in-Encoder-which-use-ResNet-18-as-the-backbone-network_fig2_365374515
-
Hi, sorry for the late reply, I didn't see the question.
The backbone is just a rather simple ResNet used as a feature extractor (basically a couple of conv layers with residual connections). The input image is fed into this CNN; the output is a smaller feature map, which is then split into patches and fed into the ViT. So the two run sequentially, not in parallel. The architecture is described in the original ViT paper (https://arxiv.org/pdf/2010.11929.pdf), Section 3.1.
The second image (https://production-media.paperswithcode.com/methods/Screen_Shot_2020-07-20_at_9.17.39_PM_ZHS2kmV.png) looks very similar to the architecture.
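The sequential data flow above (image → CNN → feature map → patch tokens → ViT) can be sketched as a shape walk-through. Note the concrete sizes (224×224 input, stride-16 backbone, 1024 channels) are illustrative assumptions in the spirit of the ViT paper's hybrid variant, not values taken from this repo's code:

```python
def hybrid_vit_shapes(img_hw=224, cnn_stride=16, cnn_channels=1024):
    """Trace tensor shapes through a hypothetical hybrid ViT backbone."""
    # 1. Input image: channels x height x width
    image = (3, img_hw, img_hw)
    # 2. The ResNet feature extractor downsamples spatially by `cnn_stride`,
    #    producing a smaller feature map
    feat_hw = img_hw // cnn_stride
    feature_map = (cnn_channels, feat_hw, feat_hw)
    # 3. The feature map is split into patches (in the hybrid setup each
    #    1x1 spatial position can serve as one patch), so the ViT receives
    #    feat_hw * feat_hw tokens, each of dimension cnn_channels
    tokens = (feat_hw * feat_hw, cnn_channels)
    return image, feature_map, tokens

image, fmap, tokens = hybrid_vit_shapes()
print(image)   # (3, 224, 224)
print(fmap)    # (1024, 14, 14)
print(tokens)  # (196, 1024)
```

The key point is that the ViT never sees raw pixels here: its "patch embedding" operates on the CNN's output feature map, so there is a single hand-off point between the two models rather than parallel branches.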