Suppose I want to feed an image through a few stacked ResNet blocks. How should I stack the blocks in code, and how can I attach self-attention modules to them? Could you suggest some example code for an architecture like this?
These are natural images.
I am open to using either ResNet-18 or ResNet-50.
Also, at the very end, how does the final embedding combine the outputs of all these blocks and attention modules?
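To make the question concrete, here is a rough sketch of the kind of architecture I have in mind, assuming PyTorch. The class names (`BasicBlock`, `SpatialSelfAttention`, `StackedEncoder`), the stage widths, and the embedding size are all placeholders I made up for illustration. The block follows the ResNet-18 style (two 3x3 convolutions with a skip connection), and the final step, where I average-pool each stage, concatenate the results, and project them, is only my guess at one way the embedding could combine everything.

```python
import torch
import torch.nn as nn


class BasicBlock(nn.Module):
    """ResNet-18-style residual block: two 3x3 convs plus a skip connection."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU()
        # 1x1 projection on the skip path when the shape changes.
        self.skip = None
        if stride != 1 or in_ch != out_ch:
            self.skip = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        identity = x if self.skip is None else self.skip(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)


class SpatialSelfAttention(nn.Module):
    """Self-attention over the H*W positions of a feature map (residual)."""

    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C)
        normed = self.norm(tokens)
        attended, _ = self.attn(normed, normed, normed, need_weights=False)
        tokens = tokens + attended  # residual around the attention
        return tokens.transpose(1, 2).reshape(b, c, h, w)


class StackedEncoder(nn.Module):
    """Stem, then (residual block + attention) stages, then one embedding."""

    def __init__(self, widths=(64, 128, 256), embed_dim=128):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, widths[0], 7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(widths[0]),
            nn.ReLU(),
            nn.MaxPool2d(3, stride=2, padding=1),
        )
        stages, in_ch = [], widths[0]
        for i, out_ch in enumerate(widths):
            stride = 1 if i == 0 else 2  # downsample from the second stage on
            stages.append(nn.Sequential(
                BasicBlock(in_ch, out_ch, stride),
                SpatialSelfAttention(out_ch),
            ))
            in_ch = out_ch
        self.stages = nn.ModuleList(stages)
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Assumption: the embedding is a linear map of all pooled stage outputs.
        self.proj = nn.Linear(sum(widths), embed_dim)

    def forward(self, x):
        x = self.stem(x)
        pooled = []
        for stage in self.stages:
            x = stage(x)
            pooled.append(self.pool(x).flatten(1))  # (B, C_stage)
        return self.proj(torch.cat(pooled, dim=1))  # (B, embed_dim)
```

With this sketch, `StackedEncoder()(torch.randn(2, 3, 64, 64))` should return a tensor of shape `(2, 128)`. One thing I am unsure about is cost: attention is quadratic in the number of spatial positions, so for large inputs it may only be practical in the later, lower-resolution stages.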