Stacking a couple of ResNet blocks, each with a self-attention module

Imagine I want to feed an image through a couple of stacked ResNet blocks. How should I stack the ResNet blocks in code, and how can I attach self-attention modules to them? Could you suggest some example code that resembles this architecture?

These are natural images.
I am open to using either resnet18 or resnet50.

Also, at the very end, how should the final embedding combine the outputs of all of these attention-augmented blocks?
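To make the question concrete, here is a rough sketch of what I have in mind so far. It assumes torchvision's resnet18 stages and wraps each stage with a small self-attention block over the spatial positions; the `SpatialSelfAttention` module, the choice to attach attention after every stage, and the average-pool + linear projection at the end are all my own guesses, not something taken from a reference:

```python
import torch
import torch.nn as nn
from torchvision import models


class SpatialSelfAttention(nn.Module):
    """My guess at a self-attention block over the H*W spatial positions of a feature map."""

    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=channels, num_heads=num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)        # (B, H*W, C) sequence of spatial tokens
        attn_out, _ = self.attn(seq, seq, seq)    # self-attention: query = key = value
        seq = self.norm(seq + attn_out)           # residual connection + layer norm
        return seq.transpose(1, 2).reshape(b, c, h, w)


class ResNetWithAttention(nn.Module):
    """resnet18 stem + residual stages, with a self-attention block after each stage."""

    def __init__(self, embed_dim=256):
        super().__init__()
        backbone = models.resnet18(weights=None)  # resnet50 would use 256/512/1024/2048 channels instead
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
        self.layer1, self.attn1 = backbone.layer1, SpatialSelfAttention(64)
        self.layer2, self.attn2 = backbone.layer2, SpatialSelfAttention(128)
        self.layer3, self.attn3 = backbone.layer3, SpatialSelfAttention(256)
        self.layer4, self.attn4 = backbone.layer4, SpatialSelfAttention(512)
        self.pool = nn.AdaptiveAvgPool2d(1)       # my guess: average-pool the last feature map
        self.proj = nn.Linear(512, embed_dim)     # and project it to the final embedding

    def forward(self, x):
        x = self.stem(x)
        x = self.attn1(self.layer1(x))
        x = self.attn2(self.layer2(x))
        x = self.attn3(self.layer3(x))
        x = self.attn4(self.layer4(x))
        x = self.pool(x).flatten(1)               # (B, 512)
        return self.proj(x)                       # (B, embed_dim) embedding


model = ResNetWithAttention()
emb = model(torch.randn(2, 3, 224, 224))          # emb.shape == (2, 256)
```

Is this roughly the right structure, or would you only attach attention to the later stages? (The early feature maps have many spatial positions, so attention there looks expensive.)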

@ptrblck I think I want to do something like this, except I am unsure how to add self-attention to the intermediate ResNet blocks.

Also, for self-attention there are so many options. Is there a built-in module within PyTorch that you would suggest? (My own attempt with nn.MultiheadAttention is sketched after the list below.)
I found these alternatives; what is your take?

  1. Attention in image classification - #3 by AdilZouitine
  2. GitHub - Chenglin-Yang/LESA: Locally Enhanced Self-Attention: Rethinking Self-Attention as Local and Context Terms
  3. GitHub - leaderj1001/Stand-Alone-Self-Attention: Implementing Stand-Alone Self-Attention in Vision Models using Pytorch
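For context, the closest built-in option I found is nn.MultiheadAttention. Here is a minimal sketch of how I imagine applying it to a ResNet feature map; the example shapes and the flatten/reshape steps are my own assumptions:

```python
import torch
import torch.nn as nn

feat = torch.randn(2, 128, 28, 28)                  # (B, C, H, W) feature map from a ResNet stage
tokens = feat.flatten(2).transpose(1, 2)            # (B, H*W, C) sequence of spatial tokens
mha = nn.MultiheadAttention(embed_dim=128, num_heads=4, batch_first=True)
out, attn_weights = mha(tokens, tokens, tokens)     # self-attention over the spatial positions
out = out.transpose(1, 2).reshape_as(feat)          # back to (B, C, H, W)
```

Would you recommend sticking with this built-in module, or is one of the repos above a better fit for natural images?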