Accessing attention weights for ViT models


How can I get the attention weights of the available pre-trained ViT models? (In my case I'm using `vit_b_16`.)

I ultimately want to visualize an attention map over the original input image.

I've read the source code for the VisionTransformer class, and from what I understand the `need_weights` flag is explicitly set to `False` when the attention module is called inside the EncoderBlock, so the attention weights are never returned.