Hey!
How can I get the attention weights of the available pre-trained ViT models? In my case I'm using vit_b_16.
I ultimately want to visualize an attention map over the original input image.
I've read the source code for the VisionTransformer class, and from what I understand the need_weights
flag is explicitly set to False
when self_attention is called inside the EncoderBlock.