Question about masked MultiheadAttention
I am training a ViT encoder. For each input I have two masks: the first marks the padding area, and the second is meant to make the attention module attend more or less to specific parts.

Input: (1, 706, 384) → 1: batch size, 706: number of patch embeddings, 384: embedding dimension
padding_mask: (1, 706): binary values indicating which patches correspond to padding
input_mask: (1, 706): float values in [0, 1] marking how important each patch is

How should I define a self-attention layer in PyTorch for this?
Should I combine the two into one mask (0 for padding, a float value for importance) and pass it as key_padding_mask? Or do I need to use attn_mask?
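For context, here is a minimal sketch of the two options I am considering with `nn.MultiheadAttention`. The shapes match my setup; the number of heads (8) and the use of `log()` to turn the multiplicative importance weights into additive attention biases are assumptions on my part, not something I am sure is correct:

```python
import torch
import torch.nn as nn

# Shapes from my setup; num_heads=8 is an assumption for illustration
B, N, D, H = 1, 706, 384, 8

mha = nn.MultiheadAttention(embed_dim=D, num_heads=H, batch_first=True)

x = torch.randn(B, N, D)                     # patch embeddings
# Float key_padding_mask: -inf entries are ignored, 0 entries are kept
padding_mask = torch.zeros(B, N)
padding_mask[:, 600:] = float("-inf")        # e.g. the last patches are padding
input_mask = torch.rand(B, N).clamp(min=1e-6)  # importance weights in (0, 1]

# Option A: only handle padding, via key_padding_mask
out_a, _ = mha(x, x, x, key_padding_mask=padding_mask)

# Option B: additionally fold importance into a float attn_mask
# (assumption: log() converts multiplicative weights into additive score biases)
bias = input_mask.log()                            # (B, N), one bias per key
attn_mask = bias[:, None, :].expand(B, N, N)       # (B, L, S), same bias for every query
attn_mask = attn_mask.repeat_interleave(H, dim=0)  # (B*H, L, S) as MHA expects
out_b, _ = mha(x, x, x, key_padding_mask=padding_mask, attn_mask=attn_mask)

print(out_a.shape, out_b.shape)  # both torch.Size([1, 706, 384])
```

Is option B a reasonable way to use attn_mask here, or is there a more standard approach?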

Thank you in advance.