Transformer masks explanation?

Can somebody please point me to a tutorial with a clear explanation of what each of the TransformerEncoder/TransformerDecoder mask parameters does, and when one should use each of them?

Specifically,

  • Which mask should I use for invalid tokens in TransformerEncoder input?
  • Same for invalid tokens in TransformerDecoder input?
  • Which mask should I use to deal with invalid “memory” entries I need to pass to TransformerDecoder?

So far I have tried src_key_padding_mask, tgt_key_padding_mask, and memory_key_padding_mask, respectively, but I am getting output tensors consisting entirely of NaNs.

Thanks!

This is the one I usually refer to: https://pytorch.org/tutorials/beginner/transformer_tutorial.html
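As a quick sketch of how the three padding masks are typically wired up (toy sizes, assuming the default batch_first=False layout): each *_key_padding_mask is a (batch, seq) boolean tensor with True at padding positions, and memory_key_padding_mask is usually just the source padding mask reused, since the memory comes from the encoder. One common cause of an all-NaN output is a row that is masked out entirely (every key masked), because the attention softmax over all -inf values produces NaN.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

S, T, N, E = 5, 4, 2, 8  # src len, tgt len, batch size, d_model (toy values)
model = nn.Transformer(d_model=E, nhead=2,
                       num_encoder_layers=1, num_decoder_layers=1,
                       dim_feedforward=16)

src = torch.randn(S, N, E)  # (seq, batch, feature) -- batch_first=False default
tgt = torch.randn(T, N, E)

# Boolean padding masks, shape (batch, seq): True marks a PAD position to ignore.
src_key_padding_mask = torch.zeros(N, S, dtype=torch.bool)
src_key_padding_mask[0, 3:] = True   # last two src tokens of sample 0 are padding

tgt_key_padding_mask = torch.zeros(N, T, dtype=torch.bool)
tgt_key_padding_mask[0, 3:] = True   # last tgt token of sample 0 is padding

# Causal mask so each target position attends only to earlier positions.
tgt_mask = model.generate_square_subsequent_mask(T)

out = model(src, tgt,
            tgt_mask=tgt_mask,
            src_key_padding_mask=src_key_padding_mask,
            tgt_key_padding_mask=tgt_key_padding_mask,
            # memory is the encoder output, so it reuses the src padding mask
            memory_key_padding_mask=src_key_padding_mask)

print(out.shape)  # torch.Size([4, 2, 8]), i.e. (T, N, E)
```

Note that no sequence here is fully padded, so every attention row still has at least one valid key and the output stays NaN-free.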