So, I have time-series data where my input sequences are of different lengths. For example, suppose I have a batch of three sequences of sizes [400, 39], [500, 39], [600, 39]. I want to use an nn.TransformerEncoder model for training. To feed the batch into the model, I will pad the sequences to length 600, so my batch would have shape [600, 3, 39] (here 39 is like the embedding size). Next, I want the model to train such that the ith position can attend to both earlier and later time instants, but not to the padded ones. I guess this is where the src_key_padding_mask comes into the picture. I also do not want the padded positions (the ones I added to reach length 600) to affect any of the values or to have any attention computed for them. Is src_key_padding_mask enough for this, or do I also need to use src_mask?

I also wanted to know whether I should initialize the padded values to 0 for this to work, or whether any number is fine. That is, for the first sequence of length 400, I would have to append 200 more [1, 39] vectors to reach length 600. Do those all have to be zeros?
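To make the question concrete, here is a small sketch of what I mean (my own toy construction, using the shapes from above): zero-pad the three sequences to length 600, build a boolean src_key_padding_mask of shape [batch, seq_len] with True at the padded positions, and pass both to the encoder.

```python
import torch
import torch.nn as nn

# Toy setup matching the question: three sequences of lengths
# 400, 500, 600, each with feature (embedding) size 39.
lengths = [400, 500, 600]
d_model = 39
max_len = max(lengths)

seqs = [torch.randn(n, d_model) for n in lengths]

# Zero-pad each sequence to max_len -> shape [600, 3, 39]
# (seq_len, batch, feature), the default layout of nn.TransformerEncoder.
padded = torch.zeros(max_len, len(seqs), d_model)
for i, s in enumerate(seqs):
    padded[: s.size(0), i] = s

# src_key_padding_mask: shape [batch, seq_len], True where padded.
pad_mask = torch.arange(max_len)[None, :] >= torch.tensor(lengths)[:, None]

# nhead must divide d_model; 3 divides 39.
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=3)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=1)

out = encoder(padded, src_key_padding_mask=pad_mask)
print(out.shape)  # torch.Size([600, 3, 39])
```

With this mask, no real position attends to a padded key, so (as far as I understand) the padded values never influence the outputs at real positions, and src_mask is not needed here since I do not want any causal restriction. The outputs at the padded positions themselves are still computed but meaningless, so the loss should ignore them.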
