How does masking work for Transformers? Are they really not seeing future values?

Hi everyone,

I was using a Transformer Encoder to predict future values of a time series. Since I wanted the transformer to take 200 past values into account and predict the next 50, I used a (250x250) mask in which the last 50 values of each row were -inf, to hide the future values. However, I have now realized that when I put zeros in the last 50 values, the Transformer crashes. I tried with the complete transformer (encoder + decoder) and the same thing happens. Does anyone know what might be happening, or whether I am doing something wrong?
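To make the setup concrete, here is a minimal sketch of what I am doing, assuming PyTorch's additive attention-mask convention (0 = attend, -inf = hide); the model sizes are just placeholders:

```python
import torch
import torch.nn as nn

seq_len, past, d_model = 250, 200, 32

# Additive attention mask: 0 = attend, -inf = hide.
# Every row hides the last 50 (future) positions.
mask = torch.zeros(seq_len, seq_len)
mask[:, past:] = float('-inf')

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4),
    num_layers=1,
)
src = torch.rand(seq_len, 1, d_model)  # (seq, batch, feature)
out = encoder(src, mask=mask)
```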

Thanks in advance to everyone

Update: I now see that a triangular mask, like the one generally used in NLP, works, but a square mask like the one I created does not. It behaves as if the triangular mask and the square one were multiplied: the positions where either of the two masks is zero are used by the transformer. I thought you could use a mask of any shape you want; if anyone can help, I would really appreciate it.
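For reference, this is how I understand the two masks differ, again assuming the additive convention (0 = attend, -inf = hide); the names `tri` and `sq` are just for illustration:

```python
import torch

seq_len, past = 250, 200

# Triangular (causal) mask, as used in NLP:
# position i may only attend to positions <= i.
tri = torch.triu(
    torch.full((seq_len, seq_len), float('-inf')), diagonal=1)

# My square mask: every row hides the last 50 (future) columns.
sq = torch.zeros(seq_len, seq_len)
sq[:, past:] = float('-inf')
```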

Thank you all in advance