Understanding "mask" dtype from TransformerEncoder forward

Dear all,

I am working with the TransformerEncoder module. I understand that the "mask" parameter of the forward function can be a boolean tensor where True means a position is forbidden from being attended to and False means it is allowed.
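
If it helps to make that convention concrete, here is a minimal sketch (the model sizes and the causal-mask example are mine, not from the original post): a boolean mask where True marks the positions a query may not attend to.

```python
import torch
import torch.nn as nn

T = 5
# True above the diagonal: token i may only attend to positions <= i.
bool_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

encoder_layer = nn.TransformerEncoderLayer(d_model=16, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

src = torch.randn(2, T, 16)         # (batch, seq, d_model)
out = encoder(src, mask=bool_mask)  # a 2-D mask is shared across the batch
```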

In my case, I am working with a sequence-to-sequence mask. Initially, I was converting my mask to float with the following code:

```python
# 0 -> float('-inf') (blocked), 1 -> 0.0 (allowed)
mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
```

However, I realized it takes a lot of time (~0.60 seconds) when the context length is > 50. So I removed the float conversion and kept the boolean values, and it takes only ~0.06 seconds for an equivalent context length.
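
For reference, a rough way to reproduce that comparison (the sizes and the `timed` helper below are my own, not from the original measurement):

```python
import time
import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2).eval()

B, T = 8, 64
src = torch.randn(B, T, 64)
bool_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

def timed(mask):
    start = time.perf_counter()
    with torch.no_grad():
        encoder(src, mask=mask)
    return time.perf_counter() - start

t_bool = timed(bool_mask)  # boolean mask used directly

# Same mask as an additive float mask: -inf = blocked, 0.0 = allowed.
float_mask = torch.zeros(T, T).masked_fill(bool_mask, float('-inf'))
t_float = timed(float_mask)

print(f"bool: {t_bool:.4f}s  float: {t_float:.4f}s")
```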

Note that my masks have shape (batch_size, T, T) because each sample in the batch has a different mask.
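
As an aside, the nn.MultiheadAttention docs describe the 3-D attn_mask shape as (batch_size * num_heads, T, T) rather than (batch_size, T, T). If that applies to your version, one way to expand a per-sample mask is repeat_interleave; a sketch, with made-up sizes and num_heads assumed to match the layer's nhead:

```python
import torch
import torch.nn as nn

B, T, d_model, num_heads = 2, 5, 16, 4
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=num_heads, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

# One boolean mask per sample: True = do not attend.
mask = torch.zeros(B, T, T, dtype=torch.bool)
mask[:, :, -1] = True  # e.g. hide the last position from every query

# Repeat each sample's mask once per head: (B, T, T) -> (B * num_heads, T, T).
per_head_mask = mask.repeat_interleave(num_heads, dim=0)

src = torch.randn(B, T, d_model)
out = encoder(src, mask=per_head_mask)
```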

My question is: what is the key difference between a boolean and a float mask? Do they give the same results?

Thanks!

You might be running into this limitation, which disallows the fast path when floating-point masks are used.
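
The two forms are mathematically equivalent: a True entry in a boolean mask behaves like an additive float('-inf'), so the outputs should match up to floating-point noise; the difference is that only the boolean form is eligible for the fast path. A quick self-contained check (arbitrary sizes, dropout disabled via eval()):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
encoder_layer = nn.TransformerEncoderLayer(d_model=16, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2).eval()

T = 5
src = torch.randn(2, T, 16)
bool_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
float_mask = torch.zeros(T, T).masked_fill(bool_mask, float('-inf'))

with torch.no_grad():
    out_bool = encoder(src, mask=bool_mask)
    out_float = encoder(src, mask=float_mask)

print(torch.allclose(out_bool, out_float, atol=1e-6))  # expected: True
```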
