Apply mask softmax

Good question, I am also curious about this. My intuition is that, given enough model capacity (hidden feature size), the model should still be able to learn.

Is there any benefit in using masked_fill instead of doing:
vec[(1 - mask).bool()] = float('-inf')
F.softmax(vec, dim=1)
besides that it has fewer lines?
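
For reference, here is a minimal sketch of the two variants I'm comparing (the tensors are made-up examples):

import torch
import torch.nn.functional as F

vec = torch.randn(2, 5)                       # attention scores
mask = torch.tensor([[1, 1, 1, 0, 0],
                     [1, 1, 0, 0, 0]])        # 1 = keep, 0 = pad

# Variant 1: boolean-index assignment, then softmax
a = vec.clone()
a[(1 - mask).bool()] = float('-inf')
out1 = F.softmax(a, dim=1)

# Variant 2: masked_fill (out-of-place), then softmax
out2 = F.softmax(vec.masked_fill(mask == 0, float('-inf')), dim=1)

print(torch.allclose(out1, out2))             # True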

Many thanks for the answer!
However, according to the question, I think the mask step should be reversed, like:

A_exp = A_exp * (A != 0).type(torch.FloatTensor) # this step masks

Right?

Also, what is the purpose of calculating the exp of A - A_max instead of A? (The final answer is identical either way.)
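
Putting it together, this is the version I have in mind (a minimal sketch; I'm assuming A == 0 marks the padded positions, as above):

import torch

A = torch.tensor([[2.0, 1.0, 0.0, 0.0],
                  [3.0, 0.0, 0.0, 0.0]])      # 0 marks padded positions

A_max = A.max(dim=1, keepdim=True)[0]
A_exp = torch.exp(A - A_max)                  # subtract the row max before exp
A_exp = A_exp * (A != 0).float()              # this step masks the padded entries
A_softmax = A_exp / A_exp.sum(dim=1, keepdim=True)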

Theoretically it’s okay, but this may cause unintended NaNs (for now), as discussed in:

The most common way is to replace PAD (0) with a large negative number (e.g., -1e9) before the softmax. For example:

attn_mask = input == 0
scores.masked_fill(attn_mask, -1e9)
attention = F.softmax(scores, dim=1)

You do have to use “where” to remove the 0.


Thanks for sharing the code! However, the reason for doing the $e^{A - A_{\max}}$ trick is numerical stability. In your version, directly computing A.exp() / A.exp().sum(dim) will overflow in some cases.
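
A small sketch of what goes wrong without the trick (float32 scores):

import torch

A = torch.tensor([10.0, 50.0, 100.0])

naive = A.exp() / A.exp().sum()               # exp(100) overflows float32, so the result contains nan
stable = (A - A.max()).exp() / (A - A.max()).exp().sum()  # ~[0, 0, 1], no overflow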

Thanks for sharing your code! Tiny typo: it should be the in-place masked_fill_ :slight_smile:
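
For completeness, a corrected sketch (the names mirror the snippet above; either the in-place call or reassigning the return value works):

import torch
import torch.nn.functional as F

scores = torch.randn(2, 4)
attn_mask = torch.tensor([[False, False, True, True],
                          [False, True, True, True]])    # True where input == 0 (PAD)

scores.masked_fill_(attn_mask, -1e9)          # in-place; or: scores = scores.masked_fill(attn_mask, -1e9)
attention = F.softmax(scores, dim=1)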

I think we need to include the epsilon, because sometimes you have an input where all elements are masked.
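
Something like the following is what I mean (a sketch; the eps value 1e-10 is arbitrary):

import torch

A = torch.tensor([[2.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0]])           # second row is fully padded

mask = (A != 0).float()
A_exp = torch.exp(A - A.max(dim=1, keepdim=True)[0]) * mask
A_softmax = A_exp / (A_exp.sum(dim=1, keepdim=True) + 1e-10)  # eps avoids 0/0 -> nan in the fully masked row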