NaN gradient with MultiHeadAttention when passing a key mask

While using MultiHeadAttention, I got the following error message when running with autograd.detect_anomaly():

Function 'PowBackward0' returned nan values in its 0th output.
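
I was running the forward and backward pass inside the anomaly detection context manager, roughly like this (a sketch; model, criterion, batch and target are placeholder names, not my actual code):

with torch.autograd.detect_anomaly():
    output = model(batch)             # forward pass
    loss = criterion(output, target)  # compute the loss
    loss.backward()                   # backward pass, where the NaN is reported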

The error seemed related to the mask I was passing. I was calculating the mask like so:

# arange < length gives True for valid tokens; invert it so that True marks
# the padded positions, which is what key_padding_mask expects to ignore
mask = torch.arange(question_mids.shape[1]).repeat((question_mids.shape[0], 1)).to(children_lengths.device) < children_lengths
mask = ~mask

# nn.MultiheadAttention expects (seq_len, batch, embed_dim) by default
question_mids = question_mids.permute(1, 0, 2)
question_mids, attention = self.attn(question_mids, question_mids, question_mids, key_padding_mask=mask)

I finally figured out that the problem was related to ~mask, so I simply changed the comparison in the line before it to avoid the negation operator. But I am still wondering why this negation might cause gradient problems.
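
For reference, the fix amounted to building the mask directly with the flipped comparison, roughly like this (a sketch of the change, not my exact line):

mask = torch.arange(question_mids.shape[1]).repeat((question_mids.shape[0], 1)).to(children_lengths.device) >= children_lengths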

Thanks,