How to avoid nan in softmax?

I need to compute a softmax over a two-dimensional matrix w of size batch * seq_length. The sequences have different lengths, which are encoded by a mask matrix mask_d, also of size batch * seq_length.

I have written the following code; however, it produces all NaNs after a couple of iterations. Is there a better way to implement this, or is there an existing SoftMax implementation in PyTorch that can handle a batch of variable-length sequences via a mask and is numerically stable?

w_max = torch.max(w, 1)[0]              # per-row maximum
w_max = w_max.expand_as(w)              # broadcast back to batch * seq_length
w_max.data[w_max.data < 0] = 0          # clamp negative maxima to zero
w = torch.exp(w - w_max)                # shifted exponentials
w_sum = torch.sum(w * mask_d, 1)        # sum only over non-padded positions
w_sum = w_sum.expand_as(w)
w = w / w_sum * mask_d                  # normalize and zero out padding

You can use nn.LogSoftmax; it is numerically more stable and less likely to produce NaNs than Softmax.
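
For reference, a minimal sketch of that suggestion (assuming the current PyTorch API with the dim= argument; the exp() at the end is only needed if you actually want probabilities rather than log-probabilities):

import torch
import torch.nn as nn

log_softmax = nn.LogSoftmax(dim=1)   # normalize over the sequence dimension
w = torch.randn(4, 10)               # batch * seq_length scores
log_probs = log_softmax(w)           # numerically stable log-probabilities
probs = log_probs.exp()              # recover softmax values if needed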

But if I just want SoftMax instead of LogSoftMax, what should I do? Also, SoftMax does not let me do batched operations over sequences of variable length, so I have to define my own softmax operation.

You can do Softmax, but that operation is inherently numerically unstable. That is why I suggested LogSoftmax.
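
A toy example of the failure mode (the values are exaggerated on purpose): in float32 the small probability underflows to exactly zero, so taking a log afterwards gives -inf, while log_softmax stays finite.

import torch

x = torch.tensor([[0.0, 200.0]])
p = torch.softmax(x, dim=1)          # the small entry underflows to exactly 0 in float32
print(torch.log(p))                  # contains -inf, which easily turns into NaN downstream
print(torch.log_softmax(x, dim=1))   # roughly [-200, 0], still finite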

Thanks, but I think you misunderstood my question. I am working on a batch_size * max_sequence_length matrix where the sequences have variable lengths, which is why I need a mask matrix to mask out the padding elements. It seems that neither Softmax nor LogSoftmax supports this kind of masked softmax. And if I use LogSoftmax, should I apply exp() to convert the result back to a softmax, and do you mean that this will work?
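
Neither module takes a mask directly, but one common workaround (just a sketch; masked_softmax is a hypothetical helper, and it assumes mask_d holds 1.0 for real tokens and 0.0 for padding) is to push the padded scores to a very negative value before the log-softmax and then exponentiate:

import torch
import torch.nn.functional as F

def masked_softmax(w, mask_d):
    # mask_d: 1.0 for real positions, 0.0 for padding, same shape as w
    w = w.masked_fill(mask_d == 0, -1e9)   # padded scores become effectively -inf
    log_probs = F.log_softmax(w, dim=1)    # stable: the max is subtracted internally
    return log_probs.exp() * mask_d        # exp() recovers the softmax; the mask zeroes any residue

w = torch.randn(2, 5)
mask_d = torch.tensor([[1., 1., 1., 0., 0.],
                       [1., 1., 1., 1., 1.]])
probs = masked_softmax(w, mask_d)          # each row sums to 1 over the unmasked positions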

I guess w_sum is zero in the last step; if you print it, you can find out. Also, I'm wondering why you do w_max.data[w_max.data < 0] = 0.
You could try smoothing tricks, like adding eps * seq_length to w_sum and eps to w, but I think it would be better to find the cause of this problem in your model.
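
Applied to the original snippet, that smoothing trick might look roughly like this (eps is a small constant picked purely for illustration, and keepdim=True stands in for the expand_as calls):

import torch

eps = 1e-8                                        # small constant, value chosen for illustration
w = torch.randn(2, 5)                             # batch * seq_length scores
mask_d = torch.tensor([[1., 1., 1., 0., 0.],
                       [1., 1., 1., 1., 1.]])

w_max = torch.max(w, dim=1, keepdim=True)[0]      # per-row maximum
w_exp = torch.exp(w - w_max) * mask_d             # mask before summing so padding adds nothing
lengths = mask_d.sum(dim=1, keepdim=True)         # actual sequence lengths
w_sum = w_exp.sum(dim=1, keepdim=True) + eps * lengths
probs = (w_exp + eps) / w_sum * mask_d            # eps keeps the denominator away from zero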

Hi, did you find a solution? Is it possible to just add a very small number to the denominator?
