More stable softmax with temperature

I wrote a seq2seq model and tried to implement minimum risk training (Eq. (13) in the paper Minimum Risk Training for Neural Machine Translation).
I added
torch.autograd.set_detect_anomaly(True) at the beginning of the model.
It output an error:

RuntimeError: Function 'ExpBackward' returned nan values in its 0th output.

According to the traceback, it has something to do with the second line of the code below:
seq_nll = seq_nll - torch.max(seq_nll, dim=-1)[0].unsqueeze(1)
seq_probs = torch.pow(torch.exp(seq_nll), 0.005)
normalizer = torch.sum(seq_probs, dim=-1).view(-1, 1)
seq_nll is a tensor of shape (64, 3) containing very negative numbers like [-94.5122, -50.0515, -76.2685].
These numbers are log-likelihoods of different sequences.
The exp operation obtains the probabilities of those sequences.
The power operation re-scales the probabilities.
The normalizer is the sum of those re-scaled probabilities.

I guess the problem here is related to those very negative numbers.
Is there a stable way to implement the above code?

If I understand it correctly:
torch.pow(torch.exp(seq_nll), 0.005) is equivalent to torch.exp(seq_nll * 0.005)
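A quick pure-Python sketch of that identity, and of why the exp-then-pow order fails for very negative inputs (the threshold values here are illustrative; in float32, PyTorch's default, exp already underflows to zero around inputs below roughly -103):

```python
import math

a = 0.005  # the scaling exponent from the question

# For moderately negative inputs the two forms agree:
x = -94.5122
assert abs(math.exp(x) ** a - math.exp(x * a)) < 1e-9

# But for very negative inputs, exp(x) underflows to exactly 0.0,
# and 0.0 ** a is 0.0 -- the scaling information is lost, and the
# backward pass of pow at 0 produces inf/NaN, which anomaly detection
# then reports inside ExpBackward:
x = -800.0
assert math.exp(x) == 0.0          # underflow to zero
assert math.exp(x) ** a == 0.0     # exp-then-pow: probability mass gone
assert math.exp(x * a) > 0.018     # pow folded into exp: exp(-4), still fine
```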

So you can directly use torch.nn.functional.softmax with seq_nll * 0.005 as the input.
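As a pure-Python sketch (using one row of seq_nll from the question), this is what softmax on the scaled scores computes; the max-subtraction before exp is the standard log-sum-exp trick that F.softmax applies internally, so nothing underflows even for very negative inputs:

```python
import math

def softmax_scaled(scores, scale=0.005):
    """Numerically stable softmax of the scaled scores -- the
    pure-Python analogue of F.softmax(seq_nll * scale, dim=-1)."""
    scaled = [s * scale for s in scores]
    m = max(scaled)                           # subtract the row max before exp
    exps = [math.exp(s - m) for s in scaled]  # all arguments are <= 0, no overflow
    z = sum(exps)
    return [e / z for e in exps]

row = [-94.5122, -50.0515, -76.2685]  # one row of seq_nll from the question
probs = softmax_scaled(row)
assert abs(sum(probs) - 1.0) < 1e-9
assert all(p > 0.0 for p in probs)    # no probability underflows to zero
```

In PyTorch this whole function is just torch.nn.functional.softmax(seq_nll * 0.005, dim=-1), applied row-wise to the (64, 3) tensor.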
