I am trying to train an RBM in a discriminative way. The forward pass of the net computes the log-conditional probabilities. The normalization I need to perform to obtain the probabilities, however, does not involve a softmax, so I cannot use F.log_softmax (see the DRBM paper, p(y|x), on page 2).
In the problem I'm trying to solve, it is possible to have zero probabilities. Let x be the binary input; then x[i] = 0 implies p(y=i|x) = 0.
This is usually not a problem when using F.nll_loss(logits, target).backward(), since the zero-probability class will never be a target. Nevertheless, it seems to be a problem when normalizing without a softmax.
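Here is an example that reproduces the error:

    import torch
    import torch.nn.functional as F

    w = torch.tensor([1., 2., 3.], requires_grad=True)
    x = torch.tensor([[1, 0, 1]])
    c = torch.tensor([0])  # target class 0 (consistent with the gradients printed below)
    logits = w * x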
If I could normalize using a softmax, I could use:

    l_mod = logits.clone()
    l_mod[l_mod == 0] = float("-inf")
    print(F.log_softmax(l_mod, dim=-1))
    # out: tensor([[-2.1269, -inf, -0.1269]], grad_fn=<LogSoftmaxBackward>)
    F.nll_loss(F.log_softmax(l_mod, dim=-1), c).backward()  # or F.cross_entropy(...)
    print(w.grad)
    # out: tensor([-0.8808, 0.0000, 0.8808])

If, instead, I normalize without using the softmax, I get:

    print(torch.log(logits/logits.sum(-1)))
    # out: tensor([[-1.3863, -inf, -0.2877]], grad_fn=<LogBackward>)
    # F.nll_loss negates its input itself, so the log-probabilities are passed as they are
    F.nll_loss(torch.log(logits/logits.sum(-1)), c).backward()
    print(w.grad)
    # out: tensor([nan, nan, nan])

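A minimal sketch of where the nan appears to come from, isolating the log step. It assumes the usual autograd rule that the backward of log multiplies the incoming gradient by 1/input. The names p and logp are illustrative only and do not appear in the example above.

```python
import torch

# Isolated version of the failing step: log of a probability vector with an exact zero.
p = torch.tensor([0.25, 0.0, 0.75], requires_grad=True)
logp = torch.log(p)  # values: -1.3863, -inf, -0.2877

# A loss that only reads the target entry (class 0 here), as nll_loss does.
# The upstream gradient for the other two entries is therefore exactly 0.
loss = -logp[0]
loss.backward()

# d(log p)/dp = 1/p, so the zero entry receives 0 * (1/0) = 0 * inf = nan.
print(p.grad)  # entries: -4, nan, 0
```

In the full example the normalizing sum depends on every entry of w, so this single nan is carried into all of w.grad.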
For the moment I have worked around the problem by either:
- Clamping the probabilities with (logits/logits.sum(-1)).clamp(min=1e-16) before taking the log.
- Taking the log only of the positive probabilities:
    probs = logits/logits.sum(-1)
    return torch.log(probs[probs>0])
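A minimal sketch of a shape-preserving variant of the second workaround. Indexing with probs[probs>0] drops entries, so F.nll_loss can no longer index by class. The torch.where masking used here is an assumption, not taken from the code above, as are the target class 0 and the names mask, safe and log_probs.

```python
import torch
import torch.nn.functional as F

w = torch.tensor([1., 2., 3.], requires_grad=True)
x = torch.tensor([[1., 0., 1.]])
c = torch.tensor([0])  # assumed target class

logits = w * x
probs = logits / logits.sum(-1, keepdim=True)

# Swap exact zeros for 1 before the log so that 1/p stays finite in the
# backward pass, then write -inf back for the zero-probability classes.
mask = probs > 0
safe = torch.where(mask, probs, torch.ones_like(probs))
log_probs = torch.where(mask, torch.log(safe), torch.full_like(probs, float("-inf")))

F.nll_loss(log_probs, c).backward()
print(w.grad)  # expected: approximately [-0.75, 0.0, 0.25]
```

The result matches the analytic gradient of -log(w[0] / (w[0] + w[2])), which is -1/w[0] + 1/(w[0] + w[2]) = -0.75 for w[0] and 1/(w[0] + w[2]) = 0.25 for w[2].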
However, my question remains: why do I get nan without such tricks? Are there better ways to prevent this problem?
Thank you in advance!