Hello everyone!
I am trying to train an RBM in a discriminative way. The forward pass of the net computes the log-conditional probabilities. The normalization I need to perform to get the probabilities, however, does not involve a softmax, so I cannot use F.log_softmax (see the DRBM paper, p(y|x), on page 2).
In the problem I’m trying to solve, it is possible to have zero probabilities. Let x be the binary input; then x[i] = 0 implies p(y=i|x) = 0.
This is usually not a problem when using F.nll_loss(logits, target).backward(), since a zero-probability class will never be a target. Nevertheless, it seems to be a problem when the log-probabilities are computed as torch.log(logits/logits.sum(-1)).
I report here an example to reproduce the error:
import torch
import torch.nn.functional as F
w = torch.tensor([1.,2.,3.], requires_grad=True)
x = torch.tensor([[1,0,1]])
c = torch.tensor([0])
logits = w*x
If I were normalizing with a softmax, I could use:
l_mod = logits.clone()
l_mod[l_mod==0] = float("-inf")
print(F.log_softmax(l_mod, dim=-1))
# out: tensor([[-2.1269, -inf, -0.1269]], grad_fn=<LogSoftmaxBackward>)
F.nll_loss(F.log_softmax(l_mod, dim=-1), c).backward() #or F.cross_entropy(...)
print(w.grad)
# out: tensor([-0.8808, 0.0000, 0.8808])
If, instead, I normalize without using the softmax, I get:
print(torch.log(logits/logits.sum(-1)))
#out: tensor([[-1.3863, -inf, -0.2877]], grad_fn=<LogBackward>)
F.nll_loss(torch.log(logits/logits.sum(-1)), c).backward()  # nll_loss expects log-probabilities and negates them itself
print(w.grad)
#out: tensor([nan, nan, nan])
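Reducing the example further, the nan seems to show up already in the backward pass of torch.log when an input is exactly 0, even if that entry does not contribute to the loss. This is only a minimal sketch: the tensor p and its values are made up for illustration and are not part of my model.

```python
import torch

# Hypothetical probabilities with one exact zero (illustration only).
p = torch.tensor([0.5, 0.0, 0.5], requires_grad=True)

log_p = torch.log(p)   # forward pass: the second entry is -inf
loss = -log_p[0]       # the loss only uses the first entry
loss.backward()

# The gradient reaching log_p[1] is 0, but the derivative of log at 0
# is infinite, so the product for that entry is not a finite number.
print(p.grad)
```

The first and third entries of p.grad come out finite, while the entry at the zero probability is nan.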
For the moment I solved the problem by either
- using (logits/logits.sum(-1)).clamp(min=1e-16) before taking the log, or
- taking the log only of positive probabilities: probs = logits/logits.sum(-1), then return torch.log(probs[probs>0])
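The two workarounds above can be sketched on the toy example as follows. This is a minimal sketch, not the actual model code. I assume float inputs and use keepdim=True so the normalization also works for batches larger than one. The second variant is a mask-based variation of "log only of positive probabilities" that keeps the tensor shape, so it can still be passed to F.nll_loss. Names such as safe_probs and log_probs_masked are made up for the example.

```python
import torch
import torch.nn.functional as F

w = torch.tensor([1., 2., 3.], requires_grad=True)
x = torch.tensor([[1., 0., 1.]])
c = torch.tensor([0])

logits = w * x
probs = logits / logits.sum(-1, keepdim=True)

# Workaround 1: clamp away exact zeros before taking the log.
log_probs_clamped = torch.log(probs.clamp(min=1e-16))
loss_clamped = F.nll_loss(log_probs_clamped, c)
grad_clamped, = torch.autograd.grad(loss_clamped, w, retain_graph=True)

# Workaround 2 (mask-based variation): take the log only where probs > 0.
# Zeros are replaced by 1 before the log, so log never sees a 0,
# and -inf is written back afterwards for the masked entries.
mask = probs > 0
safe_probs = torch.where(mask, probs, torch.ones_like(probs))
log_probs_masked = torch.where(
    mask, torch.log(safe_probs), torch.full_like(probs, float("-inf"))
)
loss_masked = F.nll_loss(log_probs_masked, c)
grad_masked, = torch.autograd.grad(loss_masked, w)

print(grad_clamped)
print(grad_masked)
```

In both cases the gradient with respect to w is finite, and the entry for the zero-probability class is 0 rather than nan.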
However, my question remains: why do I get nan without such tricks? Are there better ways to avoid this problem?
Thank you in advance!