Hi, I am working with a custom loss for a regression problem. My code looks like this:
import torch

class WeightedRMSELoss(torch.nn.Module):
    def __init__(self, Wtp=2, Wfp=1.5, Wtn=0.5, Ts=0.5):
        super(WeightedRMSELoss, self).__init__()
        self.Wtp = Wtp  # weight for true positives (target above threshold)
        self.Wfp = Wfp  # weight for false positives (prediction above threshold, target below)
        self.Wtn = Wtn  # weight for true negatives (prediction and target below threshold)
        self.Ts = Ts    # threshold separating low values from peaks
        self.criterion = torch.nn.MSELoss(reduction='none')

    def forward(self, x, y):
        x = x.reshape(-1)
        y = y.reshape(-1)
        below_threshold_mask = y <= self.Ts
        above_threshold_mask = ~below_threshold_mask
        # per-element weight depending on which region the prediction and target fall in
        t1 = torch.where(below_threshold_mask & (x <= self.Ts), self.Wtn, torch.tensor(0.0))
        t2 = torch.where(below_threshold_mask & (x > self.Ts), self.Wfp, torch.tensor(0.0))
        t3 = torch.where(above_threshold_mask, self.Wtp, torch.tensor(0.0))
        # element-wise root of the squared error, then weighted mean
        loss = torch.sqrt(self.criterion(x, y))
        weighted_losses = (t1 + t2 + t3) * loss
        sample_losses = weighted_losses.sum() / weighted_losses.shape[0]
        return sample_losses
However, a few epochs into training my loss becomes nan and stays nan. After looking into the function, it seems that on some elements t1, t2 and t3 are all zero, leading to a nan loss, but I don't understand why this is happening. Do you have any ideas? Thanks a lot!
I don’t think that zero values for t1, t2, and t3 will lead to your loss function returning nan.
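(As a quick check with made-up tensors, not your real data: if you force all three weights to zero, the forward pass just returns 0.0, not nan.)

import torch

# Toy check with made-up tensors: with all weights zero, t1, t2 and t3
# are all zero and the weighted loss comes out as 0.0, not nan.
zero_weighted = WeightedRMSELoss(Wtp=0.0, Wfp=0.0, Wtn=0.0)
x = torch.tensor([0.1, 0.7, 2.0])   # predictions
y = torch.tensor([0.2, 0.4, 1.5])   # targets
print(zero_weighted(x, y))          # tensor(0.), finite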
Check that the inputs to your loss function haven’t become nan. Depending on the details of your model and how you train, your training can diverge, causing model parameters to become nan and therefore the input to your loss function to become nan.
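A quick way to check for that (model here is just a placeholder name for your network):

# Report any parameter that has become nan or inf after an optimizer step.
for name, param in model.named_parameters():
    if not torch.isfinite(param).all():
        print(f"non-finite values in parameter {name}")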
Try optimizing with plain-vanilla SGD with a small learning rate to see if you can get non-divergent (even if slow) training.
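For example (the learning rate is just a guess to start from):

# Plain SGD, no momentum, deliberately small learning rate, just to test
# whether training stays finite even if it is slow.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)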
Also check that you aren’t passing any nans or infs as data into your model.
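A simple scan over your batches (train_loader is just a placeholder for your DataLoader) will catch that early:

# Fail fast if any batch contains nan or inf values.
for inputs, targets in train_loader:
    assert torch.isfinite(inputs).all(), "nan/inf found in inputs"
    assert torch.isfinite(targets).all(), "nan/inf found in targets"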
Hi Frank, first of all thank you very much for your reply! I think you are right about divergence: with SGD I managed to make it run for 12-13 epochs, but it still outputs nan after that. What I don't understand is why the loss seems to converge during those epochs (from 0.43 to 0.29 on validation) and then all of a sudden becomes nan. Also, using a simplified version of the loss like this:
Works just fine. I am working on a multi-output regression problem with lots of low values (between 0 and 0.5) and few high values (0.5 to 50). Since I am only interested in forecasting peaks, the idea behind my loss was to reduce the weight of “true negatives” and increase the weight of “true positives”, but it would be great to also include false negatives and false positives in the evaluation. Thanks again!
You can use with autograd.detect_anomaly(): to help track this down (if you care). This will raise an error as soon as the backward pass produces a nan. That gives a starting point from which to try to find the root cause of the nan.
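For example, wrapped around the training step (model, loss_fn, inputs, and targets are illustrative names):

# The first backward operation that produces nan will raise an error
# with a traceback pointing at the offending op.
with torch.autograd.detect_anomaly():
    output = model(inputs)
    loss = loss_fn(output, targets)
    loss.backward()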