Help debugging custom loss

Hi, I am working with a custom loss for a regression problem. My code looks like this:

class WeightedRMSELoss(torch.nn.Module):
    def __init__(self, Wtp=2, Wfp=1.5, Wtn=.5, Ts=0.5):
        super(WeightedRMSELoss, self).__init__()
        self.Wtp = Wtp
        self.Wfp = Wfp
        self.Wtn = Wtn
        self.Ts = Ts
        self.criterion = torch.nn.MSELoss(reduction='none')

    def forward(self, x, y):

        x = x.reshape(-1)
        y = y.reshape(-1)

        # y <= Ts marks "negative" targets; y > Ts marks "positive" (peak) targets
        below_threshold_mask = y <= self.Ts
        above_threshold_mask = ~below_threshold_mask
        # true negatives: target and prediction both at or below the threshold
        t1 = torch.where(below_threshold_mask & (x <= self.Ts), self.Wtn, torch.tensor(0.0))
        # false positives: target below the threshold, prediction above it
        t2 = torch.where(below_threshold_mask & (x > self.Ts), self.Wfp, torch.tensor(0.0))
        # true positives: target above the threshold
        t3 = torch.where(above_threshold_mask, self.Wtp, torch.tensor(0.0))
        # element-wise root of the squared error
        loss = torch.sqrt(self.criterion(x, y))
        weighted_losses = (t1 + t2 + t3) * loss
        return weighted_losses.mean()

However, a few epochs into training my loss starts to always be nan. After looking into the function, it seems that for some elements t1, t2, and t3 are all zero, leading to a nan loss, but I don't understand why this is happening. Do you have any ideas? Thanks a lot!

Hi Gustar!

I don’t think that zero values for t1, t2, and t3 will lead to your loss function
returning nan.

Check that the inputs to your loss function haven’t become nan. Depending
on the details of your model and how you train, your training can diverge,
causing model parameters to become nan and therefore the input to your
loss function to become nan.

Try optimizing with plain-vanilla SGD with a small learning rate to see if you
can get non-divergent (even if slow) training.

Also check that you aren’t passing any nans or infs as data into your model.
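Something along these lines (the model, data, and shapes here are just stand-ins for yours):

```python
import torch

# stand-in model and batch -- substitute your own model and data loader
model = torch.nn.Linear(8, 1)
batch = torch.randn(4, 8)
target = torch.randn(4, 1)

# check that no nans or infs reach the model
assert torch.isfinite(batch).all(), "nan/inf in input data"
assert torch.isfinite(target).all(), "nan/inf in targets"

# plain-vanilla SGD with a small learning rate
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

optimizer.zero_grad()
loss = torch.nn.functional.mse_loss(model(batch), target)
loss.backward()
optimizer.step()
```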

Good luck

K. Frank

Hi Frank, first of all thank you very much for your reply! I think you are right about divergence: with SGD I managed to train for 12 or 13 epochs, but it still outputs nan after that. What I don't understand is why the loss seems to converge during these epochs (from 0.43 to 0.29 on validation) and then all of a sudden becomes nan. Also, a simplified version of the loss like this:

class BinaryWeightedRMSELoss(torch.nn.Module):
    def __init__(self, Ts=0.4):
        super(BinaryWeightedRMSELoss, self).__init__()
        self.Ts = Ts

    def forward(self, x, y):
        y_pred = x.reshape(-1)
        y_true = y.reshape(-1)
        mse_loss = torch.nn.MSELoss(reduction='none')(y_pred, y_true)
        weight = torch.where(y_true < self.Ts, torch.tensor(0.8), torch.tensor(1.2))
        custom_weighted_loss = weight * mse_loss
        return torch.sqrt(torch.mean(custom_weighted_loss))

works just fine. I am working on a multi-output regression problem with lots of low values (between 0 and 0.5) and few high values (between 0.5 and 50). Since I am interested only in forecasting peaks, the idea behind my loss was to reduce the weight of "true negatives" and increase the weight of "true positives", but it would be great to include false negatives and false positives in the evaluation as well. Thanks again!
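Edit: to make the four cases concrete, here is a rough sketch of the weighting I have in mind (the threshold, weights, and example numbers are all just illustrative, not something I have trained with):

```python
import torch

Ts = 0.5
# illustrative weights for the four threshold cases
Wtp, Wfp, Wfn, Wtn = 2.0, 1.5, 1.5, 0.5

x = torch.tensor([0.2, 0.8, 0.3, 0.9])  # predictions
y = torch.tensor([0.1, 0.2, 0.7, 0.8])  # targets

pos = y > Ts                   # target above threshold ("peak")
pred_pos = x > Ts              # prediction above threshold
weight = torch.where(pos & pred_pos, Wtp,           # true positive
         torch.where(~pos & pred_pos, Wfp,          # false positive
         torch.where(pos & ~pred_pos, Wfn,          # false negative
                     torch.tensor(Wtn))))           # true negative
print(weight)  # per-element weights to multiply into the loss
```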


Hi Gustar!

You can use with torch.autograd.detect_anomaly(): to help track this down (if you
care). This will raise an error as soon as the backward pass produces a nan.
That gives a starting point from which to try to find the root cause of the nan.
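For example (the tensor here is contrived so that the forward pass produces a nan, which anomaly detection then flags during backward):

```python
import torch

# sqrt of a negative number yields nan in the forward pass
x = torch.tensor([-1.0], requires_grad=True)

caught = None
try:
    with torch.autograd.detect_anomaly():
        loss = torch.sqrt(x).sum()   # sqrt(-1) -> nan
        loss.backward()              # anomaly detection raises here
except RuntimeError as exc:
    caught = str(exc)

print(caught)
```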


K. Frank