Proper way to avoid divide by 0 in custom loss function

hankdikeman · May 15, 2021, 10:58pm

I have a custom loss function defined and I hit a wall debugging it. It is designed to return a loss that is scaled relative to the target value:

import torch
from torch import nn
import torch.nn.functional as F

class ConditionalMeanRelativeLoss(nn.Module):
    def __init__(self):
        super(ConditionalMeanRelativeLoss, self).__init__()
    
    def forward(self, output, target):
        # calculate absolute errors
        absolute_errors = torch.abs(torch.subtract(output, target))
        # where target is too small, use just the absolute errors to avoid divide by 0
        loss = torch.where(torch.abs(target) < 0.001, absolute_errors, torch.abs(torch.divide(absolute_errors, target)))
        # return mean loss
        return torch.mean(loss)

I was conscious that I might create a divide by 0 error, so I used torch.where to try to avoid it. This is the first custom loss function I have ever defined, and when I use it, it returns all nan values. I enabled torch's anomaly detection and saw this error:

/opt/miniconda3/envs/torch/lib/python3.8/site-packages/torch/autograd/__init__.py:145: UserWarning: Error detected in DivBackward0. Traceback of forward call that caused the error:
  File "HybridMethodConfig1.py", line 322, in <module>
    loss = train_model(deriv, derivtrainloader, DE_loss_fn, DE_optim, DEVICE)
  File "/Users/henrydikeman/github/CombustTorch/auto_ode/ModelUtilities.py", line 44, in train_model
    batch_loss = loss_fn(predictions, batch_results)
  File "/opt/miniconda3/envs/torch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/Users/henrydikeman/github/CombustTorch/auto_ode/CustomLossFunctions.py", line 17, in forward
    loss = torch.where(torch.abs(target) < 0.005, absolute_errors, torch.abs(torch.divide(absolute_errors, target)))
 (Triggered internally at  /Users/distiller/project/conda/conda-bld/pytorch_1614389903258/work/torch/csrc/autograd/python_anomaly_mode.cpp:104.)
  Variable._execution_engine.run_backward(
  0%|                                                                      | 0/1483 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "HybridMethodConfig1.py", line 322, in <module>
    loss = train_model(deriv, derivtrainloader, DE_loss_fn, DE_optim, DEVICE)
  File "/Users/henrydikeman/github/CombustTorch/auto_ode/ModelUtilities.py", line 47, in train_model
    batch_loss.backward()
  File "/opt/miniconda3/envs/torch/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/miniconda3/envs/torch/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Function 'DivBackward0' returned nan values in its 0th output.
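
For readers who want to reproduce a traceback like the one above, here is a minimal sketch of enabling anomaly detection with the `torch.autograd.detect_anomaly()` context manager. The one-element tensors are hypothetical stand-ins, not the data from the original training run, and the loss is written inline rather than as the module from the post.

```python
import torch

# Hypothetical one-element example: the target is exactly zero.
output = torch.tensor([0.5], requires_grad=True)
target = torch.tensor([0.0])

caught = None
# Inside detect_anomaly(), backward() raises as soon as a backward function
# returns nan, and reports the forward call that created that function.
with torch.autograd.detect_anomaly():
    abs_err = torch.abs(output - target)
    loss = torch.where(torch.abs(target) < 0.001, abs_err, abs_err / target).mean()
    try:
        loss.backward()
    except RuntimeError as err:
        caught = err

print(caught)
```

Anomaly detection slows training noticeably, so it is usually switched on only while hunting for the source of a nan.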

Before this I had been using the built-in MSE loss, so I just swapped out the function and treated it as a drop-in replacement. I was fairly sure that torch.where is reverse differentiable, but then again I am not totally sure. You can see I kind of went overboard with the torch operations trying to track down the issue.

I am 1000% sure my code worked before with MSE loss, so unless I need to treat this function differently from MSE loss, the rest of my code should be fine.

Edit: I tried removing the line with torch.where and it worked. So I guess I’m asking whether there is any way I can get this elementwise conditional logic to work.

KFrank · May 17, 2021, 1:40am

Hi Henry!

It looks like your issue is due to a troublesome bug in the innards of
autograd – not specific to torch.where(), but in lower-level infrastructure.

However, in your use case, you can work around it by clamping the
denominator of your potential divide-by-zero away from zero. Here
is an illustrative script that contains a modified version of your custom
loss function:

import torch
from torch import nn
import torch.nn.functional as F

print ('torch.__version__', torch.__version__)

torch.manual_seed (2021)

class ConditionalMeanRelativeLoss(nn.Module):
    def __init__(self):
        super(ConditionalMeanRelativeLoss, self).__init__()
    
    def forward(self, output, target):
        # calculate absolute errors
        absolute_errors = torch.abs(torch.subtract(output, target))
        # where target is too small, use just the absolute errors to avoid divide by 0
        loss = torch.where(torch.abs(target) < 0.001, absolute_errors, torch.abs(torch.divide(absolute_errors, target)))
        print ('pre-mean loss =', loss)
        # return mean loss
        return torch.mean(loss)
    

class ConditionalMeanRelativeLossB(nn.Module):
    def __init__(self):
        super(ConditionalMeanRelativeLossB, self).__init__()
    
    def forward(self, output, target):
        # calculate absolute errors
        absolute_errors = torch.abs(torch.subtract(output, target))
        # where target is too small, use just the absolute errors to avoid divide by 0
        # but clamp abs (target) away from zero to avoid "ghost" divide by 0
        abs_target = torch.abs (target).clamp (0.0005)
        loss = torch.where(abs_target < 0.001, absolute_errors, torch.divide(absolute_errors, abs_target))
        print ('pre-mean loss (B) =', loss)
        # return mean loss
        return torch.mean(loss)
    

outputA = torch.randn (5)
outputB = outputA.clone()
outputA.requires_grad = True
outputB.requires_grad = True
target = torch.randn (5)
target[2] = 0.0
target[3] = 0.0

print ('outputA =', outputA)
print ('outputB =', outputB)
print ('target =', target)

ConditionalMeanRelativeLoss() (outputA, target).backward()
print ('outputA.grad  =', outputA.grad)

ConditionalMeanRelativeLossB() (outputB, target).backward()
print ('outputB.grad  =', outputB.grad)

And here is its output:

torch.__version__ 1.7.1
outputA = tensor([ 2.2871,  0.6413, -0.8615, -0.3649, -0.6931], requires_grad=True)
outputB = tensor([ 2.2871,  0.6413, -0.8615, -0.3649, -0.6931], requires_grad=True)
target = tensor([ 0.9023, -2.7183,  0.0000,  0.0000,  0.4822])
pre-mean loss = tensor([1.5346, 1.2359, 0.8615, 0.3649, 2.4375], grad_fn=<SWhereBackward>)
outputA.grad  = tensor([ 0.2216,  0.0736,     nan,     nan, -0.4148])
pre-mean loss (B) = tensor([1.5346, 1.2359, 0.8615, 0.3649, 2.4375], grad_fn=<SWhereBackward>)
outputB.grad  = tensor([ 0.2216,  0.0736, -0.2000, -0.2000, -0.4148])
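
The nan entries in outputA.grad can be reproduced in isolation. The following is a minimal sketch of what appears to be happening, using hypothetical scalar values for a single element whose target is zero. The explanation in the comments is an interpretation of the observed behavior, not something the thread confirms.

```python
import torch

# Hypothetical scalar stand-ins for one element of the loss with target == 0.
abs_err = torch.tensor(0.8615, requires_grad=True)
target = torch.tensor(0.0)

ratio = abs_err / target  # forward value is inf, but where() never selects it
picked = torch.where(target.abs() < 0.001, abs_err, ratio)
picked.backward()

# where() passes an upstream gradient of 0 to the ratio branch. The backward
# of the division then computes 0 / target = 0 / 0 = nan, and that nan is
# added to the gradient of 1 coming from the selected branch.
print(picked)        # finite forward value
print(abs_err.grad)  # nan
```

Clamping the denominator, as in ConditionalMeanRelativeLossB, keeps the division finite, so the zero upstream gradient stays zero.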

As to the autograd bug: A cluster of github issues shows that this is a
known problem. I don’t understand the details, but some of the comments
suggest that this bug might be tricky to fix, and perhaps won’t get fixed.

But I think (probably in general, not just in your use case) that if you
understand what is going on, you can work around it.
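
One common alternative to clamping is to apply torch.where twice: once to build a denominator that is safe everywhere, and once to select the result. This is a different technique from the clamp used in the script above, and the function name and sample tensors below are hypothetical, chosen only for illustration.

```python
import torch

def conditional_mean_relative_loss(output, target, threshold=0.001):
    """Absolute error where |target| < threshold, relative error elsewhere."""
    absolute_errors = torch.abs(output - target)
    small = torch.abs(target) < threshold
    # Replace the small targets with 1 so the division is finite everywhere,
    # including at the positions that the final where() discards.
    safe_target = torch.where(small, torch.ones_like(target), target)
    relative_errors = absolute_errors / torch.abs(safe_target)
    return torch.where(small, absolute_errors, relative_errors).mean()

# Hypothetical inputs; the middle target is exactly zero.
output = torch.tensor([2.0, -1.0, 0.5], requires_grad=True)
target = torch.tensor([1.0, 0.0, -2.0])
loss = conditional_mean_relative_loss(output, target)
loss.backward()
print(loss)
print(output.grad)  # all entries are finite
```

The design choice is the same as with the clamp: the branch that is thrown away must still produce finite values and finite local derivatives, because its backward pass runs regardless of which branch is selected.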

Here are a few of the relevant github issues:

github.com/pytorch/pytorch

backprop through torch.where backprops nans through path that was not taken

opened 02:54PM - 20 Apr 20 UTC

closed 04:09PM - 20 Apr 20 UTC

jonasrauber

I think backprop through `torch.where` is wrong in certain special cases.

## To Reproduce

```python
t = 0
x = torch.ones(()).requires_grad_()
y = t * (x / t)  # just an example; anything that produces nan's works
z = torch.where(x >= t, x, y)
z.backward()

# the forward pass works fine (the `nan`'s in `y` do not affect z)
# NOTE: this is unlike a naive implement of where that does `cond * x + (1 - cond) * y`
print(z)  # tensor(1., grad_fn=<SWhereBackward>)

# but the backward pass backprops the `nan`'s from y into x, even though the y path is never taken in torch.where
print(x.grad)  # tensor(nan)
```

## Expected behavior

```python
print(x.grad)  # tensor(1.)
```

This would be the correct gradient. In practice, this bug can easily happen if one runs the above code for different t (including 0) and assumes that the `nan`'s for `t = 0` should not matter because the `torch.where` always selects the first path that has no `nan`s (and the forward pass does handle it correctly).

## Environment

- PyTorch Version 1.4

github.com/pytorch/pytorch

Incorrect gradients for torch.where when one of the target tensors contains inf/nan

opened 07:40PM - 25 Jul 19 UTC

closed 08:00PM - 25 Jul 19 UTC

egrefen

## 🐛 Bug

The `grad_fn` of `torch.where` returns the gradients of the wrong argument, rather than of the selected tensor, if the other tensor's gradients have infs or nans.

## To Reproduce

Run this code:

```python
x = torch.tensor([16., 0.], requires_grad=True)
y = x/2  # tensor([8., 0.], grad_fn=<DivBackward0>)
z = x.sqrt() + 1  # tensor([5., 1.], grad_fn=<SqrtBackward>)

# Calculate dy/dx, dz/dx
dydx = torch.autograd.grad(y.sum(), x, retain_graph=True)[0]  # tensor([0.5000, 0.5000])
dzdx = torch.autograd.grad(z.sum(), x, retain_graph=True)[0]  # tensor([0.1250, inf])

# Define w = [w0, w1] == [y0, z1]
w = torch.where(x == 0., y, z)  # tensor([5., 0.], grad_fn=<SWhereBackward>)

expected_dw_dx = torch.where(x == 0., dydx, dzdx)  # tensor([0.1250, 0.5000])
dwdx = torch.autograd.grad(w.sum(), x, retain_graph=True)[0]  # is actually tensor([0.1250, inf])

print("`torch.where` communicates gradients correctly:", torch.equal(expected_dw_dx, dwdx))
```

## Expected behavior

I would expect `expected_dw_dx == dwdx` in the example above.

## Environment

Please copy and paste the output from our [environment collection script](https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py) (or fill out the checklist below manually).

You can get the script and run it with:

```
wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
```

PyTorch version: 1.1.0
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 18.04.1 LTS
GCC version: (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0
CMake version: version 3.10.2

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 9.0.176
GPU models and configuration:
GPU 0: Quadro GP100
GPU 1: Quadro GP100
Nvidia driver version: 410.79
cuDNN version: Could not collect

Versions of relevant libraries:
[pip] numpy==1.16.4
[pip] torch==1.1.0
[pip] torchvision==0.3.0
[conda] torch 1.1.0 <pip>
[conda] torchvision 0.3.0 <pip>

github.com/pytorch/pytorch

Incorrect NaN gradient from distribution.Normal.log_prob when using subset

opened 07:29PM - 12 Dec 18 UTC

closed 10:47PM - 13 Dec 18 UTC

samedii

## 🐛 Bug

When a subset of a log_prob tensor is NaN then you can select the subset that is not NaN. This should then result in a finite gradient (in many cases) when doing backprop through the distribution. It does not

## To Reproduce

Steps to reproduce the behavior:

```
import numpy as np
import torch
import torch.nn as nn
import torch.distributions as dist

x = torch.tensor([1.0,2,3,np.nan])
y = torch.tensor([1.0,2,3,4])
k = nn.Parameter(0.01*torch.randn(1))
d = dist.Normal(loc=k*x, scale=1)
log_prob = d.log_prob(y)
print(log_prob)  # tensor([-1.4253, -2.9444, -5.4763, nan], grad_fn=<SubBackward>)
loss = -log_prob[:-1].mean()
print(loss)  # tensor(3.2820, grad_fn=<NegBackward>)
loss.backward()
print(k.grad)  # tensor([nan])
```

## Expected behavior

Should see a finite gradient.

## Environment

PyTorch version: 0.4.1
Is debug build: No
CUDA used to build PyTorch: 9.0

OS: Microsoft Windows 7 Enterprise
GCC version: Could not collect
CMake version: Could not collect

Python version: 3.6
Is CUDA available: No
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA

Versions of relevant libraries:
[pip] numpy (1.13.3)
[pip] numpydoc (0.7.0)
[conda] mkl 2018.0.0 h36b65af_4
[conda] mkl-service 1.1.2 py36h57e144c_4

Best.

K. Frank