Question about how to use the result of detect_anomaly

Hi everyone,
I'm having trouble with my custom loss: it starts producing NaN outputs after several epochs.
I found a similar topic and a possible solution:

I use it like this (GB is the name of my network, a DenseNet-169):

        for inputs, labels in dataloaders['train']:
            inputs = inputs.to(device)
            labels = labels.to(device).float()
            optimizers['GB'].zero_grad()
            with torch.set_grad_enabled(True):
                with torch.autograd.detect_anomaly():
                    outputs = GB(inputs)
                    loss = LOSS(outputs, labels)
#                    loss.register_hook(lambda grad : print(grad))
                    loss.backward()
Here is the traceback in detail; it's too long to read in full, so I summarize the relevant parts after it.

/opt/conda/conda-bld/pytorch_1565272271120/work/torch/csrc/autograd/python_anomaly_mode.cpp:57: UserWarning: Traceback of forward call that caused the error:
File "/home/koalasheep/anaconda3/envs/torch/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/koalasheep/anaconda3/envs/torch/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/koalasheep/anaconda3/envs/torch/lib/python3.7/site-packages/spyder_kernels/console/__main__.py", line 11, in <module>
start.main()
File "/home/koalasheep/anaconda3/envs/torch/lib/python3.7/site-packages/spyder_kernels/console/start.py", line 318, in main
kernel.start()
File "/home/koalasheep/anaconda3/envs/torch/lib/python3.7/site-packages/ipykernel/kernelapp.py", line 563, in start
self.io_loop.start()
File "/home/koalasheep/anaconda3/envs/torch/lib/python3.7/site-packages/tornado/platform/asyncio.py", line 148, in start
self.asyncio_loop.run_forever()
File "/home/koalasheep/anaconda3/envs/torch/lib/python3.7/asyncio/base_events.py", line 534, in run_forever
self._run_once()
File "/home/koalasheep/anaconda3/envs/torch/lib/python3.7/asyncio/base_events.py", line 1771, in _run_once
handle._run()
File "/home/koalasheep/anaconda3/envs/torch/lib/python3.7/asyncio/events.py", line 88, in _run
self._context.run(self._callback, *self._args)
File "/home/koalasheep/anaconda3/envs/torch/lib/python3.7/site-packages/tornado/ioloop.py", line 690, in <lambda>
lambda f: self._run_callback(functools.partial(callback, future))
File "/home/koalasheep/anaconda3/envs/torch/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
ret = callback()
File "/home/koalasheep/anaconda3/envs/torch/lib/python3.7/site-packages/tornado/gen.py", line 787, in inner
self.run()
File "/home/koalasheep/anaconda3/envs/torch/lib/python3.7/site-packages/tornado/gen.py", line 748, in run
yielded = self.gen.send(value)
File "/home/koalasheep/anaconda3/envs/torch/lib/python3.7/site-packages/ipykernel/kernelbase.py", line 365, in process_one
yield gen.maybe_future(dispatch(*args))
File "/home/koalasheep/anaconda3/envs/torch/lib/python3.7/site-packages/tornado/gen.py", line 209, in wrapper
yielded = next(result)
File "/home/koalasheep/anaconda3/envs/torch/lib/python3.7/site-packages/ipykernel/kernelbase.py", line 272, in dispatch_shell
yield gen.maybe_future(handler(stream, idents, msg))
File "/home/koalasheep/anaconda3/envs/torch/lib/python3.7/site-packages/tornado/gen.py", line 209, in wrapper
yielded = next(result)
File "/home/koalasheep/anaconda3/envs/torch/lib/python3.7/site-packages/ipykernel/kernelbase.py", line 542, in execute_request
user_expressions, allow_stdin,
File "/home/koalasheep/anaconda3/envs/torch/lib/python3.7/site-packages/tornado/gen.py", line 209, in wrapper
yielded = next(result)
File "/home/koalasheep/anaconda3/envs/torch/lib/python3.7/site-packages/ipykernel/ipkernel.py", line 294, in do_execute
res = shell.run_cell(code, store_history=store_history, silent=silent)
File "/home/koalasheep/anaconda3/envs/torch/lib/python3.7/site-packages/ipykernel/zmqshell.py", line 536, in run_cell
return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
File "/home/koalasheep/anaconda3/envs/torch/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 2855, in run_cell
raw_cell, store_history, silent, shell_futures)
File "/home/koalasheep/anaconda3/envs/torch/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 2881, in _run_cell
return runner(coro)
File "/home/koalasheep/anaconda3/envs/torch/lib/python3.7/site-packages/IPython/core/async_helpers.py", line 68, in _pseudo_sync_runner
coro.send(None)
File "/home/koalasheep/anaconda3/envs/torch/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3058, in run_cell_async
interactivity=interactivity, compiler=compiler, result=result)
File "/home/koalasheep/anaconda3/envs/torch/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3249, in run_ast_nodes
if (await self.run_code(code, result, async_=asy)):
File "/home/koalasheep/anaconda3/envs/torch/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3326, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 1, in <module>
runfile('/home/koalasheep/YMY/pro2/EFF_512.py', wdir='/home/koalasheep/YMY/pro2')
File "/home/koalasheep/anaconda3/envs/torch/lib/python3.7/site-packages/spyder_kernels/customize/spydercustomize.py", line 985, in runfile
exec_code(file_code, filename, namespace)
File "/home/koalasheep/anaconda3/envs/torch/lib/python3.7/site-packages/spyder_kernels/customize/spydercustomize.py", line 891, in exec_code
exec(compile(code, filename, 'exec'), namespace)
File "/home/koalasheep/YMY/pro2/EFF_512.py", line 260, in <module>
top(50, 12, 12, data_folder, 556, 512, CXR14_csv, train_val_list, test_list, device_)
File "/home/koalasheep/YMY/pro2/EFF_512.py", line 248, in top
train(GB, optimizers, NUM_EPOCH, dataloaders, device, dataset_sizes)
File "/home/koalasheep/YMY/pro2/EFF_512.py", line 97, in train
loss = LOSS(outputs, labels)
File "/home/koalasheep/anaconda3/envs/torch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/home/koalasheep/YMY/pro2/utils_pro2.py", line 394, in forward
temp_label = (1 - (targets[..., i] * inputs[..., i] + (1 - targets[..., i]) * (1 - inputs[..., i]))) ** gamma * temp

Traceback (most recent call last):

File "/home/koalasheep/YMY/pro2/EFF_512.py", line 260, in <module>
top(50, 12, 12, data_folder, 556, 512, CXR14_csv, train_val_list, test_list, device_)

File "/home/koalasheep/YMY/pro2/EFF_512.py", line 248, in top
train(GB, optimizers, NUM_EPOCH, dataloaders, device, dataset_sizes)

File "/home/koalasheep/YMY/pro2/EFF_512.py", line 99, in train
loss.backward()

File "/home/koalasheep/anaconda3/envs/torch/lib/python3.7/site-packages/torch/tensor.py", line 118, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)

File "/home/koalasheep/anaconda3/envs/torch/lib/python3.7/site-packages/torch/autograd/__init__.py", line 93, in backward
allow_unreachable=True) # allow_unreachable flag

RuntimeError: Function 'MulBackward0' returned nan values in its 1th output.

The traceback stops at loss.backward(), so I paste my custom loss here.

def weighted_BCE(sigmoid_x, targets):
    assert sigmoid_x.size() == targets.size()
    # count positive and negative samples (+1 to avoid division by zero)
    count_p = (targets == 1.0).sum() + 1
    count_n = (targets == 0.0).sum() + 1
    # BCE with the positive and negative terms weighted by the inverse counts
    loss = -((targets * sigmoid_x.log()) * (1 / count_p.float())) - (((1 - targets) * (1 - sigmoid_x).log()) * (1 / count_n.float()))
    return loss.mean()

class WBCE(nn.Module):  # this is my custom loss
    def __init__(self, weight = None, size_average = True):
        super(WBCE, self).__init__()

    def forward(self, inputs, targets, gamma = 0.5):
        assert inputs.size() == targets.size()
        L = inputs.size(1)
        loss = 0.0
        for i in range(L):
            # weighted BCE for class i
            temp = weighted_BCE(sigmoid_x = inputs[..., i], targets = targets[..., i])
            # modulating factor: down-weight classes the model already predicts well
            temp_label = (1 - (targets[..., i] * inputs[..., i] + (1 - targets[..., i]) * (1 - inputs[..., i]))) ** gamma * temp
# traceback stops here, at the calculation of temp_label.
            loss += temp_label
        loss = loss.mean()
        return loss
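
For reference, here is a minimal sketch (not part of my training script; the helper name report_nan is made up) of how I could hook the gradients of the intermediate tensors temp and temp_label to see which multiplication the MulBackward0 in the error refers to:

import torch

def report_nan(name):
    # gradient hook: print a message when a NaN gradient flows into this tensor
    def hook(grad):
        if torch.isnan(grad).any():
            print('NaN gradient flowing into', name)
    return hook

# Inside WBCE.forward, right after computing the two terms:
#     temp.register_hook(report_nan('temp'))
#     temp_label.register_hook(report_nan('temp_label'))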

Here is the error message; a NaN gradient seems to be the cause.

RuntimeError: Function 'MulBackward0' returned nan values in its 1th output.

But I don't know what to do next. My code works fine with nn.BCELoss(), and it still runs for several epochs with the custom loss before it outputs NaN.
It's my first time using a custom loss, so any advice would be helpful.

Based on the output message I would assume targets * sigmoid_x.log() creates the NaN values.
In particular, could you check the values of sigmoid_x.log()?
If you are passing zeros in sigmoid_x, the log() operation will result in an -inf output.
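
Something like this (just a quick sketch, reusing the sigmoid_x name from your weighted_BCE) could confirm whether the inputs saturate and whether the log terms blow up:

import torch

# check for exact 0s/1s in sigmoid_x and for inf in the two log terms
print((sigmoid_x == 0).any().item(), (sigmoid_x == 1).any().item())
print(torch.isinf(sigmoid_x.log()).any().item(),
      torch.isinf((1 - sigmoid_x).log()).any().item())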


Thank you for your advice.
Based on your suggestion, I made a change to check it.
Since sigmoid_x.log() only appears in the definition of weighted_BCE(), I commented out the temp_label part like this:

        for i in range(L):
            temp = weighted_BCE(sigmoid_x = inputs[..., i], targets = targets[..., i])
#            temp_label = (1 - (targets[..., i] * inputs[..., i] + (1 - targets[..., i]) * (1 - inputs[...,i]))) ** gamma * temp
#            loss += temp_label
            loss += temp

and all the rest stays the same.
Now the loss only computes weighted_BCE, without the temp_label term, and so far the network works well (around 7 epochs; I set 30 epochs, so I will report the final result later).
So I think sigmoid_x.log() may not be the reason, since the input is the output of nn.Sigmoid().