RuntimeError: Function 'torch::autograd::CopySlices' returned nan values in its 0th output

Hi everyone,
I am getting the error in the title after 10 epochs of training.

If I run my code without Anomaly Detection, I get NaNs in my data.

Running my code with Anomaly Detection enabled gives me the following output:

  File "/home/matteo/Code/PoseRefiner0/main.py", line 90, in <module>
    def main(_run, n_epochs, learning_rate, training_batch_size, perform_node_updates, optimizer_function, use_lr_scheduler,
  File "/home/matteo/.conda/envs/PoseRefiner0/lib/python3.8/site-packages/sacred/experiment.py", line 190, in automain
    self.run_commandline()
  File "/home/matteo/.conda/envs/PoseRefiner0/lib/python3.8/site-packages/sacred/experiment.py", line 312, in run_commandline
    return self.run(
  File "/home/matteo/.conda/envs/PoseRefiner0/lib/python3.8/site-packages/sacred/experiment.py", line 276, in run
    run()
  File "/home/matteo/.conda/envs/PoseRefiner0/lib/python3.8/site-packages/sacred/run.py", line 238, in __call__
    self.result = self.main_function(*args)
  File "/home/matteo/.conda/envs/PoseRefiner0/lib/python3.8/site-packages/sacred/config/captured_function.py", line 42, in captured_function
    result = wrapped(*args, **kwargs)
  File "/home/matteo/Code/PoseRefiner0/main.py", line 148, in main
    output = model(data)
  File "/home/matteo/.conda/envs/PoseRefiner0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/matteo/Code/PoseRefiner0/pose_refiner.py", line 41, in forward
    x, edge_attr = self.msg_passer(x=x,                  # NOTE: This is different on purpose.
  File "/home/matteo/.conda/envs/PoseRefiner0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/matteo/Code/PoseRefiner0/msg_passing.py", line 49, in forward
    updated_edge_attr = self.update_edges(x=x, edge_index=edge_index, edge_attr=edge_attr,
  File "/home/matteo/Code/PoseRefiner0/msg_passing.py", line 157, in update_edges
    updated_edge_attr[edge_id, :] = torch.mean(single_updates[ranges_for_averaging_EDGES[edge_id, ranges_for_averaging_EDGES[edge_id, :] != -1]], dim=0)
 (function print_stack)
ERROR - PoseRefiner-poses - Failed after 7:22:01!
Traceback (most recent calls WITHOUT Sacred internals):
  File "/home/matteo/Code/PoseRefiner0/main.py", line 158, in main
    total_loss.backward()
  File "/home/matteo/.conda/envs/PoseRefiner0/lib/python3.8/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/matteo/.conda/envs/PoseRefiner0/lib/python3.8/site-packages/torch/autograd/__init__.py", line 125, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Function 'torch::autograd::CopySlices' returned nan values in its 0th output.

It seems that the following line is the culprit: something goes wrong there during backpropagation:

updated_edge_attr[edge_id, :] = torch.mean(single_updates[ranges_for_averaging_EDGES[edge_id, ranges_for_averaging_EDGES[edge_id, :] != -1]], dim=0)
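A minimal sketch of why this line can poison the backward pass (using stand-in tensors, not my actual data): if the boolean mask leaves an empty index set for some `edge_id`, `torch.mean` over zero rows is 0/0 and returns NaN, and the in-place slice assignment (the `CopySlices` node in the error) then carries that NaN into the graph.

```python
import torch

# Stand-in for single_updates: 5 candidate updates of size 3.
single_updates = torch.randn(5, 3, requires_grad=True)

# An empty long-tensor index simulates "no valid entries survived the mask".
empty_idx = torch.tensor([], dtype=torch.long)

# Mean over an empty selection (shape (0, 3)) is 0/0 -> NaN in every component.
out = torch.mean(single_updates[empty_idx], dim=0)
print(out)  # all-NaN vector, with a grad_fn attached
```

So even though the forward pass may only show NaNs later, the bad value is created the first time an edge averages over nothing.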

My hypotheses on what could be going wrong:

  • I might be using incorrect values in my indices, possibly computing the average of an empty set; that would produce NaNs and make backprop fail.
  • I might be averaging very large values; summing the inputs could overflow to Inf, ruining the gradient.
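The first hypothesis is cheap to test before the averaging loop runs. A rough check, reusing the names from my snippet above (the tensor contents here are made up for illustration, with -1 as the padding value):

```python
import torch

# Hypothetical contents mirroring ranges_for_averaging_EDGES:
# each row lists indices into single_updates, padded with -1.
ranges_for_averaging_EDGES = torch.tensor([[0, 2, -1],
                                           [-1, -1, -1]])  # second edge selects nothing

for edge_id in range(ranges_for_averaging_EDGES.size(0)):
    row = ranges_for_averaging_EDGES[edge_id]
    valid = row[row != -1]
    if valid.numel() == 0:
        # torch.mean over this empty selection would return NaN.
        print(f"edge {edge_id}: empty index set -> mean would be NaN")
```

If this ever fires during training, the fix would be to skip such edges (or fall back to a default value) rather than averaging over an empty selection.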

Does anyone have a clue about what is happening, or about how I can debug the problem further?
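One debugging idea I am considering (a sketch, not my actual code): insert an explicit finiteness check right after the suspect tensors are computed, so training fails immediately with a named tensor instead of hours later inside `backward()`:

```python
import torch

def assert_finite(name, t):
    # Fail fast with the tensor's name and the first few offending indices,
    # instead of waiting for autograd's anomaly detection to trip in backward().
    if not torch.isfinite(t).all():
        bad = torch.nonzero(~torch.isfinite(t))[:5].tolist()
        raise RuntimeError(f"{name} contains NaN/Inf, e.g. at indices {bad}")

# Example: a tensor with a NaN triggers the check.
x = torch.tensor([1.0, float('nan'), 3.0])
try:
    assert_finite("updated_edge_attr", x)
except RuntimeError as e:
    print(e)
```

Calling `assert_finite("updated_edge_attr", updated_edge_attr)` inside `update_edges` (and similarly for `single_updates`) would tell me whether the NaN first appears in the forward pass (empty average / overflow) or only in the gradients.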

Thank you in advance!