Hi everyone,
I am getting the error in the title after 10 epochs of training.
If I run my code without anomaly detection, I get NaNs in my data.
Running my code with Anomaly Detection enabled gives me the following output:
File "/home/matteo/Code/PoseRefiner0/main.py", line 90, in <module>
def main(_run, n_epochs, learning_rate, training_batch_size, perform_node_updates, optimizer_function, use_lr_scheduler,
File "/home/matteo/.conda/envs/PoseRefiner0/lib/python3.8/site-packages/sacred/experiment.py", line 190, in automain
self.run_commandline()
File "/home/matteo/.conda/envs/PoseRefiner0/lib/python3.8/site-packages/sacred/experiment.py", line 312, in run_commandline
return self.run(
File "/home/matteo/.conda/envs/PoseRefiner0/lib/python3.8/site-packages/sacred/experiment.py", line 276, in run
run()
File "/home/matteo/.conda/envs/PoseRefiner0/lib/python3.8/site-packages/sacred/run.py", line 238, in __call__
self.result = self.main_function(*args)
File "/home/matteo/.conda/envs/PoseRefiner0/lib/python3.8/site-packages/sacred/config/captured_function.py", line 42, in captured_function
result = wrapped(*args, **kwargs)
File "/home/matteo/Code/PoseRefiner0/main.py", line 148, in main
output = model(data)
File "/home/matteo/.conda/envs/PoseRefiner0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/matteo/Code/PoseRefiner0/pose_refiner.py", line 41, in forward
x, edge_attr = self.msg_passer(x=x, # NOTE: This is different on purpose.
File "/home/matteo/.conda/envs/PoseRefiner0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/matteo/Code/PoseRefiner0/msg_passing.py", line 49, in forward
updated_edge_attr = self.update_edges(x=x, edge_index=edge_index, edge_attr=edge_attr,
File "/home/matteo/Code/PoseRefiner0/msg_passing.py", line 157, in update_edges
updated_edge_attr[edge_id, :] = torch.mean(single_updates[ranges_for_averaging_EDGES[edge_id, ranges_for_averaging_EDGES[edge_id, :] != -1]], dim=0)
(function print_stack)
ERROR - PoseRefiner-poses - Failed after 7:22:01!
Traceback (most recent calls WITHOUT Sacred internals):
File "/home/matteo/Code/PoseRefiner0/main.py", line 158, in main
total_loss.backward()
File "/home/matteo/.conda/envs/PoseRefiner0/lib/python3.8/site-packages/torch/tensor.py", line 185, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/matteo/.conda/envs/PoseRefiner0/lib/python3.8/site-packages/torch/autograd/__init__.py", line 125, in backward
Variable._execution_engine.run_backward(
RuntimeError: Function 'torch::autograd::CopySlices' returned nan values in its 0th output.
The following line seems to be the culprit; something goes wrong when backpropagating through it:
updated_edge_attr[edge_id, :] = torch.mean(single_updates[ranges_for_averaging_EDGES[edge_id, ranges_for_averaging_EDGES[edge_id, :] != -1]], dim=0)
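Split into steps, this is roughly what that statement computes for one edge. It is only a sketch: the helper name `edge_mean` and the `pad_value` parameter are made up for illustration, and I am assuming `single_updates` is a 2-D float tensor and each row of `ranges_for_averaging_EDGES` is a 1-D integer tensor padded with -1.

```python
import torch


def edge_mean(single_updates, ranges_row, pad_value=-1):
    """Average the rows of `single_updates` referenced by one row of indices.

    Hypothetical helper: `ranges_row` is assumed to be a 1-D integer tensor
    in which `pad_value` marks unused slots.
    """
    keep = ranges_row != pad_value          # boolean mask of the real indices
    selected_ids = ranges_row[keep]         # drop the padding entries
    selected = single_updates[selected_ids] # shape: (n_selected, n_features)
    # Return the count as well, so the caller can see how many rows were averaged.
    return selected.mean(dim=0), selected_ids.numel()
```

Returning the number of selected rows next to the mean makes it easy to log or assert on that count inside the loop over `edge_id`.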
My hypotheses on what could be going wrong:
- Some of my indices might be wrong, so that I end up averaging an empty set. The mean of an empty selection is NaN, which would then break backprop.
- I might be averaging very large values. The sum of the inputs could overflow, ruining the gradient.
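Both hypotheses could be checked before the mean is computed. A minimal sketch, with hypothetical helper names (`find_empty_rows`, `count_nonfinite`) and assuming that -1 marks the unused slots in `ranges_for_averaging_EDGES`:

```python
import torch


def find_empty_rows(ranges, pad_value=-1):
    """Return the indices of rows that contain only padding.

    Averaging over such a row means averaging zero elements, and
    torch.mean over zero rows returns NaN.
    """
    has_valid_index = (ranges != pad_value).any(dim=1)
    return torch.nonzero(~has_valid_index).flatten()


def count_nonfinite(tensor):
    """Count the entries of `tensor` that are NaN or +/-inf."""
    return int((~torch.isfinite(tensor)).sum().item())
```

Calling `find_empty_rows(ranges_for_averaging_EDGES)` once per batch and `count_nonfinite(single_updates)` right before the loop should show which of the two cases, if either, actually occurs.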
Does anyone have a clue about what is happening, or how I can debug the problem further?
Thank you in advance!