I enabled anomaly detection, and it pointed to torch.stack. I was using CUDA and mixed-precision training at the same time, and I suspect that combination was related. I tried to reproduce the error in a minimal example, but I couldn't trigger it again. However, in the larger project, I fixed the issue by restructuring my code so that neighs is set to all_neighbors[[indexes of local neighbors]] instead.
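For anyone hitting the same error: below is a minimal sketch (with hypothetical tensor names) of the failure mode and of an out-of-place fix in the spirit of the restructuring described above. The first part modifies, in place, a tensor whose value autograd saved for backward, which bumps its version counter and triggers the same "modified by an inplace operation" RuntimeError; the second part builds neighs by advanced indexing, which creates a new tensor instead of writing into one the graph already depends on.

```python
import torch

# Failure mode: sigmoid saves its output for backward, so editing that
# output in place invalidates the saved tensor's version counter.
x = torch.randn(3, requires_grad=True)
y = x.sigmoid()                      # y is saved by SigmoidBackward
s = torch.stack([y, y]).sum()
y.add_(1.0)                          # in-place edit bumps y's version
caught = False
try:
    s.backward()
except RuntimeError:
    caught = True                    # "...modified by an inplace operation"

# Out-of-place alternative: select the local neighbors by indexing,
# which returns a fresh tensor, so nothing saved for backward is mutated.
all_neighbors = torch.randn(10, 5, requires_grad=True)
local_idx = torch.tensor([1, 4, 7])          # hypothetical local indices
neighs = all_neighbors[local_idx]            # new tensor, no in-place write
out = torch.stack(list(neighs)).sum(dim=0)   # same stack-then-sum pattern
out.sum().backward()                         # backward now succeeds
```

This is only an illustration of the error class, not the LIGN code itself; the actual fix in my project was the all_neighbors indexing described above.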
This is the full output:
[W ..\torch\csrc\autograd\python_anomaly_mode.cpp:60] Warning: Error detected in StackBackward. Traceback of forward call that caused the error:
  File ".\cora.py", line 127, in <module>
    sub_graph_size=SUBGRPAH_SIZE)
  File "C:\Users\josue\anaconda3\envs\LIGN\lib\site-packages\lign\train.py", line 126, in superv
    out = base(full_graph, inp) if is_base_gcn else base(inp)
  File "C:\Users\josue\anaconda3\envs\LIGN\lib\site-packages\torch\nn\modules\module.py", line 726, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\josue\anaconda3\envs\LIGN\lib\site-packages\lign\models\CORA.py", line 20, in forward
    x = F.relu(self.unit3(g, x))
  File "C:\Users\josue\anaconda3\envs\LIGN\lib\site-packages\torch\nn\modules\module.py", line 726, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\josue\anaconda3\envs\LIGN\lib\site-packages\lign\nn.py", line 19, in forward
    g.push(func = self.aggregation, data = "__hidden__")
    out = func(out)
  File "C:\Users\josue\anaconda3\envs\LIGN\lib\site-packages\lign\utils\functions.py", line 71, in sum_tensors
    return th.stack(neighs).sum(dim = 0)
 (function print_stack)
Traceback (most recent call last):
  File ".\cora.py", line 127, in <module>
    sub_graph_size=SUBGRPAH_SIZE)
  File "C:\Users\josue\anaconda3\envs\LIGN\lib\site-packages\lign\train.py", line 133, in superv
    scaler.scale(loss).backward()
  File "C:\Users\josue\anaconda3\envs\LIGN\lib\site-packages\torch\tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "C:\Users\josue\anaconda3\envs\LIGN\lib\site-packages\torch\autograd\__init__.py", line 127, in backward
    allow_unreachable=True) # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.HalfTensor [50]], which is output 0 of AsStridedBackward, is at version 2256; expected version 2255 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
If you still want to see the full project, it is on GitHub (in an older commit now). To reproduce the error, run performance/cora.py while inside the performance directory.