Hi!
I’m running a script that always fails at iteration 30 of the second epoch, which has nothing special about it, with the following CUDA error:
  File "/home/script.py", line 86, in forward
    print("X before mask: ", x, flush=True)
  File "/home/envs/myenv/lib/python3.8/site-packages/torch/_tensor.py", line 203, in __repr__
    return torch._tensor_str._str(self)
  File "/home/envs/myenv/lib/python3.8/site-packages/torch/_tensor_str.py", line 406, in _str
    return _str_intern(self)
  File "/home/envs/myenv/lib/python3.8/site-packages/torch/_tensor_str.py", line 381, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/home/envs/myenv/lib/python3.8/site-packages/torch/_tensor_str.py", line 242, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "/home/envs/myenv/lib/python3.8/site-packages/torch/_tensor_str.py", line 90, in __init__
    nonzero_finite_vals = torch.masked_select(tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0))
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
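For reference, the message's suggestion can be applied from inside the script instead of on the command line. This is only a minimal sketch: it assumes the variable is set at the very top of the entry script, before anything initializes CUDA.

```python
import os

# Make CUDA kernel launches synchronous so the Python stack trace points
# at the operation that actually failed. This has to be set before the
# first CUDA call; setting it before importing torch is the safest option.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported after the variable is set, on purpose
```

With synchronous launches the run is slower, so this is probably only worth enabling while debugging.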
It always happens at exactly the same point in the computation. To debug it, I tried printing the tensor that raises the error, but I cannot even print it, although I can print its shape. If I enable torch.autograd.set_detect_anomaly, I also get:
/opt/conda/conda-bld/pytorch_1623448278899/work/aten/src/ATen/native/cuda/DistributionTemplates.h:592: operator(): block: [118,0,0], thread: [449,0,0] Assertion `0 <= p4 && p4 <= 1` failed.
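The assertion `0 <= p4 && p4 <= 1` looks like a range check on a probability inside a sampling kernel (for example torch.bernoulli), so one thing to try is validating the probability tensor just before any sampling call. The sketch below is only an assumption about where the problem might be: `check_probabilities` and `probs` are hypothetical names, not part of my script or of PyTorch.

```python
import torch


def check_probabilities(p: torch.Tensor, name: str = "p") -> None:
    """Raise a readable error if p has NaN/inf or values outside [0, 1]."""
    # Copy to the CPU so the check itself cannot trigger a device-side assert.
    p_cpu = p.detach().float().cpu()
    n_nan = int(torch.isnan(p_cpu).sum())
    n_inf = int(torch.isinf(p_cpu).sum())
    # Range check only on finite entries; NaN and inf are counted separately.
    finite = p_cpu[torch.isfinite(p_cpu)]
    n_out = int(((finite < 0) | (finite > 1)).sum())
    if n_nan or n_inf or n_out:
        raise ValueError(
            f"{name}: {n_nan} NaN, {n_inf} inf, {n_out} values outside [0, 1]"
        )
```

It would be called right before the sampling, for instance `check_probabilities(probs, "mask probs")` followed by `torch.bernoulli(probs)`. If NaNs show up, the real cause is probably earlier in the forward pass or in the previous optimizer step.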
Any ideas? I’m completely lost and don’t know what it could be. There must be an error somewhere that I’m not able to find.
Thanks!