I have just upgraded to the latest version, 1.4. The code now runs smoothly in a for loop and then suddenly shows this error again. Sometimes it happens after 10 batches, sometimes after 400. The behaviour is completely unpredictable, and everything else is held fixed.
What I have is a vanilla VAE that works smoothly and gives good results. I then added the output of the function above to the ELBO loss (according to the 2019 paper). That is the only change I made, and that is when I started getting the CUDA error.
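A minimal sketch of that kind of change, written in plain Python rather than PyTorch so it runs without a GPU. It assumes a diagonal Gaussian posterior with a standard normal prior. The names `gaussian_kl`, `loss_with_extra_term`, `extra_term` and `weight` are hypothetical: `extra_term` stands in for the output of the function above, and none of this is the actual code from the model.

```python
import math


def gaussian_kl(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, 1) ), summed over latent dimensions."""
    return -0.5 * sum(
        1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar)
    )


def loss_with_extra_term(recon_loss, mu, logvar, extra_term, weight=1.0):
    """Negative ELBO (reconstruction + KL) plus one additional penalty.

    extra_term is a placeholder for whatever the added function returns;
    weight is a hypothetical scaling factor for it.
    """
    return recon_loss + gaussian_kl(mu, logvar) + weight * extra_term


# With mu = 0 and logvar = 0 the posterior equals the prior, so the KL is 0
# and the loss is just the reconstruction loss plus the extra term.
example_loss = loss_with_extra_term(1.0, [0.0, 0.0], [0.0, 0.0], extra_term=0.5)
```

The point of the sketch is only that the extra term is added to an otherwise unchanged loss, so any new GPU work comes from computing that term and its gradient.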
Once it happens, the entire GPU becomes inaccessible: nothing can be moved onto it, and nothing already on it can be read. Here is the error I get when I ask for a tensor stored on the GPU after the failure:
Traceback (most recent call last):
File "C:\Users\s4551072.conda\envs\gpuenv\lib\site-packages\IPython\core\formatters.py", line 224, in catch_format_error
r = method(self, *args, **kwargs)
File "C:\Users\s4551072.conda\envs\gpuenv\lib\site-packages\IPython\core\formatters.py", line 702, in __call__
printer.pretty(obj)
File "C:\Users\s4551072.conda\envs\gpuenv\lib\site-packages\IPython\lib\pretty.py", line 402, in pretty
return _repr_pprint(obj, self, cycle)
File "C:\Users\s4551072.conda\envs\gpuenv\lib\site-packages\IPython\lib\pretty.py", line 697, in _repr_pprint
output = repr(obj)
File "C:\Users\s4551072.conda\envs\gpuenv\lib\site-packages\torch\tensor.py", line 159, in __repr__
return torch._tensor_str._str(self)
File "C:\Users\s4551072.conda\envs\gpuenv\lib\site-packages\torch\_tensor_str.py", line 311, in _str
tensor_str = _tensor_str(self, indent)
File "C:\Users\s4551072.conda\envs\gpuenv\lib\site-packages\torch\_tensor_str.py", line 209, in _tensor_str
formatter = _Formatter(get_summarized_data(self) if summarize else self)
File "C:\Users\s4551072.conda\envs\gpuenv\lib\site-packages\torch\_tensor_str.py", line 242, in get_summarized_data
return torch.stack([get_summarized_data(x) for x in (start + end)])
File "C:\Users\s4551072.conda\envs\gpuenv\lib\site-packages\torch\_tensor_str.py", line 242, in <listcomp>
return torch.stack([get_summarized_data(x) for x in (start + end)])
File "C:\Users\s4551072.conda\envs\gpuenv\lib\site-packages\torch\_tensor_str.py", line 244, in get_summarized_data
return torch.stack([get_summarized_data(x) for x in self])
File "C:\Users\s4551072.conda\envs\gpuenv\lib\site-packages\torch\_tensor_str.py", line 244, in <listcomp>
return torch.stack([get_summarized_data(x) for x in self])
File "C:\Users\s4551072.conda\envs\gpuenv\lib\site-packages\torch\_tensor_str.py", line 242, in get_summarized_data
return torch.stack([get_summarized_data(x) for x in (start + end)])
File "C:\Users\s4551072.conda\envs\gpuenv\lib\site-packages\torch\_tensor_str.py", line 242, in <listcomp>
return torch.stack([get_summarized_data(x) for x in (start + end)])
File "C:\Users\s4551072.conda\envs\gpuenv\lib\site-packages\torch\_tensor_str.py", line 235, in get_summarized_data
return torch.cat((self[:PRINT_OPTS.edgeitems], self[-PRINT_OPTS.edgeitems:]))
RuntimeError: cuda runtime error (700) : an illegal memory access was encountered at …\aten\src\THC\THCCachingHostAllocator.cpp:278
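Because CUDA kernels are launched asynchronously, the Python frame in a traceback like this one is often only where the error was first noticed, not where it was caused. Here the frames are all in tensor printing, which fits the fact that the GPU was already in a failed state. A minimal sketch for narrowing it down, assuming the script can be restarted: set `CUDA_LAUNCH_BLOCKING=1` before CUDA is initialised so that kernel launches become synchronous and the failing operation is reported at its own call site. The helper name `enable_sync_cuda_errors` is hypothetical.

```python
import os
import sys


def enable_sync_cuda_errors():
    """Request synchronous CUDA kernel launches for debugging.

    The variable is read when the CUDA runtime starts up, so setting it
    before `import torch` is the safe choice. Returns True if torch had
    not been imported yet at the time of the call.
    """
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return "torch" not in sys.modules


# Call this at the very top of the training script, before importing torch.
set_before_torch_import = enable_sync_cuda_errors()
```

Synchronous launches slow training down, so this is meant for a debugging run only. The same effect can be had by setting the variable in the shell before starting Python.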