With torch.backends.cudnn.enabled = False set, I am getting the following error:
File "/opt/conda/envs/_/lib/python3.9/site-packages/torch/_jit_internal.py", line 422, in fn
    return if_false(*args, **kwargs)
File "/opt/conda/envs/_/lib/python3.9/site-packages/torch/nn/functional.py", line 720, in _max_pool2d
    return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
RuntimeError: CUDA error: an illegal memory access was encountered
Without that setting (i.e. with cuDNN enabled), this is the error I get instead:
File "/opt/conda/envs/_/lib/python3.9/site-packages/torch/nn/functional.py", line 2283, in batch_norm
    return torch.batch_norm(
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_SUPPORTED. This error may appear if you passed in a non-contiguous input.
As far as I can tell, the error occurs when the input to torch.max_pool2d exceeds a certain size (e.g. [16, 64, 1200, 1712] still works, while [16, 64, 1184, 1824] already throws the error). I can work around it by reducing the size per image or by reducing the batch size (the first dimension, i.e. 16).
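For reference, a quick back-of-the-envelope check of the two shapes (pure arithmetic, no GPU needed; this is a sketch that only counts the float32 input activation, ignoring the pooling output and any intermediate buffers):

```python
# Estimate the memory footprint of the input tensors from the post.
# Assumes float32 (4 bytes per element); the shapes are the ones quoted above.

def tensor_bytes(shape, bytes_per_elem=4):
    """Bytes needed for a dense tensor of the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_elem

ok = tensor_bytes((16, 64, 1200, 1712))    # shape that still works
bad = tensor_bytes((16, 64, 1184, 1824))   # shape that fails

print(f"working input: {ok / 2**30:.2f} GiB")   # ~7.84 GiB
print(f"failing input: {bad / 2**30:.2f} GiB")  # ~8.24 GiB
```

So the failing shape is indeed the larger one, by roughly 0.4 GiB of input activation alone, which would be consistent with crossing a memory threshold.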
It is reliably reproducible: I could not pinpoint the exact threshold, since the input size varies due to augmentations, but when I make the initial image size large enough, the error always occurs at the first batch.
From that I assume it is an OOM issue, but I am not sure whether that is correct, or whether it is a bug, or whether there is some config that should be changed. What confuses me is that I also get regular OOM errors in other places, and this one looks different; I can't tell whether it differs only in the error handling or in the actual underlying error.
Can this type of error actually point to an OOM issue, or is it something else?
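One thing that may help narrow it down: CUDA errors are reported asynchronously, so the traceback above may blame an op that ran after the one that actually faulted. Re-running with the standard CUDA_LAUNCH_BLOCKING environment variable makes kernel launches synchronous, so the Python stack trace points at the real culprit (a debugging step, not a fix; the script name below is just a placeholder):

```shell
# Force synchronous CUDA kernel launches so errors surface at the
# faulting call rather than at a later, unrelated op.
export CUDA_LAUNCH_BLOCKING=1
# then run the training script as usual, e.g.: python train.py
```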