With torch.backends.cudnn.enabled = False set, I am getting the following error:
File "/opt/conda/envs/_/lib/python3.9/site-packages/torch/_jit_internal.py", line 422, in fn
    return if_false(*args, **kwargs)
File "/opt/conda/envs/_/lib/python3.9/site-packages/torch/nn/functional.py", line 720, in _max_pool2d
    return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
RuntimeError: CUDA error: an illegal memory access was encountered
Without that setting (i.e. with cuDNN enabled), this is the error I get instead:
File "/opt/conda/envs/_/lib/python3.9/site-packages/torch/nn/functional.py", line 2283, in batch_norm
    return torch.batch_norm(
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_SUPPORTED. This error may appear if you passed in a non-contiguous input.
As far as I can tell, the error occurs when the input to torch.max_pool2d exceeds a certain size (e.g. [16, 64, 1200, 1712] still works, while [16, 64, 1184, 1824] already throws the error). I can work around it by reducing the size per image or by reducing the batch size (the first dimension, i.e. 16).
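For reference, a quick back-of-the-envelope check of the two shapes (pure arithmetic, no GPU needed; this is a sketch that only counts the float32 input activation, ignoring the pooling output and any intermediate buffers):

```python
# Estimate the memory footprint of the input tensors from the post.
# Assumes float32 (4 bytes per element); the shapes are the ones quoted above.

def tensor_bytes(shape, bytes_per_elem=4):
    """Bytes needed for a dense tensor of the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_elem

ok = tensor_bytes((16, 64, 1200, 1712))    # shape that still works
bad = tensor_bytes((16, 64, 1184, 1824))   # shape that fails

print(f"working input: {ok / 2**30:.2f} GiB")   # ~7.84 GiB
print(f"failing input: {bad / 2**30:.2f} GiB")  # ~8.24 GiB
```

So the failing shape is indeed the larger one, by roughly 0.4 GiB of input activation alone, which would be consistent with crossing a memory threshold.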
It is reliably reproducible: I could not pinpoint the exact threshold, since the input size varies due to augmentations, but when I make the initial image size large enough, the error always occurs at the first batch.
From that I assume it is an OOM issue, but I am not sure whether that is correct, or whether it is a bug, or whether there is some config that should be changed. What confuses me is that I also get regular OOM errors in other places, and this one looks different; I can't tell whether it differs only in the error handling or in the actual underlying error.
Can this type of error actually point to an OOM issue, or is it something else?
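One thing that may help narrow it down: CUDA errors are reported asynchronously, so the traceback above may blame an op that ran after the one that actually faulted. Re-running with the standard CUDA_LAUNCH_BLOCKING environment variable makes kernel launches synchronous, so the Python stack trace points at the real culprit (a debugging step, not a fix; the script name below is just a placeholder):

```shell
# Force synchronous CUDA kernel launches so errors surface at the
# faulting call rather than at a later, unrelated op.
export CUDA_LAUNCH_BLOCKING=1
# then run the training script as usual, e.g.: python train.py
```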