With torch.backends.cudnn.enabled = False set, I am getting the following error:
File "/opt/conda/envs/_/lib/python3.9/site-packages/torch/_jit_internal.py", line 422, in fn
    return if_false(*args, **kwargs)
File "/opt/conda/envs/_/lib/python3.9/site-packages/torch/nn/functional.py", line 720, in _max_pool2d
    return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
RuntimeError: CUDA error: an illegal memory access was encountered
Without that setting (i.e. with cuDNN enabled), this is the error I get instead:
File "/opt/conda/envs/_/lib/python3.9/site-packages/torch/nn/functional.py", line 2283, in batch_norm
    return torch.batch_norm(
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_SUPPORTED. This error may appear if you passed in a non-contiguous input.
As far as I can tell, the error occurs when the input to torch.max_pool2d exceeds a certain size (e.g. [16, 64, 1200, 1712] still works, while [16, 64, 1184, 1824] already throws the error). I can work around it by reducing the size per image or by reducing the batch size (the first dimension, i.e. 16).
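For reference, a quick back-of-the-envelope check of the two shapes (pure arithmetic, no GPU needed; this is a sketch that only counts the float32 input activation, ignoring the pooling output and any intermediate buffers):

```python
# Estimate the memory footprint of the input tensors from the post.
# Assumes float32 (4 bytes per element); the shapes are the ones quoted above.

def tensor_bytes(shape, bytes_per_elem=4):
    """Bytes needed for a dense tensor of the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_elem

ok = tensor_bytes((16, 64, 1200, 1712))    # shape that still works
bad = tensor_bytes((16, 64, 1184, 1824))   # shape that fails

print(f"working input: {ok / 2**30:.2f} GiB")   # ~7.84 GiB
print(f"failing input: {bad / 2**30:.2f} GiB")  # ~8.24 GiB
```

So the failing shape is indeed the larger one, by roughly 0.4 GiB of input activation alone, which would be consistent with crossing a memory threshold.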
It is reliably reproducible: I could not pinpoint the exact threshold, since the input size varies due to augmentations, but when I make the initial image size large enough, the error always occurs at the first batch.
From that I assume it is an OOM issue, but I am not sure whether that is correct, or whether it is a bug, or whether there is some config that should be changed. What confuses me is that I also get regular OOM errors in other places, and this one looks different; I can't tell whether it differs only in the error handling or in the actual underlying error.
Can this type of error actually point to an OOM issue, or is it something else?
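One thing that may help narrow it down: CUDA errors are reported asynchronously, so the traceback above may blame an op that ran after the one that actually faulted. Re-running with the standard CUDA_LAUNCH_BLOCKING environment variable makes kernel launches synchronous, so the Python stack trace points at the real culprit (a debugging step, not a fix; the script name below is just a placeholder):

```shell
# Force synchronous CUDA kernel launches so errors surface at the
# faulting call rather than at a later, unrelated op.
export CUDA_LAUNCH_BLOCKING=1
# then run the training script as usual, e.g.: python train.py
```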