cuDNN error: CUDNN_STATUS_NOT_SUPPORTED. This error may appear if you passed in a non-contiguous input

I used the CRNN code and added Horovod to do distributed training.
When the dataset is about 1000w images, the batch size is 128, and I use 6 V100s, everything is OK.
But when I add more data, bringing it to about 1900w images with the same batch size of 128, this error happens when the network forwards through the conv layer.
I noticed somebody suggested decreasing the batch size, so I changed it from 128 to 64, and the error became a CUDA OOM. I then reduced the batch size further to 32, but the OOM error still occurs.
I'm confused. Can someone help me? Thanks
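(Regarding the "non-contiguous input" hint in the error message: a contiguity check before the conv layer would look like the sketch below, where `x` just stands in for the input batch.)

```python
# x is a placeholder for the input batch passed to the conv layer
if not x.is_contiguous():
    x = x.contiguous()  # returns a contiguous copy if needed
```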

As you've already mentioned, the cuDNN error might be raised if you are in fact running out of memory.
Try to reduce the batch size until the use case fits in your GPU memory and let us know if you still encounter a cuDNN error.
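A quick way to see how much memory a given batch size needs is to print the allocator stats after a forward/backward pass, e.g. (a minimal sketch):

```python
import torch

# run one forward/backward pass with the trial batch size first, then:
print(torch.cuda.memory_allocated() / 1024**2, "MiB currently allocated")
print(torch.cuda.max_memory_allocated() / 1024**2, "MiB peak allocated")
```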


My env is built from the docker image pytorch/pytorch:1.2-cuda10.0-cudnn7-devel.
I noticed https://github.com/pytorch/pytorch/issues/32395,
so I tried changing the PyTorch version from 1.2 to 1.3, but the error still exists.

What puzzles me more is that in my first experiment a batch size of 128 worked, but after adding data it doesn't, even when I changed the batch size from 128 to 32.

It seems that your second experiment uses larger images, which have a width of 1900 pixels instead of the 1000 from the first experiment.
If that's the case and I'm not misunderstanding the information, it's expected that this use case uses more memory.

I would recommend using the latest stable release (1.5.1) or building from master to get the latest fixes.
1.2 and 1.3 are both old by now. :wink:

I am so sorry for my unclear description. The dataset size "1000w" means ten million images, not the width in pixels, sorry.
I will try the latest PyTorch docker image.
I just tried a smaller batch size, and there was no OOM error, but the cuDNN error still exists. :pensive:

The cuDNN error could still mask the OOM issue.
You could try to disable cuDNN via torch.backends.cudnn.enabled = False and check the lower batch size again. You could also try the benchmark mode via torch.backends.cudnn.benchmark = True or the deterministic mode via torch.backends.cudnn.deterministic = True.
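These flags are set globally before the forward pass, e.g. (a minimal sketch):

```python
import torch

# Disable cuDNN completely to check whether the error comes from a cuDNN kernel
torch.backends.cudnn.enabled = False

# Or keep cuDNN enabled and try these modes one at a time:
# torch.backends.cudnn.benchmark = True      # let cuDNN select the fastest algorithm
# torch.backends.cudnn.deterministic = True  # restrict cuDNN to deterministic algorithms
```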

If the code runs fine without cuDNN and just raises the NOT_SUPPORTED issue, could you post your model and all necessary input shapes, so that we can try to debug this issue?
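A minimal snippet along these lines would already help (the conv configuration and input shape below are placeholders, not your actual CRNN values):

```python
import torch
import torch.nn as nn

# Placeholder layer and shape -- please fill in your real CRNN conv config and input size
model = nn.Conv2d(in_channels=1, out_channels=64, kernel_size=3, padding=1).cuda()
x = torch.randn(32, 1, 32, 160, device='cuda')  # (batch, channels, height, width)

out = model(x)
print(out.shape)
```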


I will try it.
Thanks a lot

I encountered the same issue, which I suppose is not an out-of-memory problem but a data problem. I have not been able to solve it so far.