CUDNN_STATUS_NOT_SUPPORTED when using large samples

Hi all,
I'm trying to run a deep, purely 3D-convolutional network on my data.
I use a very heavy batch, e.g. batch size 100, and my model's parameters take ~37 GB.
I have two high-end graphics cards available: a Quadro RTX 8000 and a Quadro GV100, with 48 GB and 32 GB of memory respectively.
DataParallel crashed every time I tried it, and any batch above 10 samples (about 4 GB) crashed even on a single one of these cards, even though there was clearly enough free GPU RAM.
After a month of trying, I solved it by adding torch.backends.cudnn.enabled = False at the beginning of the script.
Apparently cuDNN can't handle very large inputs properly, but running on plain CUDA kernels works perfectly!
It even allowed me to use a mix of Quadro and Titan RTX cards, which means it worked across different GPUs with varying RAM sizes!
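In case it helps, here is a minimal sketch of the workaround (the model and input shapes below are just placeholders, not my real network):

    import torch
    import torch.nn as nn

    # Disabling cuDNN makes PyTorch fall back to its native CUDA
    # convolution kernels, which avoided CUDNN_STATUS_NOT_SUPPORTED for me.
    torch.backends.cudnn.enabled = False

    # Placeholder 3D conv model standing in for the actual network.
    model = nn.Sequential(
        nn.Conv3d(1, 16, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv3d(16, 1, kernel_size=3, padding=1),
    ).cuda()

    # Optionally split the batch across both cards.
    model = nn.DataParallel(model, device_ids=[0, 1])

    x = torch.randn(100, 1, 64, 64, 64, device="cuda")  # large batch
    out = model(x)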

I would recommend that any user who hits this error add
torch.backends.cudnn.enabled = False
to their script and re-run; I think it will solve the problem most of the time!
Dear PyTorch team, please look into this.

Your issue might be related to this one.
We are currently working on a fix to enable bigger input/output sizes.
Sorry for the delay.

Thanks @ptrblck for responding. It does look similar, though my batches have ~2^25 elements, which is less than 2^31.
Is plain CUDA really that much slower than cuDNN? By how much?
If the difference is small enough, maybe this workaround should be posted somewhere more noticeable in the meantime.
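For anyone who wants to measure the gap on their own setup, a rough timing sketch like this should do it (the layer and input shapes are arbitrary, just for illustration; the difference will depend heavily on shapes and dtype):

    import time
    import torch
    import torch.nn as nn

    def bench(use_cudnn, iters=20):
        # Toggle cuDNN and time a single Conv3d forward pass.
        torch.backends.cudnn.enabled = use_cudnn
        conv = nn.Conv3d(8, 8, kernel_size=3, padding=1).cuda()
        x = torch.randn(4, 8, 64, 64, 64, device="cuda")
        for _ in range(3):  # warm-up so algorithm selection isn't timed
            conv(x)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            conv(x)
        torch.cuda.synchronize()  # wait for all kernels before stopping the clock
        return (time.time() - start) / iters

    print("cuDNN enabled :", bench(True))
    print("cuDNN disabled:", bench(False))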