CUDNN_STATUS_NOT_SUPPORTED when using large samples

Hi all,
I’m trying to run a deep, purely 3D convolutional network on my data.
I use a very heavy batch, e.g. batch size 100, and my model parameters are ~37 GB.
There are 2 high-end graphics cards available to me: a Quadro RTX 8000 and a Quadro GV100, which have 48 GB and 32 GB respectively.
DataParallel crashed whenever I tried it, and any batch above 10 samples (about 4 GB) crashed even on a single one of these cards (and clearly there is enough space in the GPU’s RAM).
I solved it after a month of trying by adding ‘torch.backends.cudnn.enabled = False’ at the beginning of the code.
Apparently cuDNN can’t handle very large data properly, but using CUDA directly works perfectly!
It allowed me to use a mixture of Quadro and Titan RTX cards, which means it worked on different GPUs with varying RAM sizes!!

I would recommend that anyone who gets this error add ‘torch.backends.cudnn.enabled = False’
to their script and re-run; I think it will solve the problem most of the time!
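
For anyone who wants to see where the flag goes, here is a minimal sketch. The toy `Conv3d` layer and the input shape are made up for illustration and are not the model from this post; the only line that matters is the flag, which has to run before the first convolution.

```python
import torch
import torch.nn as nn

# The switch described in this post: with cuDNN disabled, convolutions fall
# back to PyTorch's native (non-cuDNN) kernels for the rest of the process.
torch.backends.cudnn.enabled = False

# Hypothetical toy layer and input, only to show that the model code itself
# does not need to change. Falls back to the CPU if no GPU is present.
device = "cuda" if torch.cuda.is_available() else "cpu"
conv = nn.Conv3d(in_channels=1, out_channels=4, kernel_size=3, padding=1).to(device)
x = torch.randn(2, 1, 8, 8, 8, device=device)  # (batch, channels, D, H, W)
y = conv(x)

print(tuple(y.shape), "cuDNN enabled:", torch.backends.cudnn.enabled)
```

The flag is global, so it affects every layer in the script, not only the one that triggered the error.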
Dear PyTorch support team, please look into this.

Your issue might be related to this one.
We are currently working on a fix to enable bigger input/output sizes.
Sorry for the delay.

Thanks @ptrblck for responding. It is similar, though my batches have ~2^25 elements, which is less than 2^31.
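
A rough way to compare a tensor against the 2^31 figure is to multiply out its shape. The sketch below uses made-up shapes (not my real data) and hypothetical helper names, and it assumes the relevant limit is simply the total element count. Since the reply above mentions input/output sizes, the second shape shows how an output with many channels can cross the line even when the input does not.

```python
import math

ELEMENT_LIMIT = 2**31  # the 2^31 figure quoted in this thread (assumed to be a total element count)

def element_count(shape):
    """Total number of elements in a tensor of the given shape."""
    return math.prod(shape)

def over_limit(shape, limit=ELEMENT_LIMIT):
    """True if a tensor of this shape has at least `limit` elements."""
    return element_count(shape) >= limit

# Hypothetical shapes in (batch, channels, D, H, W) order.
input_shape = (100, 1, 64, 64, 64)
output_shape = (100, 128, 64, 64, 64)  # same volume after a 128-channel layer

for name, shape in [("input", input_shape), ("output", output_shape)]:
    n = element_count(shape)
    print(f"{name}: {n} elements (~2^{math.log2(n):.1f}), over limit: {over_limit(shape)}")
```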
Is CUDA really much slower than cuDNN? By how much?
Because if the difference is small enough, maybe this workaround should be posted somewhere more noticeable in the meantime.
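
One way to get a number for a specific setup is to time the same layer with the flag on and off. This is only a sketch under assumptions: the helper name `time_conv3d`, the layer, and the tiny input are invented for illustration, and a real comparison would need the actual model, realistic input sizes, and more repeats. On a machine without a GPU it still runs, but the flag makes no difference there.

```python
import time
import torch
import torch.nn as nn

def time_conv3d(cudnn_enabled, device, repeats=5):
    """Average seconds per forward pass of a toy Conv3d with the cuDNN flag set as requested."""
    previous = torch.backends.cudnn.enabled
    torch.backends.cudnn.enabled = cudnn_enabled
    try:
        conv = nn.Conv3d(1, 8, kernel_size=3, padding=1).to(device)
        x = torch.randn(2, 1, 16, 16, 16, device=device)
        with torch.no_grad():
            conv(x)  # warm-up pass, not timed
            if device == "cuda":
                torch.cuda.synchronize()  # GPU work is asynchronous; wait before starting the clock
            start = time.perf_counter()
            for _ in range(repeats):
                conv(x)
            if device == "cuda":
                torch.cuda.synchronize()  # wait for the timed passes to finish
            elapsed = time.perf_counter() - start
        return elapsed / repeats
    finally:
        # Restore whatever the flag was before the measurement.
        torch.backends.cudnn.enabled = previous

device = "cuda" if torch.cuda.is_available() else "cpu"
with_cudnn = time_conv3d(True, device)
without_cudnn = time_conv3d(False, device)
print(f"device={device}  cuDNN on: {with_cudnn:.6f} s  cuDNN off: {without_cudnn:.6f} s")
```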