Hi all,
I’m trying to run a deep, purely 3D-convolutional network on my data.
I use a very large batch, e.g. a batch size of 100, and my model parameters come to ~37 GB.
There are two high-end graphics cards available to me: a Quadro RTX 8000 and a Quadro GV100, which have 48 GB and 32 GB of memory respectively.
DataParallel crashed whenever I tried it, and any batch above 10 samples (about 4 GB) crashed even on a single one of these cards, although there is clearly enough free GPU memory.
I solved it after a month of trying by adding ‘torch.backends.cudnn.enabled=False’ at the beginning of the code.
Apparently cuDNN can’t handle very large inputs properly, but using CUDA directly works perfectly!
It also allowed me to use a mixture of Quadro and Titan RTX cards, which means it worked across different GPUs with varying memory sizes!
I would recommend that anyone who gets this error add
torch.backends.cudnn.enabled=False
to their script and re-run. I think it will solve the problem most of the time!
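For anyone unsure where the line goes, here is a minimal sketch of the placement. The tiny two-layer model and the random batch are hypothetical placeholders rather than the real network or data, and the DataParallel wrapping is only applied when more than one GPU is visible (it falls back to CPU if there is none).

```python
import torch
import torch.nn as nn

# Disable cuDNN before any model is built or run, as described above.
torch.backends.cudnn.enabled = False

# Hypothetical stand-in for the real network: a tiny 3D-convolutional model.
model = nn.Sequential(
    nn.Conv3d(1, 4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv3d(4, 1, kernel_size=3, padding=1),
)

# Use a GPU if one is available, otherwise run on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Wrap in DataParallel only when more than one GPU is visible.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

# Illustrative batch with layout (batch, channels, depth, height, width).
batch = torch.randn(2, 1, 8, 8, 8, device=device)
with torch.no_grad():
    out = model(batch)
print(out.shape)
```

Disabling cuDNN globally makes PyTorch fall back to its own convolution kernels, which may be slower, so it is worth checking the speed on a small batch first.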
Dear PyTorch support team, please look into this.