GPU memory usage shoots up when using CUDA 11.3

We are using an NER model that was trained with torch 1.6; it took around 1.9 GB of GPU memory. We have moved from a Tesla V100 to an A100, and since the A100 only runs with CUDA >= 11.0, we were forced to install the latest PyTorch build with CUDA 11.3 support. After doing this, the memory needed for our model increased from 1.9 GB to 3.2 GB.

Could anyone shed some light on why there is such a huge difference and on what can be done to reduce it? Would retraining the model with torch 1.10 reduce the memory usage?

I am attaching stats on the different versions and their memory usage. Thanks for your help.

I don’t know how you’ve measured the GPU memory usage and also don’t know if you are facing any issues (e.g. if you had to lower the batch size or change the training in any way).
Some expected reasons for a different memory usage:

  • the CUDA context size is expected to increase due to the additional SMs on A100 compared to V100
  • besides that, the libraries themselves (CUDA, cuDNN, etc.) have also increased in size
  • If you are checking the peak memory, note that cuDNN may pick kernels with a larger workspace on A100 as long as they don't cause OOM errors and give a speedup (especially if cudnn.benchmark=True is used). You can limit the cuDNN workspace via the CUDNN_CONV_WSCAP_DBG environment variable; a short measurement sketch follows this list.
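
To see whether a larger cuDNN workspace is contributing, one option is to compare the peak allocations with and without a workspace cap, e.g. running the same script once normally and once with something like `CUDNN_CONV_WSCAP_DBG=256` set in the environment before launch (the value should be the cap in MB, if I'm reading the cuDNN docs right; double-check for your version). A minimal measurement sketch, with the model and input as placeholders for your NER model and a representative batch:

```python
import torch

# Placeholder model/input -- substitute the real NER model and a representative batch.
model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(8, 1024, device="cuda")

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    model(x)
torch.cuda.synchronize()

# Peak memory requested for tensors/workspaces by PyTorch (excludes the CUDA context).
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 1024**2:.1f} MB")
# Peak memory held by PyTorch's caching allocator.
print(f"peak reserved:  {torch.cuda.max_memory_reserved() / 1024**2:.1f} MB")
```

If these peaks barely change between the torch 1.6 and 1.10 builds, the extra usage is most likely the CUDA context and libraries rather than anything the model itself allocates.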

@ptrblck

Thanks for the reply. I measured how much GPU memory was occupied after loading the model. I don't face any issues other than the difference in memory consumption.
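
If that number comes from nvidia-smi after loading the model, it includes the CUDA context and loaded libraries on top of the model's own tensors. A rough way to separate the two, assuming the model is already on the GPU (the model below is just a placeholder for the real NER model):

```python
import subprocess
import torch

# Placeholder: assume the NER model has just been loaded onto the GPU here.
model = torch.nn.Linear(1024, 1024).cuda()
torch.cuda.synchronize()

# Memory occupied by PyTorch tensors (parameters/buffers) only.
print(f"allocated by tensors:  {torch.cuda.memory_allocated() / 1024**2:.0f} MB")
# Memory the caching allocator has reserved from the driver.
print(f"reserved by allocator: {torch.cuda.memory_reserved() / 1024**2:.0f} MB")

# Total per-GPU usage as the driver sees it; the gap vs. 'reserved' is mostly the
# CUDA context plus libraries, which grew between the CUDA 10.x and 11.3 builds.
print(subprocess.run(
    ["nvidia-smi", "--query-gpu=memory.used", "--format=csv"],
    capture_output=True, text=True,
).stdout)
```

If the "allocated by tensors" value is roughly the same on both setups while the nvidia-smi total grew, retraining with torch 1.10 is unlikely to help, since the increase would be coming from the context/libraries rather than from the model weights.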