GPU memory allocation error and volatility

Hi PyTorch community. I was wondering if you could clarify a few things about GPU memory allocation. I have a particular model, and I've noticed the following during training. The GPU memory allocation is less than what the same model allocates under TensorFlow, yet PyTorch keeps giving me an out-of-memory error. For instance, the max GPU memory allocation for TensorFlow is 10769MiB, while for PyTorch it is 10011MiB. Any reason for that?

Besides that, the memory usage during training keeps fluctuating, from as low as 45% up to as high as 95%, while for the same TensorFlow model it is fixed and does not fluctuate.
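One thing worth checking is what PyTorch itself has allocated, as opposed to what `nvidia-smi` reports (which includes the caching allocator's reserved pool). A minimal sketch of logging this during training; the helper names here are my own, only the `torch.cuda` calls are the actual API:

```python
import torch

def to_mib(num_bytes):
    # Convert a byte count to MiB for readability.
    return num_bytes / (1024 ** 2)

def log_gpu_memory(tag=""):
    # torch.cuda.memory_allocated(): bytes currently held by tensors.
    # torch.cuda.memory_reserved(): bytes held by the caching allocator,
    # which is closer to what nvidia-smi shows.
    # torch.cuda.max_memory_allocated(): peak tensor allocation so far.
    if not torch.cuda.is_available():
        print(f"{tag}: no CUDA device available")
        return
    allocated = to_mib(torch.cuda.memory_allocated())
    reserved = to_mib(torch.cuda.memory_reserved())
    peak = to_mib(torch.cuda.max_memory_allocated())
    print(f"{tag}: allocated={allocated:.0f}MiB "
          f"reserved={reserved:.0f}MiB peak={peak:.0f}MiB")

log_gpu_memory("after forward pass")
```

Because PyTorch caches freed blocks instead of returning them to the driver, the number `nvidia-smi` shows can sit well above `memory_allocated()` even while the actual tensor usage fluctuates.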

Another thing is that during training, the GPU volatility (the "Volatile GPU-Util" column in nvidia-smi) of the PyTorch model fluctuates a lot, dropping as low as 0%, sometimes for a minute at a time, and then going back up to 95%, while for the same model in TensorFlow the GPU utilization is constantly at 100%.

Could someone shed some light on this please?

Thank you!

I had a similar issue and spent several days figuring it out.
In my case, I had turned on torch.backends.cudnn.benchmark, which makes cuDNN re-benchmark its algorithms for every new input shape it sees, and that was also causing the memory to fluctuate.
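For reference, the flag lives under `torch.backends` (note the plural). A quick sketch of turning it off:

```python
import torch

# With benchmark=True, cuDNN profiles several convolution algorithms
# for each new input shape it encounters; with variable input sizes,
# this repeated benchmarking can cause fluctuating memory usage.
torch.backends.cudnn.benchmark = False

# Optionally also trade some speed for deterministic behavior.
torch.backends.cudnn.deterministic = True

print(torch.backends.cudnn.benchmark)       # current flag values
print(torch.backends.cudnn.deterministic)
```

If your input shapes are fixed from batch to batch, leaving `benchmark = True` is usually a net win, since the benchmarking cost is paid only once per shape.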

Hope this will help.
