I noticed that training a ResNet-50 with AMP on my new laptop with an RTX 3070 takes much more GPU memory than without AMP. It is not a code issue, because running the same code on a workstation with an NVIDIA Tesla P100 gives the opposite result.
Laptop:
Ubuntu 20.04 with Kernel 5.8 (and 5.11)
GPU: RTX 3070
Latest NVIDIA drivers installed through Ubuntu's Software & Updates GUI (CUDA 11.2)
The command numba -s reports no errors in the CUDA install, and neither does nvidia-smi.
Training without AMP: 3.9 GB VRAM
Training with AMP: 7.4 GB VRAM
GPU memory consumption is stable during training
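For reference, the training step looks roughly like this. This is only a sketch: a tiny nn.Linear stands in for the ResNet-50, and the memory readout at the end is how I compare the two runs.

```python
import torch
import torch.nn as nn

def amp_train_step(model, data, target, optimizer, scaler, device_type):
    """One training step with automatic mixed precision (AMP)."""
    optimizer.zero_grad(set_to_none=True)
    # Forward pass under autocast: eligible ops run in FP16 on CUDA.
    with torch.autocast(device_type=device_type, enabled=scaler.is_enabled()):
        loss = nn.functional.cross_entropy(model(data), target)
    # Scale the loss to avoid FP16 gradient underflow, then step and rescale.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"
model = nn.Linear(8, 4).to(device)  # stand-in for the ResNet-50
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)  # no-op when CUDA is absent
x = torch.randn(2, 8, device=device)
y = torch.randint(0, 4, (2,), device=device)
loss = amp_train_step(model, x, y, optimizer, scaler, device)

if use_cuda:
    # Peak memory actually occupied by tensors (not the cached pool).
    print(f"peak allocated: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```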
I would suggest double-checking your PyTorch version on the laptop (by printing torch.__version__) to make sure that you have a very fresh release/nightly build. It is surprisingly easy to have conda trick you into using an old version.
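For example (torch.version.cuda shows which CUDA toolkit the binaries were built with — an RTX 3070, compute capability 8.6, needs a CUDA 11.x build):

```python
import torch

# PyTorch build and the CUDA toolkit it was compiled against.
print(torch.__version__)
print(torch.version.cuda)  # None on CPU-only builds
if torch.cuda.is_available():
    # (8, 6) expected for an RTX 3070 (Ampere).
    print(torch.cuda.get_device_capability(0))
```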
As explained by @ptrblck here, there was a bug not too long ago:
About the memory: with AMP's automatic casting, for very small networks the memory cost of the tensors that exist twice (the FP32 original and the FP16 copy) can outweigh the benefit. (This assumes you are using PyTorch's own memory counters; if you use nvidia-smi, you might also be seeing caching-allocator effects.)
Thank you again @tom, I now understand the memory difference between my laptop and the workstation.
Using memory_allocated(), both my laptop and the workstation allocate 438 MB of GPU memory to tensors with AMP, and 435 MB without AMP.
But using memory_reserved(), my laptop reserves 6.4 GB with AMP and 2.9 GB without AMP, while the workstation reserves 1.63 GB with AMP and 2.79 GB without AMP.
So AMP reduces PyTorch's memory caching on the NVIDIA P100 (Pascal architecture) but increases it on the RTX 3070 mobile (Ampere architecture). I was expecting AMP to decrease the memory allocated/reserved, or at least leave it unchanged, not increase it, especially since I read in another thread that FP32 and FP16 tensors are not duplicated in GPU memory.
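For anyone who wants to compare on their own hardware, this is how I query the two counters. The names are the real torch.cuda API; memory_summary() additionally breaks down the allocator's cached pool.

```python
import torch

# memory_allocated() counts bytes occupied by live tensors;
# memory_reserved() counts the pool the caching allocator holds on to,
# which is roughly what nvidia-smi reports for the process.
if torch.cuda.is_available():
    print(f"allocated: {torch.cuda.memory_allocated() / 2**20:.1f} MiB")
    print(f"reserved:  {torch.cuda.memory_reserved() / 2**20:.1f} MiB")
    print(torch.cuda.memory_summary(abbreviated=True))
else:
    print("CUDA not available")
```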
I also tried training a ResNet-152 (224*224 images, batch size 32). AMP definitely slows down training on my laptop:
The binaries are missing the cutlass kernels in the statically linked cuDNN, as described here, so you would have to build PyTorch from source and check the performance again.
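You can check which cuDNN your binaries link against before and after the source build. As a side note, enabling cudnn.benchmark often helps with fixed-shape workloads like 224x224 / batch 32, though that is a general tip on my part, not a fix for the missing-kernel issue:

```python
import torch

# cuDNN version compiled into the binaries (None if built without cuDNN);
# compare this before and after building from source.
print("cuDNN version:", torch.backends.cudnn.version())

# With fixed input shapes, let cuDNN benchmark its algorithms once per
# shape and cache the fastest one for subsequent iterations.
torch.backends.cudnn.benchmark = True
```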