AMP uses more GPU memory and slows training

Hi,

I noticed that training a ResNet-50 with AMP on my new laptop with an RTX 3070 takes much more GPU memory than without AMP. It is not a code issue, because running the same code on a workstation with an NVIDIA Tesla P100 gives the opposite result.

Laptop:
Ubuntu 20.04 with Kernel 5.8 (and 5.11)
GPU: RTX 3070
Latest NVIDIA drivers installed via the Ubuntu Software & Updates GUI (CUDA 11.2)
The command numba -s reports no errors regarding the CUDA install; same for nvidia-smi.
Training without AMP: 3.9 GB VRAM
Training with AMP: 7.4 GB VRAM
GPU memory consumption is stable during training

I installed PyTorch with:
conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia
(Also tried the PyTorch 1.9 preview build)

Workstation:
GPU: NVIDIA Tesla P100
CUDA 10.2
Training without AMP: 3.5 GB VRAM
Training with AMP: 2.5 GB VRAM
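
For reference, both machines run the same script, which follows roughly the standard torch.cuda.amp recipe. A minimal sketch of the training step (the model/optimizer/data here are placeholders, not my actual pipeline):

    import torch
    import torchvision

    device = torch.device("cuda")
    model = torchvision.models.resnet50(num_classes=10).to(device)  # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    criterion = torch.nn.CrossEntropyLoss()
    scaler = torch.cuda.amp.GradScaler()  # only used for the AMP runs

    def train_step(images, labels, use_amp=True):
        optimizer.zero_grad()
        # autocast runs the forward pass in mixed precision when enabled
        with torch.cuda.amp.autocast(enabled=use_amp):
            outputs = model(images)
            loss = criterion(outputs, labels)
        if use_amp:
            # scale the loss to avoid FP16 gradient underflow
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        else:
            loss.backward()
            optimizer.step()
        return loss.item()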

Do you have any idea why I have this strange behavior on my laptop?

Thanks in advance :slight_smile:

I would suggest double-checking your PyTorch version on the laptop (by printing torch.__version__) to make sure that you got a very fresh release/nightly build. It seems really easy to have conda trick you into using an old version.
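
For example, something like this in each environment:

    import torch
    print(torch.__version__)               # the release/nightly actually in use
    print(torch.version.cuda)              # CUDA version the binaries were built against
    print(torch.backends.cudnn.version())  # bundled cuDNN version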

As explained by @ptrblck here, there was a bug not too long ago:

Best regards

Thomas

Hi @tom, thank you for your suggestion. Printing torch.__version__ confirms I have 1.8.1, and 1.9.0.dev20210527 in another env.

Best regards

Julien

Thank you for double-checking. The other thread I’d recommend is

Again, @ptrblck's a hero for looking at this.

About the memory: with AMP's automatic casting, some tensors exist twice (the FP32 original and its FP16 copy), so for very small networks the memory cost of that duplication might be larger than the savings. (This assumes you use PyTorch's memory counting; if you use nvidia-smi, you might also be seeing caching-allocator effects.)
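
A quick way to separate the two is to look at PyTorch's own counters around a few training iterations, e.g. (sketch):

    import torch

    torch.cuda.reset_peak_memory_stats()
    # ... run a few training iterations here ...
    print(torch.cuda.memory_allocated() / 1024**2, "MB currently allocated to tensors")
    print(torch.cuda.memory_reserved() / 1024**2, "MB reserved by the caching allocator")
    print(torch.cuda.max_memory_allocated() / 1024**2, "MB peak allocated")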

Best regards

Thomas

Thank you again @tom, that helped me pin down the memory difference between my laptop and workstation.

Using torch.cuda.memory_allocated(), both my laptop and workstation allocate 438 MB of GPU memory to tensors with AMP, and 435 MB without AMP.
But using torch.cuda.memory_reserved(), my laptop reserves 6.4 GB with AMP and 2.9 GB without AMP, while the workstation reserves 1.63 GB with AMP and 2.79 GB without AMP.

So AMP reduces PyTorch's memory caching on the NVIDIA P100 (Pascal architecture) but increases it on the RTX 3070 mobile (Ampere architecture). I was expecting AMP to decrease the allocated/reserved memory, or at least keep it the same, not increase it, since I read in another thread that FP32 and FP16 tensors are not all duplicated in GPU memory.

I also tried training a ResNet-152 (224×224 images, batch size 32). AMP definitely slows down training on my laptop:

Laptop (minutes/epoch):
FP32: 2:02, TF32: 2:03, AMP: 2:20

Workstation (minutes/epoch):
FP32: 1:49, TF32: 1:46, AMP: 1:45
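
(For context, by TF32 I mean the Ampere TF32 path toggled via the standard PyTorch backend flags, roughly:)

    import torch

    # TF32 run: let Ampere use TF32 tensor cores for FP32 matmuls/convolutions
    torch.backends.cuda.matmul.allow_tf32 = True   # enabled by default in these releases
    torch.backends.cudnn.allow_tf32 = True

    # "pure" FP32 run: disable TF32 explicitly
    # torch.backends.cuda.matmul.allow_tf32 = False
    # torch.backends.cudnn.allow_tf32 = False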

I was expecting TF32 and AMP to increase throughput on the Ampere architecture, not decrease it. Do you have any idea why?

The binaries are missing the cutlass kernels in the statically linked cuDNN, as described here, so you would have to build PyTorch from source and check the performance again.

Thanks a lot @ptrblck, that solved the issue!

Do you know when the bug will be fixed in the conda cudnn package?