FP16 training with feedforward networks: slower training and no memory reduction

I’m doing mixed-precision training (using the native amp in PyTorch 1.6) on feedforward neural networks. Both the training time and the memory consumed have increased as a result.

The GPU is RTX 2080Ti. I tried to have all of the dimensions in multiples of 8 as well.

The training time is less important to me. I mainly want to decrease the memory footprint as much as possible, since I’m using large feedforward neural networks only.


Well, if your memory consumption is so high that it bothers you, then I suggest you downsample your data with convolutional layers (if the problem is not a regression problem) and try to update most of the variables in your code in place.

Not really applicable in my situation. But I’m wondering more about why fp16 isn’t reducing my memory at all…

The benefits of half-precision floating point may just be negligible in this case.
You said you were using mixed precision, right? Then just switching to single precision might give you what you want, I guess.

Currently single precision is indeed faster. But I need fp16 primarily to reduce my memory footprint for when I want to run bigger networks.

Fp16 should still decrease your memory footprint, even if only by a small factor.
It’s possible that the decrease is so small that it’s negligible.

Unfortunately that’s exactly the problem, even on a model that takes 9GB, there is a memory increase when using fp16. I even tried pytorch lightning ‘precision=16’ (which uses native amp) and still no decrease. I’m guessing it’s related to the fact that I’m strictly using feedforward networks but I’m not entirely sure why.

It might be the feedforward networks, though I’m not entirely sure.
What I’d suggest, if you have the time, is to create a conv network, train it once with single precision and once with mixed precision, and check the difference in memory usage.

Are you seeing a speedup or slowdown?
I’m a bit confused, since you’ve mentioned both.

Also, how are you measuring the memory usage?
Note that nvidia-smi shows the total allocated memory (CUDA context + cache + allocated memory), so you should use torch.cuda.memory_allocated() instead.
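As a rough sketch of the measurement suggested above, the peak allocated memory of a single training step can be compared with and without autocast. The helper name `peak_training_memory_mb` and the squared-mean dummy loss are made up for illustration, and the function simply returns 0.0 on a machine without CUDA.

```python
import torch
import torch.nn as nn


def peak_training_memory_mb(model, batch, use_amp):
    """Run one forward/backward step and return the peak allocated memory in MB.

    Hypothetical helper for illustration; returns 0.0 if CUDA is unavailable.
    """
    if not torch.cuda.is_available():
        return 0.0
    device = torch.device("cuda")
    model = model.to(device)
    batch = batch.to(device)
    # Start the peak counter from what is currently allocated (model + batch).
    torch.cuda.reset_peak_memory_stats(device)
    with torch.cuda.amp.autocast(enabled=use_amp):
        # Dummy loss, assumed only for this sketch.
        loss = model(batch).float().pow(2).mean()
    loss.backward()
    torch.cuda.synchronize(device)
    # Memory held by tensors only, without the CUDA context or the allocator cache.
    return torch.cuda.max_memory_allocated(device) / 1024 ** 2
```

Calling it twice on fresh copies of the same model, once with `use_amp=False` and once with `use_amp=True`, gives numbers that are comparable to each other, unlike the totals reported by nvidia-smi.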

It seems that it doesn’t work well with feedforward networks. I see neither a speedup nor reduced memory. However, when using other networks such as conv networks (as suggested), there is indeed a significant speedup and memory reduction.

torch.cuda.amp.autocast works in this way:

  1. Cast the inputs of an operation to fp16 if the operation is fp16-safe.
    For example, batch normalization will stay in fp32.
  2. Run each operation at its configured precision level.
    If the input of an operation does not match the target precision, it is converted automatically.
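The two steps above can be observed on a single linear layer. This is a minimal sketch, assuming a PyTorch version that provides the generic `torch.autocast` context manager (newer than the 1.6 release discussed in this thread); without CUDA it falls back to bfloat16 on the CPU so that it still runs. The function name `autocast_dtypes` is made up for illustration.

```python
import torch
import torch.nn as nn


def autocast_dtypes(in_features=16, out_features=8):
    """Return (weight dtype, input dtype, output dtype, autocast dtype) for one layer."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # fp16 on the GPU; bfloat16 is an assumed stand-in so the sketch runs on CPU.
    low = torch.float16 if device == "cuda" else torch.bfloat16
    layer = nn.Linear(in_features, out_features).to(device)
    x = torch.randn(4, in_features, device=device)
    with torch.autocast(device_type=device, dtype=low):
        # The fp32 input and weight are cast on the fly for this operation.
        y = layer(x)
    return layer.weight.dtype, x.dtype, y.dtype, low


weight_dtype, input_dtype, output_dtype, low_dtype = autocast_dtypes()
print(weight_dtype, input_dtype, output_dtype)
```

The stored parameters and the input stay in fp32, while the output of the linear layer comes back in the lower precision. The parameters themselves are not converted, so autocast alone does not shrink the memory they occupy.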

Mixed precision can sometimes be slower because of the type-casting operations.
At inference time, using pure half precision is usually faster than using amp.
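For reference, pure half-precision inference means converting both the model and its inputs with `.half()` instead of wrapping the forward pass in autocast. A minimal sketch, with a made-up three-layer model and a hypothetical function name; it stays in fp32 when no GPU is present, since half-precision support on the CPU depends on the PyTorch version.

```python
import torch
import torch.nn as nn


def half_inference_dtype():
    """Run one inference pass and return the dtype of the output."""
    # Toy model, assumed only for this sketch.
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)).eval()
    x = torch.randn(2, 16)
    if torch.cuda.is_available():
        # Parameters and inputs are both stored in fp16, so no per-op casting is needed.
        model = model.cuda().half()
        x = x.cuda().half()
    with torch.no_grad():
        y = model(x)
    return y.dtype
```

Unlike autocast, this also halves the memory taken by the parameters, at the cost of running every operation in fp16, including those that autocast would keep in fp32.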

But I’m not sure why your memory consumption is higher.
I always see less memory consumption with mixed-precision training on RTX 2080 Ti.
Maybe the above step 2 needs features in both fp32 and fp16 for precision switching layers.
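One way to make this guess concrete is a back-of-the-envelope byte count. The sketch below is only a rough model under stated assumptions, not a description of what PyTorch actually allocates: parameters and gradients stay in fp32, amp adds one fp16 copy of every parameter, and activations are stored in fp16 under amp and in fp32 otherwise. Optimizer state and temporary buffers are ignored, and the function name `estimate_bytes` is made up.

```python
def estimate_bytes(layer_sizes, batch_size, use_amp):
    """Rough byte estimate for one training step of a fully connected network.

    Assumed accounting (illustrative only):
      - fp32 parameters and fp32 gradients: 4 + 4 bytes per parameter
      - under amp, an extra fp16 copy of each parameter: 2 bytes
      - activations: 2 bytes each under amp, 4 bytes each otherwise
    """
    pairs = list(zip(layer_sizes[:-1], layer_sizes[1:]))
    n_params = sum(n_in * n_out + n_out for n_in, n_out in pairs)  # weights + biases
    n_acts = batch_size * sum(layer_sizes)  # the input plus every layer output
    total = n_params * (4 + 4)
    if use_amp:
        total += n_params * 2  # fp16 copies of the parameters
        total += n_acts * 2
    else:
        total += n_acts * 4
    return total


wide = [8192, 8192, 8192, 8192]
print(estimate_bytes(wide, 32, use_amp=False))
print(estimate_bytes(wide, 32, use_amp=True))
```

Under these assumptions, a wide feedforward network with a small batch is dominated by its parameters, so the extra fp16 copies outweigh the savings on activations. A network whose memory is dominated by activations, as is typical for conv networks, comes out ahead.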

You may want to refer to this paper.

Found a 10x memory increase with native fp16 (pt1.6) - no such problem with nvidia-apex (pt1.5)