Hello,
I’m doing mixed-precision training (with the native amp in PyTorch 1.6) on feedforward neural networks, and both the training time and the memory consumption have increased as a result.
The GPU is an RTX 2080 Ti, and I also tried keeping all of the layer dimensions multiples of 8.
The training time is less important to me; I mainly want to reduce the memory footprint as much as possible, since I’m only using large feedforward networks.
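For reference, my setup looks roughly like this (the model and sizes below are placeholders, not my actual network):

```python
import torch
import torch.nn as nn

model = nn.Sequential(  # placeholder feedforward net, all dims multiples of 8
    nn.Linear(1024, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 8),
).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):                       # stand-in for the real data loader
    x = torch.randn(256, 1024, device="cuda")
    y = torch.randn(256, 8, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():       # forward runs in fp16 where safe
        loss = nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()         # loss scaling to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
```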
Well, if the memory consumption is high enough to be a problem, I’d suggest downsampling your data with convolutional layers (if it’s not a regression problem) and updating as many of your variables in-place as possible; see the sketch below.
The benefit of half-precision floats may just be negligible in this case.
You said you were using mixed precision, right? Then just switching to single precision might give you what you want, I guess.
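Roughly what I mean (the sizes here are just an example): a strided conv to downsample the input before the dense layers, plus in-place activations and in-place tensor updates to avoid extra copies.

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.down = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)  # halves H and W
        self.relu = nn.ReLU(inplace=True)          # overwrites its input instead of allocating
        self.fc = nn.Linear(16 * 16 * 16, 10)      # assumes 32x32 inputs

    def forward(self, x):
        x = self.relu(self.down(x))
        return self.fc(x.flatten(1))

x = torch.randn(64, 3, 32, 32)
x.mul_(2.0)            # in-place update, no new tensor is allocated
print(Net()(x).shape)  # torch.Size([64, 10])
```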
Unfortunately that’s exactly the problem: even on a model that takes 9 GB, there is a memory increase when using fp16. I even tried PyTorch Lightning’s `precision=16` (which uses native amp) and still saw no decrease. I’m guessing it’s related to the fact that I’m strictly using feedforward networks, but I’m not entirely sure why.
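For completeness, the Lightning run looked roughly like this (the model below is just a stand-in for my real network):

```python
import torch
import torch.nn as nn
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 8))

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())

data = DataLoader(TensorDataset(torch.randn(512, 1024), torch.randn(512, 8)), batch_size=64)
trainer = pl.Trainer(gpus=1, precision=16, max_epochs=1)  # precision=16 -> native amp
trainer.fit(BoringModel(), data)
```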
It might be the feedforward networks, though I’m not entirely sure.
What I’d suggest, if you have the time, is to create a conv network, use single precision in one run and mixed precision in another, and check the difference in memory usage.
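Something along these lines (the conv net and sizes below are made up): run the same forward/backward pass once in fp32 and once under autocast, and report the peak allocated memory.

```python
import torch
import torch.nn as nn

def peak_memory_mb(use_amp: bool) -> float:
    model = nn.Sequential(
        nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        nn.Flatten(), nn.Linear(64 * 64 * 64, 128),
    ).cuda()
    x = torch.randn(32, 3, 64, 64, device="cuda")
    torch.cuda.reset_peak_memory_stats()
    with torch.cuda.amp.autocast(enabled=use_amp):
        out = model(x)
    out.sum().backward()  # no GradScaler here since we only care about memory
    return torch.cuda.max_memory_allocated() / 1024 ** 2

print("fp32 peak MB:", peak_memory_mb(False))
print("amp  peak MB:", peak_memory_mb(True))
```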
Are you seeing a speedup or slowdown?
I’m a bit confused, since you’ve mentioned both.
Also, how are you measuring the memory usage?
Note that nvidia-smi shows the total allocated memory (CUDA context + cache + allocated memory), so you should use torch.cuda.memory_allocated() instead.
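For example:

```python
import torch

x = torch.randn(1024, 1024, device="cuda")
print(torch.cuda.memory_allocated() / 1024 ** 2, "MB currently allocated by tensors")
print(torch.cuda.memory_reserved() / 1024 ** 2, "MB held by the caching allocator")
print(torch.cuda.max_memory_allocated() / 1024 ** 2, "MB peak allocated so far")
```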
It seems that it doesn’t work well with feedforward networks. I see neither a speedup nor reduced memory. However, when using other networks such as conv networks (as suggested), there is indeed a significant speedup and memory reduction.
1. Layers run in the configured precision levels.
2. If the layer input does not match the target precision, it will convert the type automatically (see the check below).
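You can see this per-op switching directly, e.g. (arbitrary sizes):

```python
import torch
import torch.nn as nn

linear = nn.Linear(1024, 1024).cuda()          # parameters stay fp32
x = torch.randn(64, 1024, device="cuda")       # fp32 input
with torch.cuda.amp.autocast():
    y = linear(x)                              # linear runs in fp16 -> fp16 output
    z = y.softmax(dim=-1)                      # softmax prefers fp32 -> output cast to fp32
print(y.dtype, z.dtype)                        # torch.float16 torch.float32
```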
Using mixed precision can sometimes be slower because of these type-casting operations.
At inference time, using pure half-precision is faster than using amp.
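A pure fp16 inference sketch (arbitrary sizes) would be:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 8)).cuda().half()
x = torch.randn(64, 1024, device="cuda").half()  # cast everything to half up front
with torch.no_grad():
    out = model(x)
print(out.dtype)  # torch.float16
```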
But I’m not sure why your memory consumption is higher.
I always see less memory consumption with mixed-precision training on RTX 2080 Ti.
Maybe step 2 above needs to keep the features in both fp32 and fp16 for the precision-switching layers.