Hello,
I’m doing mixed-precision training (with the native amp in PyTorch 1.6) on feedforward neural networks, and both the training time and the memory consumption have increased as a result.
The GPU is an RTX 2080 Ti, and I also tried keeping all of the layer dimensions multiples of 8.
The training time is less important to me; I mainly want to reduce the memory footprint as much as possible, since I’m only using large feedforward networks.
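For reference, my setup looks roughly like this (the model and sizes below are placeholders, not my actual network):

```python
import torch
import torch.nn as nn

model = nn.Sequential(  # placeholder feedforward net, all dims multiples of 8
    nn.Linear(1024, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 8),
).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):                       # stand-in for the real data loader
    x = torch.randn(256, 1024, device="cuda")
    y = torch.randn(256, 8, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():       # forward runs in fp16 where safe
        loss = nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()         # loss scaling to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
```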
Well, if the memory consumption is high enough to be a problem, I’d suggest downsampling your data with convolutional layers (if it’s not a regression problem) and updating as many of your variables in-place as possible; see the sketch below.
The benefit of half-precision floats may just be negligible in this case.
You said you were using mixed precision, right? Then just switching to single precision might give you what you want, I guess.
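Roughly what I mean (the sizes here are just an example): a strided conv to downsample the input before the dense layers, plus in-place activations and in-place tensor updates to avoid extra copies.

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.down = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)  # halves H and W
        self.relu = nn.ReLU(inplace=True)          # overwrites its input instead of allocating
        self.fc = nn.Linear(16 * 16 * 16, 10)      # assumes 32x32 inputs

    def forward(self, x):
        x = self.relu(self.down(x))
        return self.fc(x.flatten(1))

x = torch.randn(64, 3, 32, 32)
x.mul_(2.0)            # in-place update, no new tensor is allocated
print(Net()(x).shape)  # torch.Size([64, 10])
```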
Unfortunately that’s exactly the problem: even on a model that takes 9 GB, there is a memory increase when using fp16. I even tried PyTorch Lightning’s `precision=16` (which uses native amp) and still saw no decrease. I’m guessing it’s related to the fact that I’m strictly using feedforward networks, but I’m not entirely sure why.
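For completeness, the Lightning run looked roughly like this (the model below is just a stand-in for my real network):

```python
import torch
import torch.nn as nn
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 8))

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())

data = DataLoader(TensorDataset(torch.randn(512, 1024), torch.randn(512, 8)), batch_size=64)
trainer = pl.Trainer(gpus=1, precision=16, max_epochs=1)  # precision=16 -> native amp
trainer.fit(BoringModel(), data)
```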
It might be the feedforward networks, though I’m not entirely sure.
What I’d suggest, if you have the time, is to create a conv network, use single precision in one run and mixed precision in another, and check the difference in memory usage.
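Something along these lines (the conv net and sizes below are made up): run the same forward/backward pass once in fp32 and once under autocast, and report the peak allocated memory.

```python
import torch
import torch.nn as nn

def peak_memory_mb(use_amp: bool) -> float:
    model = nn.Sequential(
        nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        nn.Flatten(), nn.Linear(64 * 64 * 64, 128),
    ).cuda()
    x = torch.randn(32, 3, 64, 64, device="cuda")
    torch.cuda.reset_peak_memory_stats()
    with torch.cuda.amp.autocast(enabled=use_amp):
        out = model(x)
    out.sum().backward()  # no GradScaler here since we only care about memory
    return torch.cuda.max_memory_allocated() / 1024 ** 2

print("fp32 peak MB:", peak_memory_mb(False))
print("amp  peak MB:", peak_memory_mb(True))
```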
Are you seeing a speedup or slowdown?
I’m a bit confused, since you’ve mentioned both.
Also, how are you measuring the memory usage?
Note that nvidia-smi shows the total allocated memory (CUDA context + cache + allocated memory), so you should use torch.cuda.memory_allocated() instead.
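For example:

```python
import torch

x = torch.randn(1024, 1024, device="cuda")
print(torch.cuda.memory_allocated() / 1024 ** 2, "MB currently allocated by tensors")
print(torch.cuda.memory_reserved() / 1024 ** 2, "MB held by the caching allocator")
print(torch.cuda.max_memory_allocated() / 1024 ** 2, "MB peak allocated so far")
```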
It seems that it doesn’t work well with feedforward networks. I see neither a speedup nor reduced memory. However, when using other networks such as conv networks (as suggested), there is indeed a significant speedup and memory reduction.
1. Layers run in the configured precision levels.
2. If the layer input does not match the target precision, it will convert the type automatically (see the check below).
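You can see this per-op switching directly, e.g. (arbitrary sizes):

```python
import torch
import torch.nn as nn

linear = nn.Linear(1024, 1024).cuda()          # parameters stay fp32
x = torch.randn(64, 1024, device="cuda")       # fp32 input
with torch.cuda.amp.autocast():
    y = linear(x)                              # linear runs in fp16 -> fp16 output
    z = y.softmax(dim=-1)                      # softmax prefers fp32 -> output cast to fp32
print(y.dtype, z.dtype)                        # torch.float16 torch.float32
```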
Using mixed precision can sometimes be slower because of these type-casting operations.
At inference time, using pure half-precision is faster than using amp.
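A pure fp16 inference sketch (arbitrary sizes) would be:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 8)).cuda().half()
x = torch.randn(64, 1024, device="cuda").half()  # cast everything to half up front
with torch.no_grad():
    out = model(x)
print(out.dtype)  # torch.float16
```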
But I’m not sure why your memory consumption is higher.
I always see less memory consumption with mixed-precision training on RTX 2080 Ti.
Maybe step 2 above needs to keep the features in both fp32 and fp16 for the precision-switching layers.