Training partially with fp16


I was wondering if it’s possible to train partially with fp16.
I’m dealing with a very big network. I would like to use fp16 in a submodule of the network.
Does autograd handle the casting from float16 to float32 during backprop?

In another post I saw someone recommend keeping batch normalization in fp32 to avoid convergence issues. So there should be no problem, right?


Yes, it’s possible.
However, when using FP16 you should watch out for some pitfalls, e.g. imprecise weight updates or gradient underflow. NVIDIA recently published their mixed precision utilities for PyTorch, named apex.
On the one hand, amp gives you automatic mixed precision training; on the other hand, the FP16_Optimizer gives you a bit more control over your workflow.
Have a look at this small tutorial I wrote a few days ago. It's not finished yet and @mcarilli is currently reviewing it, but I hope it's a good starting point.
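
To make the two pitfalls mentioned above (imprecise weight updates and gradient underflow) a bit more concrete, here is a minimal sketch in plain Python. It does not use PyTorch or apex; it only rounds numbers to half and single precision with the standard `struct` module. The helpers `to_fp16` and `to_fp32`, the loss scale of 1024, and the example values are my own assumptions for illustration, not anything taken from apex.

```python
import struct


def to_fp16(x: float) -> float:
    """Hypothetical helper: round a Python float to the nearest fp16 value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Hypothetical helper: round a Python float to the nearest fp32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


# Pitfall 1: imprecise weight update.
# Near 1.0 the spacing between fp16 values is 2**-10 (about 0.001),
# so an update of 1e-4 is rounded away when the weight is stored in fp16.
weight = 1.0
update = 1e-4
fp16_result = to_fp16(to_fp16(weight) + to_fp16(update))
fp32_result = to_fp32(weight + update)  # an fp32 "master" copy keeps the update
print("fp16 weight after update:", fp16_result)  # 1.0, the update is lost
print("fp32 weight after update:", fp32_result)

# Pitfall 2: gradient underflow.
# The smallest positive fp16 value is 2**-24 (about 6e-8); a gradient
# of 1e-8 is flushed to zero when cast to fp16.
tiny_grad = 1e-8
print("gradient stored in fp16:", to_fp16(tiny_grad))  # 0.0

# Loss scaling (assumed scale factor): multiplying the loss by a constant
# scales every gradient by the same constant, which lifts small gradients
# into the fp16 range. Here the gradient is scaled directly to mimic that.
loss_scale = 1024.0
scaled_grad_fp16 = to_fp16(tiny_grad * loss_scale)
# Unscale in fp32 before applying the update to the fp32 master weight.
recovered = to_fp32(scaled_grad_fp16 / loss_scale)
print("gradient recovered after unscaling:", recovered)
```

Keeping an fp32 copy of the weights and scaling the loss are the usual remedies for these two problems, which is why a mixed setup (fp16 in one submodule, fp32 elsewhere) tends to work better than pure fp16.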