Training with Half Precision

What speed gain do you achieve using FP16 instead of FP32 in PyTorch (on a ResNet or similar)?

It really depends. On a Titan X or P100, you get about a 15% speedup for all the architectures I’ve tried. On a Titan V or V100, I get about a 50% speedup for ResNet-50 and a 2x speedup on Xception, probably because of the way Tensor Cores work. The 2x speedup actually makes the Titan V worth it if you are going to be training a lot of networks that use grouped convolutions. You also get to use roughly double the batch size, because the FP16 tensors take half the space in VRAM.
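For anyone who wants to measure this on their own hardware, here is a rough timing sketch (it assumes torchvision is installed; the exact numbers depend heavily on the GPU, cuDNN version, and batch size):

```python
import time
import torch
import torchvision

def time_resnet50(dtype, batch_size=64, iters=50):
    # Build ResNet-50 and a dummy batch in the requested precision.
    model = torchvision.models.resnet50().cuda().to(dtype)
    data = torch.randn(batch_size, 3, 224, 224, device="cuda", dtype=dtype)

    # Warm up so cuDNN can select its algorithms before timing.
    for _ in range(5):
        model(data).float().mean().backward()

    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model.zero_grad()
        model(data).float().mean().backward()
    torch.cuda.synchronize()
    return (time.time() - start) / iters

print("FP32: %.1f ms/iter" % (1000 * time_resnet50(torch.float32)))
print("FP16: %.1f ms/iter" % (1000 * time_resnet50(torch.float16)))
```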

I should say that the speed-up isn’t painless. I’ve had issues with fp16 overflow at times. Usually these are fixable, but you only find out about them after investing a significant amount of time training.

I also worry about the added complexity leading me to draw wrong conclusions from my experiments (e.g., is this really an FP16 issue? Or did one model seem to work better only because the other was quietly broken by FP16?).

You also need to make sure to maintain a full 32-bit (FP32) master copy of your parameters for the optimizer updates. This helps stability substantially.
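For anyone curious what that looks like in plain PyTorch, here is a minimal, self-contained sketch of the idea using a toy model; the static loss scale of 128 is just an example value:

```python
import torch
import torch.nn as nn

# Toy FP16 model and data, just to make the sketch self-contained.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).cuda().half()
data = torch.randn(32, 128, device="cuda", dtype=torch.float16)
target = torch.randint(0, 10, (32,), device="cuda")

# FP32 "master" copy of the parameters; the optimizer only ever sees this copy.
master_params = [p.detach().clone().float() for p in model.parameters()]
optimizer = torch.optim.SGD(master_params, lr=0.1)
loss_scale = 128.0   # example static loss scale

for step in range(10):
    model.zero_grad()
    loss = nn.functional.cross_entropy(model(data).float(), target)
    (loss * loss_scale).backward()   # scale so small FP16 grads don't underflow

    # Copy the FP16 grads into the FP32 master params and unscale them.
    for master, p in zip(master_params, model.parameters()):
        master.grad = p.grad.detach().float() / loss_scale

    optimizer.step()

    # Copy the updated FP32 master weights back into the FP16 model.
    with torch.no_grad():
        for master, p in zip(master_params, model.parameters()):
            p.copy_(master)
```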

We’ve developed a lightweight, open-source set of PyTorch tools to enable easier, more numerically stable mixed precision training: https://github.com/nvidia/apex. Mixed precision means that the majority of the network uses FP16 arithmetic (reducing memory storage/bandwidth demands and enabling Tensor Cores for GEMMs and convolutions), while a small subset of operations is executed in FP32 for improved stability.

Highlights include:

  • Amp, a tool that executes all numerically safe Torch functions in FP16, while automatically casting potentially unstable operations to FP32. Amp also automatically implements dynamic loss scaling. Amp is designed to offer maximum numerical stability, and most of the speed benefits of pure FP16 training.
  • FP16_Optimizer, an optimizer wrapper that automatically implements FP32 master weights for parameter updates, as well as static or dynamic loss scaling. FP16_Optimizer is designed to be minimally invasive (it doesn’t change the execution of Torch operations) and offer almost all the speed of pure FP16 training with significantly improved numerical stability.
  • apex.parallel.DistributedDataParallel, a distributed module wrapper that achieves high performance by overlapping computation with communication during backward(). Apex DistributedDataParallel is useful for both pure FP32 as well as mixed precision training.

Full API documentation can be found here.
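For reference, a minimal sketch of what a training loop with apex.amp looks like, using the later unified amp.initialize API (the toy model and data are placeholders; note that apex.amp has since been deprecated in favor of torch.cuda.amp):

```python
import torch
import torch.nn as nn
from apex import amp

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# "O1" runs numerically safe ops in FP16 and keeps unstable ones in FP32,
# with dynamic loss scaling handled internally.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

data = torch.randn(32, 128, device="cuda")
target = torch.randint(0, 10, (32,), device="cuda")

for step in range(10):
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(data), target)
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
```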

Our examples page demonstrates the use of FP16_Optimizer and Apex DistributedDataParallel. Amp examples are coming soon, and Amp’s use is thoroughly discussed in its README.

Give Apex a try and let us know what you think!

Sorry for the double post; the forum told me “new users may only post 2 links at a time”, or something along those lines.

The link to csarofeen/examples does not work any more. You can find an example here: Fp16 on pytorch 0.4

Hi, thanks for your explanation.
May I ask why BN must use float32? Does that mean BN is different from other layers, like conv, linear, etc.?

I’d say the easiest way to use mixed precision without making a mistake is to use PyTorch Lightning with
Trainer(use_amp=True).

This will train your model in 16-bit precision.

https://pytorch-lightning.readthedocs.io/en/0.6.0/trainer.html
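For completeness, a minimal sketch of that (MyLightningModule is a placeholder for your own LightningModule; note that newer Lightning releases replaced use_amp=True with the precision argument, e.g. Trainer(precision=16)):

```python
from pytorch_lightning import Trainer

# In the Lightning version referenced above (0.6.x), use_amp=True enables
# mixed precision training via apex.
model = MyLightningModule()        # placeholder for your own LightningModule
trainer = Trainer(use_amp=True, gpus=1)
trainer.fit(model)
```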

Thanks @mcarilli. Apex was very useful to us in our project.

Any suggestions on using float16 with transformers? Should I keep some layers in float32, just as batch normalization is recommended to be kept in float32?

I would generally recommend using the automatic mixed precision package (via torch.cuda.amp), which casts the inputs to the appropriate dtype for each method.
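A minimal sketch of the torch.cuda.amp pattern, using autocast together with GradScaler for dynamic loss scaling (the toy model and data are placeholders for your own setup):

```python
import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = GradScaler()

data = torch.randn(32, 128, device="cuda")
target = torch.randint(0, 10, (32,), device="cuda")

for step in range(10):
    optimizer.zero_grad()
    # autocast runs each op in FP16 or FP32, whichever is considered safe for it.
    with autocast():
        output = model(data)
        loss = nn.functional.cross_entropy(output, target)
    # GradScaler applies dynamic loss scaling to avoid FP16 gradient underflow.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```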

Okay, thanks. Should we keep val_step under the autocast scope as well, for a fair comparison between tr_loss and val_loss?

Yes, you can also use autocasting during validation.
Especially if you plan on using it for the test dataset (or deployment), I would use it there as well.
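Something along these lines for the validation loop (model, val_loader, and criterion are taken from your own training script):

```python
import torch
from torch.cuda.amp import autocast

model.eval()
with torch.no_grad():
    for data, target in val_loader:   # your existing validation DataLoader
        with autocast():
            output = model(data.cuda())
            val_loss = criterion(output, target.cuda())
        # No GradScaler is needed here, since no gradients are computed.
```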

I used the torch.cuda.amp tools to train a U-Net-like network, but my loss function gave NaN. I guess this is an overflow problem when using FP16. Can you give me some advice to overcome this? Thank you so much!

Could you check if the output of the model already contains invalid values?
If so, could you check the intermediate activations for invalid values (e.g. using torch.isfinite(out).all()) to narrow down the first occurrence?
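For example, a quick check after a forward pass (model and data are taken from your own training script):

```python
import torch
from torch.cuda.amp import autocast

with autocast():
    out = model(data)   # model and data from your training script
print("output finite:", torch.isfinite(out).all().item())
```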

What do you mean by “first occurrence”? I use torch.autograd.set_detect_anomaly(True), and the output says there is a NaN problem in SqrtBackward, AddBackward, or CudnnConvolutionBackward, sometimes at the 0th input.
I think this is an amp problem, because it doesn’t happen when I turn amp off.
Thank you!

By “first occurrence” I meant the first activation that shows an invalid value.
Since you are apparently seeing different operations at the moment, this would help narrow down the offending one (e.g. an eps value used in sqrt might be too small when using amp and could thus underflow).
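To illustrate the eps point with a concrete (hypothetical) value:

```python
import torch

# A typical FP32 epsilon such as 1e-12 is far smaller than the smallest
# positive FP16 value (~6e-8), so it silently flushes to zero in half precision:
print(torch.tensor(1e-12, dtype=torch.float16))   # tensor(0., dtype=torch.float16)

# torch.sqrt(x + eps) then becomes sqrt(0) for small x, and its backward pass
# divides by that zero, which is where Inf/NaN gradients can appear under amp.
```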

That is much clearer now!
Let’s assume I have many layers that use the sqrt operation. How can I detect which one causes the overflow/underflow problem?
Is there any way to find it without modifying the forward pass of each layer to figure out the first occurrence?
Thank you!

You could use forward hooks as described here, which would allow you to check the outputs without changing the forward function, in case you are using nn.Modules.
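As a rough sketch of that approach (assuming the network is built from nn.Modules and model is your existing network):

```python
import torch

def make_finite_check_hook(name):
    def hook(module, inputs, output):
        # Print a message for every module whose output contains NaN/Inf;
        # the earliest message points to the first offending layer.
        if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
            print(f"Non-finite output in {name} ({module.__class__.__name__})")
    return hook

# Register one hook per submodule of your existing model.
handles = [m.register_forward_hook(make_finite_check_hook(n))
           for n, m in model.named_modules()]

# ... run a forward pass (e.g. under autocast) here ...

for handle in handles:
    handle.remove()   # clean up once the offending layer has been found
```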

Thank you so much. I will try it and report back with the result!