I’ve been confused lately about torch.cuda.amp and NVIDIA apex. What does each of them do? (I know that both of them do mixed precision.) How do they differ, and when should I use which?
torch.cuda.amp just landed recently in the nightly builds and is the recommended way to use mixed-precision training.
CC @mcarilli, who was working on pushing this utility into the PyTorch core.
tl;dr: torch.cuda.amp is the way to go moving forward.
We published Apex Amp last year as an experimental mixed-precision resource because PyTorch didn’t yet support the extensibility points needed to move it upstream cleanly. However, asking people to install something separate was a headache; extension building and forward/backward compatibility were particular pain points.
Given the benefits of automatic mixed precision, it belongs in PyTorch core, so moving it upstream has been my main project for the past six months. I’m happy with torch.cuda.amp. It’s more flexible and intuitive than Apex Amp, and it repairs many of Apex Amp’s known flaws. Apex Amp will shortly be deprecated (and to be honest I haven’t been working on it for a while; I’ve focused on making sure torch.cuda.amp covers the most-requested feature gaps).
Prefer torch.cuda.amp, early and often. It supports a wide range of use cases. If it doesn’t support your network for some reason, file a PyTorch issue and tag @mcarilli. In general, prefer native tools for versioning stability (that means torch.nn.parallel.DistributedDataParallel too) because they’re tested and updated as needed for each master commit or binary build.
Apex will remain as a source of utilities that can be helpful, e.g. fast fused optimizers, but forward and backward compatibility across all PyTorch versions can’t be guaranteed. Don’t take a dependency on Apex unless you want to try those.
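For reference, here’s a minimal sketch of the native pattern (the model, optimizer, and data below are just placeholders, and the loop is condensed to a single step):

```python
import torch

# Placeholder model, optimizer, loss, and data; substitute your own.
model = torch.nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()
inputs = torch.randn(32, 128, device="cuda")
targets = torch.randint(0, 10, (32,), device="cuda")

scaler = torch.cuda.amp.GradScaler()

optimizer.zero_grad()
# Run the forward pass (and loss computation) in mixed precision.
with torch.cuda.amp.autocast():
    loss = loss_fn(model(inputs), targets)
# Scale the loss, backprop, then unscale and step through the scaler.
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```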
Hello. I have only a single GPU and a very heavy model, so I am running out of memory…
Recently I learned that there is mixed-precision training, which saves GPU memory during training.
Is NVIDIA apex an older tool for mixed-precision training?
I read some posts and saw your comment that there is a PyTorch implementation called torch.cuda.amp.
How do I use it, and what are the nightly builds you mentioned above?
Hi @Yangmin
You may want to refer to:
https://pytorch.org/docs/stable/notes/amp_examples.html
Yes, apex.amp was our first implementation of mixed-precision training; it is deprecated now and has been replaced by torch.cuda.amp. @seungjun posted the examples above, where you can see its usage.
You don’t need to install a nightly release anymore, as torch.cuda.amp is already available in the stable releases.
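If you also checkpoint training, the scaler carries state just like the model and optimizer; here’s a small sketch along the lines of the saving/resuming pattern on the page linked above (the model, optimizer, and file path are placeholders):

```python
import torch

model = torch.nn.Linear(128, 10).cuda()                   # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # placeholder optimizer
scaler = torch.cuda.amp.GradScaler()

# Save the scaler state alongside the usual model/optimizer state.
torch.save(
    {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scaler": scaler.state_dict(),  # scale factor and growth tracking
    },
    "checkpoint.pt",  # placeholder path
)

# Resume: restore all three before continuing training.
ckpt = torch.load("checkpoint.pt")
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])
scaler.load_state_dict(ckpt["scaler"])
```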
Hi mcarilli!
Is it feasible to combine torch.cuda.amp and fp16_compress_hook? Casting gradients to float16 and allreducing those float16 gradient tensors is helpful for reducing communication cost.
Yes, the action of those hooks should be composable with the things amp does.
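For example, a rough sketch of that combination (single-process setup shown only for illustration; the model and data are placeholders, and in practice you’d launch one process per GPU):

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks
from torch.nn.parallel import DistributedDataParallel as DDP

# Minimal single-process setup just for illustration; normally this runs
# under torchrun / torch.distributed.launch with one process per GPU.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("nccl", rank=0, world_size=1)

model = DDP(torch.nn.Linear(128, 10).cuda(), device_ids=[0])  # placeholder model
# Compress gradient buckets to float16 for the allreduce; they are cast back
# to the original dtype before the optimizer sees them.
model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

inputs = torch.randn(32, 128, device="cuda")
targets = torch.randint(0, 10, (32,), device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
scaler.scale(loss).backward()  # the comm hook handles the fp16 allreduce here
scaler.step(optimizer)
scaler.update()
```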