What is the difference between the FusedAdam optimizer in NVIDIA's Apex (AMP) package and the Adam optimizer in PyTorch?
The Adam optimizer in PyTorch (like all PyTorch optimizers) carries out optimizer.step() by looping over parameters and launching a series of kernels for each parameter. This can require hundreds of small launches that are mostly bound by CPU-side Python looping and kernel launch overhead, resulting in poor device utilization. Currently, the FusedAdam implementation in Apex flattens the parameters for the optimization step, then carries out the step itself via a fused kernel that combines all the Adam operations. In this way, both the loop over parameters and the internal series of Adam operations for each parameter are fused, so optimizer.step() requires only a few kernel launches.
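In practice the difference shows up as roughly a one-line swap when constructing the optimizer. A minimal usage sketch, assuming Apex is installed with its CUDA extensions and the model lives on the GPU; the model, sizes, and learning rate here are just placeholders, not anything from this thread:

```python
import torch
from apex.optimizers import FusedAdam

model = torch.nn.Linear(1024, 1024).cuda()

# Standard PyTorch Adam: step() loops over parameters in Python and launches
# several small elementwise kernels per parameter.
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Apex FusedAdam: the per-parameter loop and the Adam math are handled by a
# fused kernel, so step() needs only a few launches.
optimizer = FusedAdam(model.parameters(), lr=1e-3)

loss = model(torch.randn(32, 1024, device="cuda")).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```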
The current implementation (in Apex master) is brittle and only works with Amp opt_level O2. I’ve got a WIP branch to make it work with any opt_level (https://github.com/NVIDIA/apex/pull/351). I recommend waiting until that PR is merged and then trying it.
Just wondering, wouldn’t it be possible to use PyTorch multiprocessing to parallelise the Adam loop? Or CUDA streams?
@sbsky Either technique comes with its own overhead. If launching one of these kernels takes much longer than executing it, the only way to go faster is to reduce the launch overhead itself. If you try to do this with multiprocessing, you’ll need to move the references to those tensors between processes, which isn’t free. The alternative is to launch fewer kernels, which is what @mcarilli did in AMP.
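To make the streams point concrete, here is a rough sketch (hypothetical, and simplified to a plain SGD-style update rather than full Adam) of spreading per-parameter updates across a few CUDA streams. The GPU work for different parameters can overlap, but every launch is still issued from the same CPU-side Python loop, so a launch-bound step stays launch-bound:

```python
import torch

# Placeholder parameters with fake gradients, just to have something to update.
params = [torch.randn(1024, 1024, device="cuda", requires_grad=True) for _ in range(100)]
for p in params:
    p.grad = torch.randn_like(p)

streams = [torch.cuda.Stream() for _ in range(4)]
lr = 1e-3

for i, p in enumerate(params):
    # Round-robin the updates over a few streams so the GPU can overlap them.
    with torch.cuda.stream(streams[i % len(streams)]):
        # Plain SGD-style update for brevity; a real Adam step launches several
        # more kernels per parameter (exp_avg, exp_avg_sq, bias correction, ...).
        p.data.add_(p.grad, alpha=-lr)

# The launches above are still issued one by one from Python, so the CPU-side
# launch overhead is unchanged; fusing the kernels removes it instead.
torch.cuda.synchronize()
```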