I tried AMP on my training pipeline. While memory usage certainly decreased by a factor of 2, the overall runtime seems to be about the same?
I ran some tests with the profiler, and it looks like the gradient-scaling step takes over 300 ms of CPU time. Doesn't gradient scaling defeat the purpose of all the speedup we get from AMP?
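For context on what that step is doing: gradient scaling multiplies the loss by a large factor before backward (so small fp16 gradients don't underflow to zero) and divides the gradients by the same factor before the optimizer step. Here is a minimal pure-Python sketch of that idea, with a toy scalar loss and a hypothetical finite-difference `backward` standing in for autograd (the real PyTorch mechanism is `torch.cuda.amp.GradScaler`):

```python
# Toy illustration of loss/gradient scaling -- not the real torch API.
SCALE = 2.0 ** 16  # a typical initial loss scale


def backward(loss_fn, w, eps=1e-6):
    """Finite-difference 'gradient' of a scalar loss (stand-in for autograd)."""
    return (loss_fn(w + eps) - loss_fn(w - eps)) / (2 * eps)


def loss(w):
    return (w - 3.0) ** 2


w = 0.0
# 1. scale the loss so tiny fp16 gradients don't underflow
scaled_grad = backward(lambda x: SCALE * loss(x), w)
# 2. unscale the gradient before the optimizer step
grad = scaled_grad / SCALE
w -= 0.1 * grad  # plain SGD step on the unscaled gradient
```

The scale/unscale itself is just an elementwise multiply over the gradients, so on its own it is cheap; a large CPU-time reading for it in the profiler may include synchronization or launch overhead rather than the arithmetic.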
Also, while I observed similar wall-clock times for AMP vs. regular training, the CUDA time + CPU time reported by the profiler seems to suggest AMP takes twice as long?
Also, unrelated to AMP, there is an aten::mul_ step taking a couple of milliseconds that appears to come from Adam. Is the Adam optimizer step done on the CPU, which would explain this extra CPU time?
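For what it's worth, Adam's update rule does contain in-place elementwise multiplies (the exponential-moving-average updates of the two moment buffers), which is a plausible source of those aten::mul_ entries. A sketch of one Adam step for a single scalar parameter, using the standard hyperparameter defaults (the function name and signature here are mine, not PyTorch's):

```python
import math


def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (illustration only).

    The m and v moving-average updates below are the elementwise
    multiplies that would appear as aten::mul_ in a profiler trace.
    """
    m = b1 * m + (1 - b1) * g          # first-moment update (a mul_)
    v = b2 * v + (1 - b2) * g * g      # second-moment update (a mul_)
    m_hat = m / (1 - b1 ** t)          # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# first step from a zero-initialized state with gradient 1.0
w, m, v = adam_step(w=0.0, g=1.0, m=0.0, v=0.0, t=1)
```

These multiplies run on whatever device the parameters live on, so seeing them under CPU time in the profiler does not by itself mean the optimizer is running on the CPU.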