High CPU Usage?

I tried AMP on my training pipeline. While the memory usage certainly decreased by a factor of 2, the overall runtime seems to be the same?

I ran some tests with the profiler, and it seems like the gradient scaling step takes over 300 ms of CPU time. Doesn't gradient scaling then defeat the purpose of all the speedup we get from AMP?

Also, while I observed similar wall-clock times for AMP vs. regular training, the reported CUDA time + CPU time from the profiler seems to suggest AMP takes twice as long?

Also, unrelated to AMP, it seems like I have an aten::mul_ step taking a couple of milliseconds that appears to come from Adam. Is the Adam optimizer run on the CPU, which would explain this extra CPU time?
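
For context, the numbers above come from a profile of a loop roughly like the following (a simplified sketch; the model, data, and optimizer below are placeholders rather than my actual pipeline):

import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Placeholder model and data; the real pipeline uses a larger network and a data loader.
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10),
).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()
x = torch.randn(8, 3, 224, 224, device="cuda")
y = torch.randint(0, 10, (8,), device="cuda")

def step():
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        loss = criterion(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

# Warm up so one-time setup cost does not end up in the profile.
for _ in range(10):
    step()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    step()

# Sort by total CUDA time; per-op CPU time (e.g. the scaler ops and aten::mul_) shows up here too.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))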

Did you check the end-to-end runtime of FP32 vs. AMP training?
If so, how long does each iteration take (after a warmup phase), and which device are you using?

Yes, I did. The reported stats are the ones after the warmup phase. I am using an RTX 3060.

I also discovered that I get a marginal speedup (roughly 25%) with AMP when using
torch.backends.cudnn.benchmark = True

Just curious why this is the case.
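
For reference, I measured the end-to-end time roughly like this (a simplified sketch; time_iterations is just a helper I wrote, and step stands for one full training iteration of my pipeline):

import time
import torch

torch.backends.cudnn.benchmark = True  # the flag in question; set to False for the baseline run

def time_iterations(step, n_warmup=20, n_timed=100):
    # Warm up first so cuDNN algorithm selection and lazy CUDA init are excluded.
    for _ in range(n_warmup):
        step()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_timed):
        step()
    torch.cuda.synchronize()  # wait for queued kernels before stopping the clock
    return (time.perf_counter() - start) / n_timed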

By default TF32 is already used, so the TensorCores on your device will already be utilized, which might not leave a lot of additional performance benefit for AMP. This issue and the related double post might be interesting for you.
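
If you want to separate the TF32 effect from AMP, a minimal sketch to check and toggle the TF32 flags (disabling them gives a pure FP32 baseline to compare against):

import torch

# TF32 usage is controlled by these two flags (enabled by default in the current releases);
# disabling them gives a "pure" FP32 baseline on Ampere GPUs.
print(torch.backends.cuda.matmul.allow_tf32)
print(torch.backends.cudnn.allow_tf32)

torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
# ... time FP32 training here, then re-enable TF32 and/or switch to AMP and compare.
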
In any case, you could create profiles with a visual profiler such as Nsight Systems and check for other bottlenecks, such as data loading, which might currently be limiting the overall step time.
Also refer to the Performance Tuning Guide and, if possible, try out the latest CUDA + cudnn versions by building from source, which could yield additional performance improvements.

Thanks for the information!

So TensorCores are used by default in PyTorch already? I did use Nsight Systems, but the results were very difficult to analyze. That could be because TensorCores are already being used.

Just some quick follow up questions:

  1. Why does torch.backends.cudnn.benchmark = True improve speed? Does it utilize CUDA cores in conjunction with TensorCores? I am seeing more kernels being used when monitoring with Nsight Systems. It also seems like this setting only improves the performance of the AMP code, but not of regular FP32 training. Do you know why this is the case?

  2. If TensorCores are used by default, are there any advantages to using AMP (other than the optimization in question 1)? It seems like the gradient scaler already takes a lot of CPU time, and I am also afraid of AMP potentially causing instabilities in network training.

  3. Are newer versions of CUDA supported by PyTorch? I am using CUDA 11.1.0, which is the latest CUDA version supported by PyTorch according to the website.

  4. One of the posts you linked mentioned using the channels-last memory format for an additional performance increase. However, when I tried this with AMP, I actually observed a performance decrease. Can this happen, or is something wrong with my setup?

Thanks again for the help.

On Ampere GPUs, yes, since the TF32 type is used.

  1. cudnn.benchmark = True will profile different algorithms for each convolution and select the fastest one. The selected algorithm is cached per input shape, so a new input shape triggers the benchmarking again. The Performance Tuning Guide explains it in more detail. It's also not specific to AMP.

  2. Yes, AMP can still give an additional speedup over TF32, although the gain might not be as large as the difference between pure FP32 training and AMP.

  3. Newer CUDA versions are supported. The pip wheels and conda binaries ship with CUDA 10.2 and 11.1 at the moment, but you can build from source with any toolkit >= 10.0.

  4. The channels-last memory format could yield a speedup when using AMP (refer to the performance guide); a basic conversion sketch is below. If you are seeing a slowdown, please post the model definition and input shapes that reproduce the issue.
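
Regarding 4., a minimal sketch of the channels-last conversion, using a placeholder model and input shape:

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1),
).cuda()

# Convert both the parameters and the inputs; 4D convolution workloads benefit the most.
model = model.to(memory_format=torch.channels_last)
x = torch.randn(32, 3, 224, 224, device="cuda").to(memory_format=torch.channels_last)

with torch.cuda.amp.autocast():
    out = model(x)

# The output stays in channels-last as long as the ops preserve the memory format.
print(out.is_contiguous(memory_format=torch.channels_last))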

Okay, thank you for the help!