AMP autocast not faster than FP32

ptrblck · August 27, 2021, 10:15pm

Your A6000 would use TF32 by default and would thus already speed up your workload using TensorCores, so the FP32 baseline you are comparing against is not a pure FP32 run. This post has additional information, but skip the “channels-last” part since you are working on a language model.
Additionally, this post discusses a similar issue and provides a profile of the workload.
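
To check how much of the speedup TF32 already gives you, you could time a matmul with TF32 disabled, with TF32 enabled, and with autocast. This is only a rough sketch: the helper names `set_tf32` and `time_matmul`, the 4096x4096 shapes, and the iteration counts are my own placeholders and not taken from your model, so treat the numbers as indicative only.

```python
import time
import torch


def set_tf32(enabled: bool) -> None:
    """Toggle TF32 for matmuls and cuDNN convolutions (only matters on Ampere or newer GPUs)."""
    torch.backends.cuda.matmul.allow_tf32 = enabled
    torch.backends.cudnn.allow_tf32 = enabled


def time_matmul(a, b, use_autocast: bool, iters: int = 50) -> float:
    """Return the average seconds per matmul, synchronizing around the timed region."""
    with torch.autocast(device_type="cuda", dtype=torch.float16, enabled=use_autocast):
        for _ in range(5):  # warm-up iterations, not timed
            a @ b
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            a @ b
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters


if torch.cuda.is_available():
    # Placeholder shapes; use sizes that resemble your own layers.
    a = torch.randn(4096, 4096, device="cuda")
    b = torch.randn(4096, 4096, device="cuda")
    for tf32, amp in [(False, False), (True, False), (True, True)]:
        set_tf32(tf32)
        ms = time_matmul(a, b, amp) * 1e3
        print(f"TF32={tf32} autocast={amp}: {ms:.2f} ms")
```

If the TF32=True and autocast timings are close, the remaining gap in your training run is likely coming from somewhere other than the matmuls, and a profile of the full workload would be the next step.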