AMP autocast not faster than FP32

Your A6000 would use TF32 by default and would thus already speed up your workload using Tensor Cores, so the FP32 baseline you are comparing against is not pure FP32. This post has additional information, but skip the “channels-last” part since you are working on a language model.
Additionally, this post discusses a similar issue and provides a profile of the workload.
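
If you want to check this on your own machine, here is a rough sketch that reads the TF32 flags and times a plain matmul with and without autocast. The helper names `tf32_flags` and `time_matmul` are made up for this example, the matrix size is arbitrary, and it falls back to CPU with bfloat16 if no GPU is visible. Whether TF32 is enabled for matmuls depends on your PyTorch version, so it is worth printing the flags rather than assuming.

```python
import time

import torch


def tf32_flags():
    """Return the current TF32 settings for cuBLAS matmuls and cuDNN convolutions."""
    return {
        "matmul": torch.backends.cuda.matmul.allow_tf32,
        "cudnn": torch.backends.cudnn.allow_tf32,
    }


def time_matmul(use_autocast, n=4096, iters=20):
    """Time an n x n matmul; returns (seconds per iteration, output dtype).

    Hypothetical helper for illustration only, not a rigorous benchmark.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # float16 is the usual autocast dtype on GPU; CPU autocast uses bfloat16.
    amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16

    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)

    def run(count):
        out = None
        with torch.autocast(device_type=device, dtype=amp_dtype, enabled=use_autocast):
            for _ in range(count):
                out = a @ b
        # CUDA kernels launch asynchronously, so wait before reading the clock.
        if device == "cuda":
            torch.cuda.synchronize()
        return out

    run(1)  # warm-up iteration, excluded from the timing
    start = time.perf_counter()
    out = run(iters)
    elapsed = (time.perf_counter() - start) / iters
    return elapsed, out.dtype


if __name__ == "__main__":
    print("TF32 flags:", tf32_flags())
    print("no autocast:", time_matmul(False))
    print("autocast:   ", time_matmul(True))
```

To get a true FP32 baseline on an Ampere GPU you can set `torch.backends.cuda.matmul.allow_tf32 = False` and `torch.backends.cudnn.allow_tf32 = False` before timing. If the gap between TF32 and autocast is small for your model, a profile of the full workload will tell you more than this isolated matmul.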
