Your A6000 would use TF32 by default and would thus already speed up your workload using Tensor Cores. This post has additional information, but skip the “channels-last” part since you are working on a language model.
Additionally, this post discusses a similar issue and provides a profile of the workload.
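If you want to confirm that TF32 is actually enabled in your setup, here is a minimal sketch. It assumes a reasonably recent PyTorch release; note that in newer releases (1.12 and later, as far as I know) TF32 is on by default only for cuDNN convolutions, while matmuls need to be opted in, so setting both flags explicitly is the safe choice. Newer releases may also offer other ways to control this, so check the docs for your version.

```python
import torch

# Allow TF32 on Ampere-or-newer GPUs (e.g. the A6000).
# These flags only have an effect on GPUs with TF32-capable Tensor Cores.
torch.backends.cuda.matmul.allow_tf32 = True  # float32 matmuls via cuBLAS
torch.backends.cudnn.allow_tf32 = True        # float32 convolutions via cuDNN

# Read the flags back to verify the current settings.
print("matmul TF32 allowed:", torch.backends.cuda.matmul.allow_tf32)
print("cuDNN TF32 allowed:", torch.backends.cudnn.allow_tf32)
```

Set these once at the start of your script, before the model runs. TF32 trades a bit of float32 precision for speed, so it is worth comparing your loss curves with the flags on and off if you see any numerical differences.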