Torch.cuda.amp cannot speed up on A100

ptrblck · May 7, 2021, 8:00am

I’m not familiar with this model, but note that you are already using TensorCores on the A100, since TF32 is enabled by default. With proper synchronization I get the following runtimes:

FP32: 102s
TF32: 50s
AMP: 47s
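
For reference, a minimal sketch of how such a comparison could be timed. It is not the script used for the numbers above: the `timed` helper, the matrix size, and the matmul-only workload are assumptions for illustration. CUDA kernels launch asynchronously, so the host timer is only meaningful if `torch.cuda.synchronize()` is called before starting and before stopping it.

```python
import time
import torch

def timed(fn, iters=10, warmup=3):
    """Average wall-clock seconds per call (hypothetical helper)."""
    for _ in range(warmup):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for warmup kernels before starting the timer
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for all timed kernels before stopping the timer
    return (time.perf_counter() - start) / iters

# Falls back to CPU so the sketch still runs without a GPU (timings are then not meaningful).
device = "cuda" if torch.cuda.is_available() else "cpu"
low = torch.float16 if device == "cuda" else torch.bfloat16
a = torch.randn(512, 512, device=device)
b = torch.randn(512, 512, device=device)

results = {}
# Set the TF32 flags explicitly; the matmul default differs between PyTorch releases.
for name, tf32 in (("FP32", False), ("TF32", True)):
    torch.backends.cuda.matmul.allow_tf32 = tf32
    torch.backends.cudnn.allow_tf32 = tf32
    results[name] = timed(lambda: a @ b)

def amp_step():
    with torch.autocast(device_type=device, dtype=low):
        return a @ b

results["AMP"] = timed(amp_step)

for name, seconds in results.items():
    print(f"{name}: {seconds * 1e3:.3f} ms per call")
```

On a real model you would time full training iterations instead of a single matmul, but the synchronization pattern stays the same.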

Based on the profile, it also seems that the majority of kernels are (vectorized/unrolled) elementwise kernels, and a cuBLAS kernel is called only sporadically:
[Profiler screenshots: FP32 and FP16]

Based on the ops that can autocast to float16, it seems that not many of these operations are used.
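
To illustrate why an elementwise-heavy model gains little from AMP, here is a small sketch that checks the output dtypes inside an autocast region. The tensor shapes and the choice of `torch.mm` and `torch.relu` are assumptions for illustration, not taken from the model in question: `torch.mm` is on the autocast lower-precision list, while `torch.relu` is not and simply runs in the dtype of its input.

```python
import torch

# Falls back to CPU autocast (bfloat16) when no GPU is present, so the sketch still runs.
device = "cuda" if torch.cuda.is_available() else "cpu"
low = torch.float16 if device == "cuda" else torch.bfloat16

x = torch.randn(64, 64, device=device)  # float32 inputs
w = torch.randn(64, 64, device=device)

with torch.autocast(device_type=device, dtype=low):
    mm_out = torch.mm(x, w)        # cast to the lower-precision dtype by autocast
    ew_out = torch.relu(x) + 1.0   # elementwise ops keep the float32 input dtype

print("mm output dtype:         ", mm_out.dtype)
print("elementwise output dtype:", ew_out.dtype)
```

If most of the runtime is spent in kernels like the second line, autocast leaves them in float32 and the speedup over TF32 stays small.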