FP16 and BF16 way slower than FP32 and TF32

I cannot reproduce the slowdown using a recent PyTorch build with CUDA 11.6 and see:

benching FP32...
epoch 0 took 5.307978553988505s
epoch 1 took 3.9439379789982922s
epoch 2 took 3.9074656259908807s

benching TF32...
epoch 0 took 4.101563502015779s
epoch 1 took 3.946826447005151s
epoch 2 took 4.008894422004232s

benching FP16...
epoch 0 took 4.420351507025771s
epoch 1 took 4.3506982740073s
epoch 2 took 4.440277656976832s

benching BF16...
epoch 0 took 4.17264149300172s
epoch 1 took 4.073878110008081s
epoch 2 took 4.178816699975869s

A few things to note about your profiling:

  • You would need to synchronize before starting the timer as well, not only before stopping it, in case some kernels are still running (e.g. the transfer of the weights to the GPU).
  • You are profiling a full training run, including the DataLoader and the data transfer to the GPU, with a tiny model. Depending on your system, the actual model runtime might be tiny and the overhead from data loading etc. might dominate, so the actual model speedups won’t be directly visible. If you want to compare the speed of different numerical precisions, I would recommend profiling the model in isolation first.
  • Your model workload is small, so even if the lower-precision dtypes give a speedup, the overhead of the kernel launches, the dispatching etc. might be visible. If this small model is your real workload, you might want to try CUDA Graphs.
  • To enable bfloat16 computations in conv layers, set os.environ["TORCH_CUDNN_V8_API_ENABLED"] = "1" or export this env variable. However, bfloat16 is accelerated on Ampere+ GPUs, so your Turing GPU might not see any benefits.
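To illustrate the first two points, here is a minimal sketch of profiling a model in isolation with proper synchronization on both sides of the timer. The model and shapes are hypothetical stand-ins; swap in your own module and inputs. The sketch falls back to CPU (float32 only) when no GPU is available so it stays runnable anywhere:

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical stand-in model; replace with your own module.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
).to(device)
x = torch.randn(64, 1024, device=device)

def bench(dtype, iters=50, warmup=10):
    m = model.to(dtype)
    inp = x.to(dtype)
    with torch.no_grad():
        # warmup iterations so lazy init, cudnn autotuning etc. don't skew the timing
        for _ in range(warmup):
            m(inp)
        if device == "cuda":
            # wait for all pending kernels BEFORE starting the timer
            torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(iters):
            m(inp)
        if device == "cuda":
            # and again before stopping it, since launches are asynchronous
            torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

dtypes = [torch.float32]
if device == "cuda":
    dtypes += [torch.float16, torch.bfloat16]
for dtype in dtypes:
    print(dtype, f"{bench(dtype) * 1e3:.3f} ms/iter")
```

Without the first `torch.cuda.synchronize()`, kernels still in flight (e.g. the initial weight transfer) would be billed to the first timed region, which is exactly the kind of skew that makes one dtype look slower than another.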
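For the launch-overhead point, a rough sketch of capturing a small model's forward pass with CUDA Graphs, so the whole sequence of kernels is replayed with a single launch. The model and shapes are again hypothetical placeholders; the warmup-on-a-side-stream pattern and the static input/output buffers follow the standard `torch.cuda.graph` usage, and the whole thing is guarded since graphs require a GPU:

```python
import torch

if torch.cuda.is_available():
    # Hypothetical tiny model; launch overhead dominates for workloads like this.
    model = torch.nn.Linear(1024, 1024).cuda()
    static_input = torch.randn(64, 1024, device="cuda")

    # Warmup on a side stream is required before capture.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s), torch.no_grad():
        for _ in range(3):
            model(static_input)
    torch.cuda.current_stream().wait_stream(s)

    # Capture the forward pass into a graph.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g), torch.no_grad():
        static_output = model(static_input)

    # Replay: copy new data into the static input buffer, then launch
    # the entire captured graph at once instead of kernel by kernel.
    static_input.copy_(torch.randn(64, 1024, device="cuda"))
    g.replay()
    print(static_output.shape)
else:
    print("CUDA not available; CUDA Graphs require a GPU")
```

Note that captured graphs work on fixed buffers: new inputs must be copied into `static_input`, and results read out of `static_output`, rather than passing fresh tensors each iteration.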