I cannot reproduce a slowdown using a recent PyTorch build with CUDA 11.6 and see:
benching FP32...
epoch 0 took 5.307978553988505s
epoch 1 took 3.9439379789982922s
epoch 2 took 3.9074656259908807s
benching TF32...
epoch 0 took 4.101563502015779s
epoch 1 took 3.946826447005151s
epoch 2 took 4.008894422004232s
benching FP16...
epoch 0 took 4.420351507025771s
epoch 1 took 4.3506982740073s
epoch 2 took 4.440277656976832s
benching BF16...
epoch 0 took 4.17264149300172s
epoch 1 took 4.073878110008081s
epoch 2 took 4.178816699975869s
A few things to note about your profiling:
- You would need to synchronize also before starting the timer, not only before stopping it, in case some kernels are still running (e.g. the transfer of the weights to the GPU); see the timing sketch after this list.
- You are profiling a full training run including the `DataLoader`, the data transfer to the GPU, etc., and are using a tiny model. Depending on your system the actual model runtime might be tiny and you might see a large overhead from the data loading, so that actual model speedups won't be directly visible. If you want to compare different numerical precisions and their speed, I would recommend profiling the model in isolation first (a sketch follows this list).
- Your model workload is small, so even if lower-precision `dtype`s give a speedup, the overhead of the kernel launches, the dispatching, etc. might still be visible. If this small model is your real workload, you might want to try CUDA graphs (see the second sketch below).
- To enable `bfloat16` calculations in conv layers, set `os.environ["TORCH_CUDNN_V8_API_ENABLED"] = "1"` or `export` this env variable. However, `bfloat16` should be enabled for Ampere+, so your Turing GPU might not see any benefits.
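
To time the model in isolation, a minimal sketch could look like the following. The `bench` helper, the synthetic conv model, and the shapes are made up for illustration; the important parts are the `torch.cuda.synchronize()` calls before starting and before stopping the timer, and that no `DataLoader` or host-to-device copies land inside the timed region:

```python
import os
# Assumption: set before importing torch to enable bfloat16 convs via the cuDNN v8 API
# (only needed on older releases; newer ones enable it by default).
os.environ["TORCH_CUDNN_V8_API_ENABLED"] = "1"

import time
import torch
import torch.nn as nn


def bench(model, x, dtype, iters=100):
    # Hypothetical helper: times forward + backward of the model only.
    model = model.to(device="cuda", dtype=dtype)
    x = x.to(device="cuda", dtype=dtype)
    for _ in range(10):  # warmup so cuDNN autotuning etc. doesn't skew the result
        model(x).sum().backward()
    torch.cuda.synchronize()  # wait for warmup kernels and the H2D copies above
    t0 = time.perf_counter()
    for _ in range(iters):
        model(x).sum().backward()
    torch.cuda.synchronize()  # wait for all queued kernels before stopping the timer
    return (time.perf_counter() - t0) / iters


# Assumption: a synthetic conv workload; replace it with your real model.
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.Conv2d(64, 64, 3, padding=1)
)
x = torch.randn(64, 3, 224, 224)
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    print(dtype, bench(model, x, dtype))
```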
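
For the CUDA graphs suggestion, a minimal sketch using `torch.cuda.make_graphed_callables` could look like this (the small linear model and the shapes are placeholders; the captured input shapes must stay static):

```python
import torch
import torch.nn as nn

# Assumption: a small model with static input shapes, which CUDA graphs require.
model = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 10)).cuda()
sample = torch.randn(32, 128, device="cuda")

# Capture the model into CUDA graphs; each replay launches all captured kernels
# at once, which amortizes the per-kernel launch overhead for tiny workloads.
graphed_model = torch.cuda.make_graphed_callables(model, (sample,))

x = torch.randn(32, 128, device="cuda")  # must keep the same shape as the sample
out = graphed_model(x)
out.sum().backward()  # the backward pass is graphed as well
```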