Half precision training time same as full precision

Hi all,

I am training a model (~25M params) which takes ~1 day to train on the full dataset. In an attempt to reduce the training time I am testing half precision. Using half precision allowed me to double the batch size, yet the overall training time remains the same.

I was under the impression that doubling the batch size would cut the training time roughly in half, since many operations can be parallelized over the batch dimension. But this does not seem to be the case. Could this be a hardware (GPU architecture) or CUDA version problem? Or is my understanding of the relationship between batch size and speed just plain wrong?

I have tried casting the samples to bfloat16 while keeping the weights in float32, as well as casting both samples and weights to bfloat16; both give the same training time as running everything in float32. I am using an NVIDIA Quadro GV100 with 32 GB of VRAM and CUDA 11.4.
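For reference, a minimal sketch of how the mixed-precision run looks using `torch.autocast` instead of manual casting (the tiny model, random data, and hyperparameters below are placeholders, not my actual setup):

```python
# Minimal mixed-precision sketch with torch.autocast; the model, data, and
# hyperparameters are placeholders, not my actual 25M-parameter setup.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = "cuda"
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
loader = DataLoader(
    TensorDataset(torch.randn(4096, 512), torch.randint(0, 10, (4096,))),
    batch_size=256,
)
scaler = torch.cuda.amp.GradScaler()  # needed for float16; bfloat16 does not need it

for samples, targets in loader:
    samples, targets = samples.to(device), targets.to(device)
    optimizer.zero_grad(set_to_none=True)
    # Weights stay in float32; autocast runs eligible ops (e.g. matmuls) in half precision.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = criterion(model(samples), targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```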

Thanks

Profile your code to narrow down the bottleneck. It seems doubling the batch size is doubling the per-iteration execution time, which could point to a data loading bottleneck.
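A quick way to check, as a sketch with a stand-in model and dataset (not your actual code), is to time how long each step spends waiting on the DataLoader versus doing the forward/backward pass:

```python
# Sketch: split per-step time into "waiting for data" vs. "compute".
# Model and dataset are stand-ins for the real training code.
import time
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = "cuda"
model = nn.Linear(512, 10).to(device)
loader = DataLoader(
    TensorDataset(torch.randn(8192, 512), torch.randint(0, 10, (8192,))),
    batch_size=256,
    num_workers=0,
)

data_time = compute_time = 0.0
end = time.perf_counter()
for samples, targets in loader:
    data_time += time.perf_counter() - end   # time spent waiting on the DataLoader
    t0 = time.perf_counter()
    loss = nn.functional.cross_entropy(model(samples.to(device)), targets.to(device))
    loss.backward()
    torch.cuda.synchronize()                 # include the GPU work in the timing
    compute_time += time.perf_counter() - t0
    end = time.perf_counter()

print(f"data loading: {data_time:.2f}s, compute: {compute_time:.2f}s")
```

If the data-loading time dominates, the GPU is starving for input regardless of the compute dtype.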

Hi @ptrblck,

Thank you for your response, you were spot on. After some digging I found that there was indeed a data loading bottleneck, as you pointed out; running the DataLoader with more workers alleviates the problem.
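For anyone hitting the same issue, this is roughly the change (the dummy dataset and the specific numbers are placeholders):

```python
# Sketch of the DataLoader change: worker processes prepare batches in the
# background while the GPU is busy. Dataset and batch size are placeholders.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(8192, 512), torch.randint(0, 10, (8192,)))
loader = DataLoader(
    dataset,
    batch_size=256,
    shuffle=True,
    num_workers=8,            # load/preprocess batches in parallel worker processes
    pin_memory=True,          # enables faster, asynchronous host-to-GPU copies
    persistent_workers=True,  # keep workers alive across epochs (requires num_workers > 0)
)
```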

However, I also found that using bfloat16 is noticeably slower than float16. My guess is that my GPU architecture or CUDA version is the culprit here, but if anyone has ideas on how to close the speed gap between bfloat16 and float16, I welcome them.
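For anyone who wants to reproduce the comparison, a rough matmul micro-benchmark along these lines (arbitrary sizes, not my model) isolates the dtype difference from the rest of the training loop:

```python
# Rough micro-benchmark: average matmul time per dtype on the current GPU.
# Sizes and iteration counts are arbitrary placeholders.
import time
import torch

def bench(dtype, n=4096, iters=50):
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    for _ in range(5):                 # warm-up
        a @ b
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    print(dtype, f"{bench(dtype) * 1e3:.2f} ms per matmul")
```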

I have tried running the same model on an RTX 4090 with CUDA 12, and bfloat16 is now as fast as float16. This indicates that the Quadro GV100 indeed does not support the bfloat16 speedup.

Yes, this is correct, as native bfloat16 is supported on Ampere and newer devices. Older devices can emulate it, but you should not expect to see a speedup.
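A quick way to check what a given card reports (a sketch; Ampere corresponds to compute capability 8.0):

```python
# Sketch: print the device name and compute capability; Ampere and newer
# report a major compute capability >= 8, which is where native bfloat16 lands.
import torch

print(torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))
```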