AMP on GPUs with low half-precision throughput and/or no Tensor Cores

My understanding was that GPUs with poor half-precision throughput would be a big obstacle for running AMP. For example, compare the following GPUs:

GTX 1080 Ti
Single Precision GFLOPS: 10609
Half Precision GFLOPS: 166
Tensor Cores: 0

Compare this with an RTX GPU like the 2080 Ti:

RTX 2080 Ti
Single Precision GFLOPS: 11750
Half Precision GFLOPS: 23500
Tensor Cores: 544
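The gap is easier to see as a ratio. A quick check of the numbers quoted above (illustrative arithmetic only, using the spec figures from this post):

```python
# FP16:FP32 throughput ratios from the spec figures quoted above.
gpus = {
    "GTX 1080 Ti": {"fp32_gflops": 10609, "fp16_gflops": 166},
    "RTX 2080 Ti": {"fp32_gflops": 11750, "fp16_gflops": 23500},
}

for name, spec in gpus.items():
    ratio = spec["fp16_gflops"] / spec["fp32_gflops"]
    print(f"{name}: FP16 runs at {ratio:.3f}x FP32 throughput")
# GTX 1080 Ti: ~0.016x (roughly 1/64 rate)
# RTX 2080 Ti: 2.000x (full double-rate FP16)
```

So the 1080 Ti executes fp16 math at about 1/64 of its fp32 rate, while the 2080 Ti runs fp16 at twice its fp32 rate.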

With AMP/mixed precision, many of the operations take place at half precision, which gives you large memory savings and a speedup.
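The memory half of that claim doesn't depend on the GPU's fp16 ALU rate at all: an fp16 value simply occupies half the bytes of an fp32 value. A minimal illustration using only the standard library (`'e'` is the `struct` format code for IEEE 754 half precision; the 10-million-element tensor is a made-up example size, not from the post):

```python
import struct

# struct format codes: 'e' = IEEE 754 half precision, 'f' = single precision.
half_bytes = struct.calcsize("e")    # 2 bytes per value
single_bytes = struct.calcsize("f")  # 4 bytes per value

# Storage for a hypothetical tensor of 10 million activations:
n = 10_000_000
print(f"fp32: {n * single_bytes / 1e6:.0f} MB, fp16: {n * half_bytes / 1e6:.0f} MB")
# fp32: 40 MB, fp16: 20 MB
```

Halved activation storage is what lets batch size nearly double, and halved memory traffic is where much of the speedup can come from even without fast fp16 math.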

I tested this tonight with some models on my 1080 Tis (4 of them, using DataParallel), and I was blown away: epochs complete 30% faster, and I could almost double my batch size. I don't understand why it wasn't dog slow, since the half-precision GFLOPS on a 1080 Ti are abysmal. Is AMP doing some sort of trickery, such as packing two 16-bit numbers into a 32-bit register?

It’s possible your network is mostly bandwidth-bound rather than compute-bound, in which case the halved memory traffic for fp16 ops allows a speedup. It’s also probable that many PyTorch kernels compute internally in fp32 even when the input/output is fp16, so the lack of raw fp16 throughput isn’t a problem. The same may be true for CUDA library calls (GEMMs/convolutions): they can use fp32 compute internally for fp16 input/output, so compute throughput is no worse than fp32 while the required bandwidth is reduced.
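The bandwidth-bound point can be made concrete with a rough roofline-style estimate (illustrative numbers, not measurements, and not from the post). An elementwise add `out = a + b` over n elements performs n FLOPs but moves 3n elements of data; a 1080 Ti-class GPU can sustain on the order of 20 FLOPs per byte of DRAM traffic (roughly 10.6 TFLOP/s against roughly 480 GB/s), so an op needing well under 1 FLOP/byte is limited by bandwidth, and halving the bytes per element roughly halves its runtime regardless of the fp16 ALU rate:

```python
# Illustrative roofline-style estimate for an elementwise add out = a + b.
def arithmetic_intensity(n_elems, bytes_per_elem):
    flops = n_elems                              # one add per element
    bytes_moved = 3 * n_elems * bytes_per_elem   # read a, read b, write out
    return flops / bytes_moved                   # FLOPs per byte of traffic

fp32_ai = arithmetic_intensity(1_000_000, 4)
fp16_ai = arithmetic_intensity(1_000_000, 2)
print(f"fp32: {fp32_ai:.3f} FLOP/byte, fp16: {fp16_ai:.3f} FLOP/byte")
# fp32: 0.083 FLOP/byte, fp16: 0.167 FLOP/byte
```

Both figures sit far below the machine balance, so the op is bandwidth-bound either way; fp16 just halves the bytes moved.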
