My understanding was that having GPUs with poor half-precision performance would be a big obstacle to running AMP. For example, compare the following GPUs:

| | GTX 1080 Ti | RTX 2080 Ti |
|---|---|---|
| Single precision GFLOPS | 10,609 | 11,750 |
| Half precision GFLOPS | **166** | **23,500** |
| Tensor Cores | 0 | 544 |

In AMP/mixed precision, many of the operations take place at half precision, which yields substantial memory savings and speed.
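The memory-savings half of that claim is easy to demonstrate independently of any GPU. A minimal sketch using NumPy (whose `float16`/`float32` dtypes have the same sizes as PyTorch's) shows why storing tensors in half precision roughly doubles the batch size that fits in memory; the tensor shape here is just an illustrative stand-in for a batch of activations:

```python
import numpy as np

# A hypothetical batch of activations: 64 feature maps of 512 x 512.
fp32 = np.ones((64, 512, 512), dtype=np.float32)
fp16 = fp32.astype(np.float16)

# float16 uses exactly half the bytes of float32, which is why AMP
# lets you nearly double the batch size before exhausting GPU memory.
print(fp32.nbytes // 2**20, "MiB")  # 64 MiB
print(fp16.nbytes // 2**20, "MiB")  # 32 MiB
```

The speed side is a separate question, since it depends on the card's half-precision throughput rather than on storage size.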

I tested this with some models tonight on my 1080 Tis (four of them, using DataParallel), and I was blown away. Epochs completed about 30% faster, and I could almost double my batch size. So I don't understand why it wasn't dog-slow, given that the half-precision GFLOPS on a 1080 Ti are abysmal. Is AMP doing some sort of trickery, such as packing two 16-bit numbers into a 32-bit register?