My understanding was that having GPUs with poor half-precision performance would be a big obstacle to running AMP. For example, compare the following GPUs:

| | GTX 1080 Ti | RTX 2080 Ti |
|---|---|---|
| Single precision GFLOPS | 10,609 | 11,750 |
| Half precision GFLOPS | **166** | **23,500** |
| Tensor Cores | 0 | 544 |

In AMP/mixed precision, many of the operations take place at half precision, which yields substantial memory savings and speed.
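The memory-savings half of that claim is easy to demonstrate independently of any GPU. A minimal sketch using NumPy (whose `float16`/`float32` dtypes have the same sizes as PyTorch's) shows why storing tensors in half precision roughly doubles the batch size that fits in memory; the tensor shape here is just an illustrative stand-in for a batch of activations:

```python
import numpy as np

# A hypothetical batch of activations: 64 feature maps of 512 x 512.
fp32 = np.ones((64, 512, 512), dtype=np.float32)
fp16 = fp32.astype(np.float16)

# float16 uses exactly half the bytes of float32, which is why AMP
# lets you nearly double the batch size before exhausting GPU memory.
print(fp32.nbytes // 2**20, "MiB")  # 64 MiB
print(fp16.nbytes // 2**20, "MiB")  # 32 MiB
```

The speed side is a separate question, since it depends on the card's half-precision throughput rather than on storage size.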

I tested this with some models tonight on my 1080 Tis (four of them, using DataParallel), and I was blown away. Epochs completed about 30% faster, and I could almost double my batch size. So I don't understand why it wasn't dog-slow, given that the half-precision GFLOPS on a 1080 Ti are abysmal. Is AMP doing some sort of trickery, such as packing two 16-bit numbers into a 32-bit register?