Half vs Full Precision with CUDA

I am comparing a half-precision workload (tensors are torch.HalfTensor) to a full-precision workload (tensors are torch.FloatTensor) using a command-line profiling tool from NVIDIA that reports FLOP counts.
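
For context, here is a minimal sketch of the kind of comparison I mean (the matmul workload and tensor shapes here are just an illustration, not my actual model):

```python
import torch

def bench(dtype):
    # Same shapes and op count for both runs; only the dtype differs.
    a = torch.randn(1024, 1024, device="cuda", dtype=dtype)
    b = torch.randn(1024, 1024, device="cuda", dtype=dtype)
    for _ in range(100):
        c = a @ b
    torch.cuda.synchronize()

bench(torch.float16)  # half-precision run; torch.float32 for the baseline
```

I then run this under the profiler with something like `nvprof --metrics flop_count_sp,flop_count_hp python bench.py` to get the single- and half-precision FLOP counters.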

I noticed that even when running the model with half-precision tensors, the tool reports that only single-precision operations happened, albeit far fewer FLOPs than occurred with the full-precision model.

My suspicion is that there are some bit-packing optimizations occurring somewhere in the PyTorch/CUDA library bindings.

I’d like to confirm this, or if that’s not right, hear what the explanation is. Thanks!

Are you using GPUs that support half precision, like the NVIDIA 20xx series? If not, then the number of FLOPs between half and full should be the same, since you still have to use the same number of registers on the GPU.
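
You can check this from PyTorch directly; a rough sketch (the capability cutoffs below come from NVIDIA's docs, where native fp16 arithmetic starts at compute capability 5.3):

```python
import torch

# Native fp16 arithmetic starts at compute capability 5.3 (e.g. Tegra X1)
# and is fully supported on 6.0/6.2 and 7.x+; many consumer Pascal cards
# (6.1) expose it only at a heavily throttled rate.
major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")
print("fp16 arithmetic supported:", (major, minor) >= (5, 3))
```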

Yes, it’s a Jetson Tegra X1, so I believe it supports half.
Interestingly, I am seeing a similar number of FLOPs when comparing half precision to full precision (actually a few more for half precision than full precision), but ~1/2 the number of bytes are being fetched from memory.
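
The byte counts line up with the element sizes, which is easy to sanity-check from PyTorch (a trivial sketch):

```python
import torch

x32 = torch.randn(1024, 1024, device="cuda")
x16 = x32.half()

# 4 bytes per fp32 element vs. 2 bytes per fp16 element, which matches
# the ~1/2 memory traffic I'm seeing in the profiler.
print(x32.element_size(), x32.nelement() * x32.element_size())
print(x16.element_size(), x16.nelement() * x16.element_size())
```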

Sorry, I should have been clearer. GFLOPs count the number of operations; half precision in CUDA is about the actual size of the data. The number of ops should be the same.

The number of FLOPs I am seeing is a little bit higher on half precision than full precision, and the bytes fetched from memory are half (which makes sense).
My leading theory for the discrepancy (which I am looking for more information about) is that the half-precision data are being bit-packed into 32-bit words to be sent on chip, and then there are operations that unpack the data so they can be operated on as 16-bit values once they are on chip?

You are seeing a higher GFLOP count on half precision because of conversion ops, where 32-bit data are converted into 16-bit.
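
As a minimal sketch of where such casts sneak in (the shapes and the explicit `.half()`/`.float()` calls are just an illustration; in a real model they often come from inputs or loss code left in fp32):

```python
import torch

w = torch.randn(1024, 1024, device="cuda").half()  # fp16 weights
x = torch.randn(64, 1024, device="cuda")           # input left in fp32

# The .half() cast launches a float -> half conversion kernel, and the
# .float() upcast launches half -> float; both show up as extra ops in
# the profile even though the matmul itself runs in fp16.
y = x.half() @ w
loss = y.float().sum()
```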

Also answered here.