Tegra X1 half precision data with PyTorch

One advantage of FP16 operations is that the memory bandwidth requirement is reduced, as only half the data needs to be transferred to the registers for the operation (and back to global device memory).
Also, Tensor Cores can be used for suitable operations such as matrix multiplications and convolutions.
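For illustration, here is a minimal sketch of what that looks like in PyTorch (the tensor names and sizes are arbitrary): casting to `.half()` halves the per-element storage, so less data has to be moved, and the FP16 matmul can be dispatched to Tensor Cores on hardware that supports them.

```python
import torch

# A small sketch, assuming a CUDA device is available (falls back to CPU otherwise).
device = "cuda" if torch.cuda.is_available() else "cpu"

a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)

# Same data in half precision: half the bytes per element to move around.
a_fp16, b_fp16 = a.half(), b.half()
print(a.element_size(), a_fp16.element_size())  # 4 bytes vs. 2 bytes per element

# FP16 matmul; on GPUs with Tensor Cores this can use them internally.
c_fp16 = a_fp16 @ b_fp16
```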

I’m not sure why the number of operations should decrease. Could you post a reference for this idea?