CNN fp16 slower than fp32 on Tesla P100

ptrblck · January 10, 2020, 7:49am

What kind of operations are you using, and which cuDNN version in particular?
If you are using cuDNN 7.3 or later, convolutions should use TensorCores for FP16 inputs.
GEMMs (used e.g. in linear layers), however, have a size restriction: for matrix A x matrix B, where A has size [I, J] and B has size [J, K], all of I, J, and K must be multiples of 8 to use TensorCores. This requirement exists for all cuBLAS and cuDNN versions.
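
To make the multiple-of-8 rule concrete, here is a minimal sketch in plain Python that checks the three GEMM dimensions and rounds them up to the next multiple of 8. The helper names (`round_up`, `gemm_dims_ok`) and the example layer sizes are made up for illustration and are not part of PyTorch, cuBLAS, or cuDNN; the check only mirrors the size rule stated above and does not query what the libraries actually dispatch.

```python
def round_up(n, multiple=8):
    """Round n up to the nearest multiple (hypothetical helper)."""
    return ((n + multiple - 1) // multiple) * multiple


def gemm_dims_ok(i, j, k, multiple=8):
    """True if all of I, J, K satisfy the multiple-of-8 rule for A[I, J] x B[J, K]."""
    return all(d % multiple == 0 for d in (i, j, k))


# Assumed example: a linear layer with batch size 30, 1000 input features
# and 10 output features, i.e. I=30, J=1000, K=10.
dims = (30, 1000, 10)
padded = tuple(round_up(d) for d in dims)

print(gemm_dims_ok(*dims))    # False: 30 and 10 are not multiples of 8
print(padded)                 # (32, 1000, 16)
print(gemm_dims_ok(*padded))  # True
```

In practice this would mean picking batch sizes and in/out feature counts that are already multiples of 8, or padding them up to the next one.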

Also, could you try setting torch.backends.cudnn.benchmark = True at the beginning of your script?