CNN fp16 slower than fp32 on Tesla P100

(Jiaming Sun) #1

Hi there, I’m testing with fp16 features of pytorch with a benchmark script provided here, getting these result(all with CUDA8 and cuDNN6):

➜  ~ python
Titan X Pascal(Dell T630, anaconda2, pytorch 0.3.0):
FP32 Iterations per second:  1.7890313980917067
FP16 Iterations per second:  1.8345766566297141

Tesla P100(DGX-1, anaconda3, pytorch 0.3.0):
('FP32 Iterations per second: ', 2.001722807676548)
('FP16 Iterations per second: ', 1.9109340821222125)

Tesla P100(DGX-1, pytorch docker image provided by NVIDIA, pytorch 0.2.0):
FP32 Iterations per second:  1.9826932313239591
FP16 Iterations per second:  1.8854441494961636

For the detailed information of the docker image refer to here, it is theoretically optimized for DGX-1.
It can be seen that on P100 fp16 is even slower than fp32. Am I doing anything wrong? or PyTorch fp16 support is still under heavily developing? If so is there any doc for fp16 support status?

I would like to do more testing if needed, thanks a lot!
And another thanks for this great framework!


on P100 we dont expect FP16 to be any faster, because we disabled FP16 math on P100 (it is numerically unstable). We use simulated FP16, where storage is FP16, but compute is in FP32 (so it upconverts to FP32 before doing operations).