Hi there, I’m testing the fp16 features of PyTorch with a benchmark script provided here, and I get these results (all with CUDA 8 and cuDNN 6):
➜ ~ python test_pytorch_vgg19_fp16.py
Titan X Pascal (Dell T630, anaconda2, PyTorch 0.3.0):
FP32 Iterations per second: 1.7890313980917067
FP16 Iterations per second: 1.8345766566297141
Tesla P100 (DGX-1, anaconda3, PyTorch 0.3.0):
('FP32 Iterations per second: ', 2.001722807676548)
('FP16 Iterations per second: ', 1.9109340821222125)
Tesla P100 (DGX-1, PyTorch docker image provided by NVIDIA, PyTorch 0.2.0):
FP32 Iterations per second: 1.9826932313239591
FP16 Iterations per second: 1.8854441494961636
For details on the docker image, refer to here; in theory it is optimized for the DGX-1.
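As the numbers show, fp16 is actually slower than fp32 on the P100 (by roughly 5% in both setups), and only marginally faster on the Titan X. Am I doing anything wrong, or is PyTorch's fp16 support still under heavy development? If so, is there any documentation on the current status of fp16 support?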
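To put a number on the gap, here is a small arithmetic sketch that computes the relative fp16 vs. fp32 throughput change from the results pasted above. It is not part of the benchmark script; the `results` dict and the `relative_change` helper are names I made up just for this illustration.

```python
# Iterations per second copied from the runs reported above.
# The dict layout and the helper name are hypothetical, for illustration only.
results = {
    "Titan X Pascal (pytorch 0.3.0)": (1.7890313980917067, 1.8345766566297141),
    "Tesla P100 (pytorch 0.3.0)": (2.001722807676548, 1.9109340821222125),
    "Tesla P100 (NVIDIA docker, pytorch 0.2.0)": (1.9826932313239591, 1.8854441494961636),
}


def relative_change(fp32_ips, fp16_ips):
    """Return the FP16 throughput change relative to FP32, in percent."""
    return (fp16_ips / fp32_ips - 1.0) * 100.0


# Print one line per setup; positive means fp16 is faster than fp32.
for name, (fp32_ips, fp16_ips) in results.items():
    print("{}: {:+.1f}%".format(name, relative_change(fp32_ips, fp16_ips)))
```

This works out to roughly +2.5% on the Titan X and about -4.5% and -4.9% on the two P100 setups, so the differences are small in every case.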
I'm happy to do more testing if needed. Thanks a lot!
And another thanks for this great framework!