Tried ResNet-50 training with fp16 input on an Amazon AWS V100 instance, following the example from https://github.com/csarofeen/examples.git, using the NVIDIA Docker image pytorch:18.05-py3.
It is about 30% faster compared with float32 training. I had expected something closer to a 5x speedup, so there is clearly a long way to go before the software can exploit the hardware's full potential.
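For context, the usual fp16 training recipe at the time was fp16 model and activations with fp32 master weights plus loss scaling. Here is a minimal sketch of that pattern; it is my own illustration (static loss scale, dummy one-batch loader), not necessarily exactly what the csarofeen/examples repo does:

```python
import torch
import torchvision.models as models

model = models.resnet50().cuda().half()

# fp32 master copy of the parameters; the optimizer updates these.
master_params = [p.detach().clone().float() for p in model.parameters()]
for mp in master_params:
    mp.requires_grad_()

optimizer = torch.optim.SGD(master_params, lr=0.1, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss().cuda()
loss_scale = 128.0  # static scale to keep small fp16 gradients from underflowing

# Dummy single-batch loader so the sketch is self-contained.
loader = [(torch.randn(8, 3, 224, 224), torch.randint(0, 1000, (8,)))]

for images, targets in loader:
    images = images.cuda().half()
    targets = targets.cuda()

    output = model(images)
    loss = criterion(output.float(), targets)  # compute the loss in fp32

    model.zero_grad()
    (loss * loss_scale).backward()

    # Copy the scaled fp16 grads into the fp32 master params, unscaling on the way.
    for master, p in zip(master_params, model.parameters()):
        if p.grad is not None:
            master.grad = p.grad.detach().float() / loss_scale

    optimizer.step()

    # Push the updated fp32 master weights back into the fp16 model.
    with torch.no_grad():
        for master, p in zip(master_params, model.parameters()):
            p.copy_(master)
```

In practice the BatchNorm layers are often kept in fp32 for numerical stability; the sketch above converts everything to half for brevity.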
NVCaffe's float16 path seems faster than PyTorch's, but I quickly ran into memory problems: my dataset is huge (120 GB for now, and growing), and NVCaffe appears to map the whole dataset into virtual address space via LMDB.
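To be fair, LMDB mmaps the data file, so a huge virtual size is expected and pages should only be faulted in on demand; the pressure usually comes from the page cache rather than the mapping itself. For reference, this is how a large LMDB is typically opened read-only from Python so it is paged lazily (py-lmdb API; the integer key scheme and the sample decoding are assumptions about how the database was written):

```python
import lmdb
from torch.utils.data import Dataset

class LMDBDataset(Dataset):
    def __init__(self, path):
        # readonly + readahead=False reduces page-cache thrash when the
        # database is much larger than physical RAM.
        self.env = lmdb.open(path, readonly=True, lock=False,
                             readahead=False, meminit=False)
        with self.env.begin(write=False) as txn:
            self.length = txn.stat()["entries"]

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        # Key scheme is hypothetical; adjust to how the LMDB was actually built.
        key = "{:08d}".format(index).encode("ascii")
        with self.env.begin(write=False) as txn:
            raw = txn.get(key)
        # `raw` is the stored bytes; deserialization depends on the encoding used.
        return raw
```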