PyTorch on V100 GPU

That is very helpful! To clarify, this is the patch on master, right?: https://github.com/pytorch/examples/compare/master...csarofeen:fp16_examples_cuDNN-ATen

One slightly clunky piece is that you’re calling model.zero_grad instead of opt.zero_grad. Why is this? What would it take to allow us to always just call model.zero_grad (i.e. in both the half and single precision paths)?

Yes, that is the patch for master.
In the fp16 case, the optimizer is created with an fp32 copy of the parameters, but gradients are accumulated into the model's (fp16) parameters. If .zero_grad() were called on opt, it would zero only the fp32 copies, not the accumulated gradients. To avoid this clunkiness, one would need a “mixed precision” optimizer that handles behind the scenes the fp32 parameter copy that is currently created and handled explicitly. That would require either changes to the core SGD optimizer, or reimplementing the optimizer as a separate module.
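For illustration, here is a rough sketch of the situation (a toy nn.Linear model and hand-rolled fp32 master copies, not the code from the patch): the optimizer only ever sees the fp32 copies, so opt.zero_grad() would clear gradients on those copies, while the gradients that backward() accumulates live on the fp16 model parameters, which is why model.zero_grad() is the call that matters.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 4).cuda().half()            # fp16 model parameters

# fp32 "master" copies that the optimizer actually updates
master_params = [p.detach().clone().float() for p in model.parameters()]
for mp in master_params:
    mp.requires_grad = True

opt = torch.optim.SGD(master_params, lr=0.1)      # optimizer holds only the fp32 copies

x = torch.randn(8, 16, device='cuda', dtype=torch.half)
loss = model(x).float().sum()

model.zero_grad()        # clears the fp16 grads that backward() will accumulate into;
                         # opt.zero_grad() would only touch master_params' grads
loss.backward()          # grads land on model.parameters(), not on master_params

# copy fp16 grads into the fp32 master copies, then step in fp32
for master, p in zip(master_params, model.parameters()):
    master.grad = p.grad.detach().float()
opt.step()

# copy the updated fp32 weights back into the fp16 model
with torch.no_grad():
    for master, p in zip(master_params, model.parameters()):
        p.copy_(master)
```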
