PyTorch on V100 GPU

Yes, that is the patch for master.
In the fp16 case, the optimizer is created with fp32 copies of the parameters, but gradients are accumulated into the model's (fp16) parameters. If .zero_grad() were called on the optimizer, it would zero only the copies, not the accumulated gradients. To avoid this clunkiness, one would need a "mixed precision" optimizer that handles behind the scenes the fp32 parameter copy that is currently created and handled explicitly. That would require either changes to the core SGD optimizer or reimplementing the optimizer as a separate module.
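As a rough illustration of that pattern (a minimal sketch, not the actual patch; the model, data, and hyperparameters below are placeholders, and a CUDA device is assumed), the optimizer only ever sees the fp32 "master" copies, so gradients have to be zeroed on the model itself and shuttled over before each step:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 4).cuda().half()               # fp16 model parameters
master_params = [p.detach().clone().float().requires_grad_(True)
                 for p in model.parameters()]
opt = torch.optim.SGD(master_params, lr=0.01)        # optimizer only sees fp32 copies

for _ in range(10):
    x = torch.randn(8, 16, device="cuda").half()
    y = torch.randn(8, 4, device="cuda").half()

    # opt.zero_grad() would zero only the fp32 copies; the accumulated
    # gradients live on the fp16 model parameters, so zero those instead.
    model.zero_grad()

    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()                                   # grads land on the fp16 params

    # Copy fp16 gradients onto the fp32 master copies before stepping.
    for master, p in zip(master_params, model.parameters()):
        master.grad = p.grad.detach().float()

    opt.step()                                        # update the fp32 master weights

    # Copy the updated fp32 weights back into the fp16 model.
    with torch.no_grad():
        for master, p in zip(master_params, model.parameters()):
            p.copy_(master)
```

A "mixed precision" optimizer would simply fold the copy-to-master, step, and copy-back bookkeeping into .step() and make .zero_grad() target the right tensors, which is exactly the part that currently has to be done explicitly.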
