Yes, that is the patch for master.
In the fp16 case, the optimizer is created with an fp32 copy of the parameters, but gradients are accumulated into the model's fp16 parameters. If .zero_grad() were called on the optimizer, it would zero only the gradients of the fp32 copies, not the gradients accumulated on the model parameters. To avoid this clunkiness, one would need a "mixed precision" optimizer that handles behind the scenes the fp32 parameter copy that is now created and handled explicitly. That would require either changes to the core SGD optimizer or reimplementing the optimizer as a separate module.