ZeroDivisionError: float division by zero when applying "O2"

I’m trying to apply gradient compression together with mixed-precision training (apex amp, opt_level = “O2” with dynamic loss scaling), but I encounter an error in the middle of training:

  File "/home/anaconda3/lib/python3.7/site-packages/ray/tune/function_runner.py", line 248, in run
    self._entrypoint()
  File "/home/anaconda3/lib/python3.7/site-packages/ray/tune/function_runner.py", line 316, in entrypoint
    self._status_reporter.get_checkpoint())
  File "/home/anaconda3/lib/python3.7/site-packages/ray/tune/function_runner.py", line 575, in _trainable_func
    output = fn()
  File "/home/resnetTraceDDP.py", line 634, in train
    scaled_loss.backward() # calculate the gradients
  File "/home/anaconda3/lib/python3.7/contextlib.py", line 119, in __exit__
    next(self.gen)
  File "/home/anaconda3/lib/python3.7/site-packages/apex/amp/handle.py", line 123, in scale_loss
    optimizer._post_amp_backward(loss_scaler)
  File "/home/anaconda3/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 190, in post_backward_with_master_weights
    models_are_masters=False)
  File "/home/anaconda3/lib/python3.7/site-packages/apex/amp/scaler.py", line 119, in unscale
    self.unscale_python(model_grads, master_grads, scale)
  File "/home/anaconda3/lib/python3.7/site-packages/apex/amp/scaler.py", line 88, in unscale_python
    1./scale,
ZeroDivisionError: float division by zero
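
For context, the relevant part of my training step looks roughly like this (a minimal sketch; the loader is a placeholder for my actual data pipeline, and 0.1/0.9 are just illustrative hyperparameters):

    import torch
    import torchvision
    from apex import amp

    model = torchvision.models.resnet50().cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    criterion = torch.nn.CrossEntropyLoss()

    # O2 keeps FP16 model weights with FP32 master weights in the optimizer;
    # loss_scale="dynamic" is the default for O2
    model, optimizer = amp.initialize(model, optimizer,
                                      opt_level="O2", loss_scale="dynamic")

    for data, target in loader:
        optimizer.zero_grad()
        loss = criterion(model(data), target)
        # amp scales the loss before backward so FP16 gradients don't underflow;
        # on exiting the context it unscales the gradients by 1/scale,
        # which is where the ZeroDivisionError above is raised
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
        optimizer.step()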

I’m applying this to a ResNet model (ResNet-50, for example), and all of the hyperparameter values are reasonable. I’m not entirely sure how to fix this; would switching to a static loss scale work? I was also wondering what the difference between dynamic and static loss scaling is, and whether the choice would lead to a difference in run time or accuracy.
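
By a static loss scale I mean passing a fixed value instead of “dynamic” when initializing amp, something like this (a sketch; 128.0 is just an illustrative value):

    # A fixed scale is never adjusted, so it can't collapse to zero,
    # but too large a value can overflow and too small a value can underflow.
    model, optimizer = amp.initialize(model, optimizer,
                                      opt_level="O2", loss_scale=128.0)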

We recommend using the native mixed-precision training utilities via torch.cuda.amp instead of apex/amp.
You can find some examples here.
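
For reference, a minimal sketch of the equivalent native pattern (assuming a standard training loop; model, optimizer, loader, and criterion stand in for your own setup):

    import torch
    from torch.cuda.amp import autocast, GradScaler

    scaler = GradScaler()  # dynamic loss scaling, handled natively

    for data, target in loader:
        optimizer.zero_grad()
        with autocast():                   # run the forward pass in mixed precision
            output = model(data)
            loss = criterion(output, target)
        scaler.scale(loss).backward()      # backward on the scaled loss
        scaler.step(optimizer)             # unscales grads; skips the step on inf/NaN
        scaler.update()                    # grows/shrinks the scale dynamically

GradScaler implements dynamic loss scaling: step() skips the optimizer step when inf/NaN gradients are detected, and update() adjusts the scale accordingly, so a bad batch won't corrupt the weights.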