This happens a few steps into epoch 217. The loss itself is not NaN, but the gradients are:
SystemLog: 2020-02-19 06:18:40,416:DEBUG : transformers_pretraining.trainer.apexDDP : 26 : Enabling all reduce
SystemLog: 2020-02-19 06:18:40,417:DEBUG : transformers_pretraining.trainer.apexDDP : 138 : ***** Training step 1532 *****
SystemLog: 2020-02-19 06:18:40,417:DEBUG : transformers_pretraining.utils : 47 : Inside <function Singleprocess._forward at 0x7f7266673840>
SystemLog: 2020-02-19 06:18:40,417:DEBUG : transformers_pretraining.utils : 48 : torch.cuda.get_device_properties(0).total_memory = 16914055168, torch.cuda.memory_allocated() = 5356990464
SystemLog: 2020-02-19 06:18:40,469:DEBUG : transformers_pretraining.trainer.apexDDP : 45 : loss scale = 64.0, loss = 0.6044921875
SystemLog: 2020-02-19 06:18:40,469:DEBUG : transformers_pretraining.trainer.apexDDP : 47 : scaled loss = 38.6875
model , optimizer max grad before clipping nan, nan
model , optimizer max grad after clipping nan, nan
max optimizer parameter : 11.71293830871582
SystemLog: 2020-02-19 06:18:41,270:DEBUG : transformers_pretraining.trainer.apexDDP : 119 : model
module.bert.embeddings.word_embeddings.weight = tensor([-0.0005, -0.0307, 0.0093, 0.0120, -0.0311], device='cuda:0',
dtype=torch.float16, grad_fn=), grad = tensor([nan, nan, nan, nan, nan], device='cuda:0', dtype=torch.float16) , sum = nan
SystemLog: 2020-02-19 06:18:41,272:DEBUG : transformers_pretraining.trainer.apexDDP : 125 : optimizer
tensor([-0.0005, -0.0307, 0.0093, 0.0120, -0.0311], device='cuda:0',
grad_fn=) tensor([nan, nan, nan, nan, nan], device='cuda:0') torch.float32 tensor(nan, device='cuda:0')
max model parameter : 11.7109375
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32.0
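For reference, here is a minimal sketch of what the log above corresponds to and how the same NaN check can be made explicit before the optimizer step. It assumes the model and optimizer were already wrapped with `amp.initialize(...)`; the function and variable names are illustrative, not the actual trainer code:

```python
import torch
from apex import amp

def training_step(model, optimizer, loss):
    # Illustrative sketch, not the trainer's real step function.
    optimizer.zero_grad()

    # Apex multiplies the loss by the current loss scale before backward,
    # which matches the log: 0.6044921875 * 64.0 == 38.6875.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()

    # Clipping applies to the FP32 master grads, not the FP16 model grads.
    # Note that clipping cannot repair grads that are already NaN, which is
    # why "max grad after clipping" is still nan in the log.
    torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), max_norm=1.0)

    # Explicit non-finite check before stepping. Apex's dynamic loss scaler
    # does the same thing internally: on overflow it skips the step and
    # halves the scale ("reducing loss scale to 32.0" in the log).
    bad = [p for p in amp.master_params(optimizer)
           if p.grad is not None and not torch.isfinite(p.grad).all()]
    if bad:
        print(f"skipping step: {len(bad)} params with non-finite grads")
        return

    optimizer.step()
```

The check is done on `amp.master_params(optimizer)` rather than `model.parameters()` because, as the log shows, both the FP16 model gradients and the FP32 optimizer (master) gradients are already NaN by the time clipping runs.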