This happens a few steps into epoch 217. The loss itself is not NaN, but the gradients are:
SystemLog: 2020-02-19 06:18:40,416:DEBUG : transformers_pretraining.trainer.apexDDP : 26 : Enabling all reduce
SystemLog: 2020-02-19 06:18:40,417:DEBUG : transformers_pretraining.trainer.apexDDP : 138 : ***** Training step 1532 *****
SystemLog: 2020-02-19 06:18:40,417:DEBUG : transformers_pretraining.utils : 47 : Inside <function Singleprocess._forward at 0x7f7266673840>
SystemLog: 2020-02-19 06:18:40,417:DEBUG : transformers_pretraining.utils : 48 : torch.cuda.get_device_properties(0).total_memory = 16914055168, torch.cuda.memory_allocated() = 5356990464
SystemLog: 2020-02-19 06:18:40,469:DEBUG : transformers_pretraining.trainer.apexDDP : 45 : loss scale = 64.0, loss = 0.6044921875
SystemLog: 2020-02-19 06:18:40,469:DEBUG : transformers_pretraining.trainer.apexDDP : 47 : scaled loss = 38.6875
model , optimizer max grad before clipping nan, nan
model , optimizer max grad after clipping nan, nan
max optimizer parameter : 11.71293830871582
SystemLog: 2020-02-19 06:18:41,270:DEBUG : transformers_pretraining.trainer.apexDDP : 119 : model
module.bert.embeddings.word_embeddings.weight = tensor([-0.0005, -0.0307, 0.0093, 0.0120, -0.0311], device='cuda:0',
dtype=torch.float16, grad_fn=), grad = tensor([nan, nan, nan, nan, nan], device='cuda:0', dtype=torch.float16) , sum = nan
SystemLog: 2020-02-19 06:18:41,272:DEBUG : transformers_pretraining.trainer.apexDDP : 125 : optimizer
tensor([-0.0005, -0.0307, 0.0093, 0.0120, -0.0311], device='cuda:0',
grad_fn=) tensor([nan, nan, nan, nan, nan], device='cuda:0') torch.float32 tensor(nan, device='cuda:0')
max model parameter : 11.7109375
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32.0
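For reference, here is a minimal sketch of what the log above corresponds to and how the same NaN check can be made explicit before the optimizer step. It assumes the model and optimizer were already wrapped with `amp.initialize(...)`; the function and variable names are illustrative, not the actual trainer code:

```python
import torch
from apex import amp

def training_step(model, optimizer, loss):
    # Illustrative sketch, not the trainer's real step function.
    optimizer.zero_grad()

    # Apex multiplies the loss by the current loss scale before backward,
    # which matches the log: 0.6044921875 * 64.0 == 38.6875.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()

    # Clipping applies to the FP32 master grads, not the FP16 model grads.
    # Note that clipping cannot repair grads that are already NaN, which is
    # why "max grad after clipping" is still nan in the log.
    torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), max_norm=1.0)

    # Explicit non-finite check before stepping. Apex's dynamic loss scaler
    # does the same thing internally: on overflow it skips the step and
    # halves the scale ("reducing loss scale to 32.0" in the log).
    bad = [p for p in amp.master_params(optimizer)
           if p.grad is not None and not torch.isfinite(p.grad).all()]
    if bad:
        print(f"skipping step: {len(bad)} params with non-finite grads")
        return

    optimizer.step()
```

The check is done on `amp.master_params(optimizer)` rather than `model.parameters()` because, as the log shows, both the FP16 model gradients and the FP32 optimizer (master) gradients are already NaN by the time clipping runs.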