Apex: NaN loss with O1 and O2

Hi, I am using apex to try to fit a larger model (T5-large) on a single K40 GPU. I understand the K40 uses the Kepler architecture, so most of the benefits of apex will not be usable on it. However, I read that it is still possible to use FP16 to fit a larger model, so I went ahead with that. I managed to start training T5-large, which otherwise would not be possible given the GPU’s specs.

My problem is that when using opt-level O0 (FP32 training), the batch losses are fine (not NaN). When I change to mixed-precision training (O1 or O2), the batch losses come out as NaN. I also tried smaller models (T5-base and T5-small), where the same thing happened: training proceeded completely normally with O0 (normal loss values) but produced NaN losses with O1 and O2.
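For reference, here is a rough sketch of my apex setup (the linear model, optimizer and dummy batch are just stand-ins to keep the sketch self-contained; in my actual code the model is T5-large from transformers):

import torch
from apex import amp

# stand-in model and optimizer; in my case it is T5-large
model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# opt_level is what I switch between: 'O0' (FP32), 'O1'/'O2' (mixed precision)
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

# dummy batch just to make the sketch runnable
data = torch.randn(8, 512, device='cuda')
target = torch.randn(8, 512, device='cuda')

optimizer.zero_grad()
loss = torch.nn.functional.mse_loss(model(data), target)
# the loss is scaled before backward so FP16 gradients do not underflow
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()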

I’ve checked all the inputs; there are no NaNs or Infs.

Does anyone have an idea what the problem is here? Thank you in advance!

The loss should not contain NaN values, while the gradients might encounter NaNs, which will cause amp to reduce the loss scale.

Could you check the model’s operations for custom ops which might require FP32 precision due to overflow in FP16?
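If you do find such an op, apex should let you keep it in FP32 via its registration utilities; roughly something like this (softmax here is only a placeholder for whichever op is overflowing):

from apex import amp
import torch

# keep a specific function in FP32 under O1 autocasting;
# torch.softmax is just a placeholder example for the overflowing op
amp.register_float_function(torch, 'softmax')

# note: the registration has to happen before amp.initialize(model, optimizer, ...)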

Also, if you just want to save memory, you could try to use torch.utils.checkpoint to trade compute for memory.
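A minimal sketch of checkpointing (using a generic block, not your T5 model):

import torch
from torch.utils.checkpoint import checkpoint

class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU())
        self.block2 = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU())

    def forward(self, x):
        # activations inside the checkpointed blocks are not stored;
        # they are recomputed during the backward pass, trading compute for memory
        x = checkpoint(self.block1, x)
        x = checkpoint(self.block2, x)
        return x

model = Net().cuda()
x = torch.randn(8, 1024, device='cuda', requires_grad=True)  # input must require grad
model(x).mean().backward()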


Hi, thanks for your help. Yes, there was overflow in the FP16 gradients, as you mentioned. I ended up using torch.utils.checkpoint and was able to fit a larger model. Could I trouble you one more time?
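For reference, this is roughly how I confirmed the overflow (a quick diagnostic run after loss.backward(), nothing rigorous):

# quick check for non-finite gradients after loss.backward()
for name, p in model.named_parameters():
    if p.grad is not None and not torch.isfinite(p.grad).all():
        print(f'non-finite gradient in {name}')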

I ran into this error while using torch.utils.checkpoint:

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

It happens during the loss.backward() call. The loss prints as tensor(0.1663, device='cuda:0').

My summarised code snippet looks like this:

outputs = checkpoint(self.t5, kwargs['input_ids'], kwargs['attention_mask'], None,
                     decoder_input_ids, kwargs['decoder_attention_mask'])
loss_fct = CrossEntropyLoss(ignore_index=-100)
loss = loss_fct(outputs[0].view(-1, outputs[0].size(-1)), lm_labels.view(-1))
...
loss.backward()

I tried following the checkpoint tutorial you mentioned in another thread, but I couldn’t figure out what went wrong. Sorry to bother you again!

GitHub is unfortunately down at the moment, but if I remember correctly, this issue could be solved by passing a dummy input with requires_grad=True to the checkpointed model?
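Something along these lines, sketched from memory against your snippet (the wrapper and dummy names are made up; the key point is that checkpoint needs at least one input that requires grad, and integer tensors such as input_ids and the masks cannot):

# all of checkpoint's real inputs here are integer tensors (ids/masks), so none of
# them can require grad; threading a small float tensor with requires_grad=True
# through the checkpointed function gives autograd something to attach to
dummy = torch.zeros(1, device='cuda', requires_grad=True)

def t5_forward(dummy, input_ids, attention_mask, decoder_input_ids, decoder_attention_mask):
    # dummy is unused inside; it only exists so the checkpointed output gets a grad_fn
    return self.t5(input_ids, attention_mask, None,
                   decoder_input_ids, decoder_attention_mask)

outputs = checkpoint(t5_forward, dummy, kwargs['input_ids'], kwargs['attention_mask'],
                     decoder_input_ids, kwargs['decoder_attention_mask'])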

PS: You could also check out the nightly binaries with native amp support :wink:
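With native amp the training loop would look roughly like this (a sketch of the torch.cuda.amp API; loader, model, optimizer and compute_loss are placeholders for your own code):

import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for batch in loader:                       # your DataLoader
    optimizer.zero_grad()
    with autocast():                       # forward pass runs in mixed precision
        loss = compute_loss(model, batch)  # placeholder for your loss computation
    scaler.scale(loss).backward()          # scale to avoid FP16 gradient underflow
    scaler.step(optimizer)                 # unscales grads, skips step on overflow
    scaler.update()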


Thanks! I found this link and followed the same method :smiley: