Hi, I am using apex to try to fit a larger model (T5-large) on a single K40 GPU. I understand the K40 uses the Kepler architecture, so most of the benefits of apex will not be usable on it. However, I read that it is still possible to use FP16 to fit a larger model, so I went ahead with that. I managed to start training T5-large, which otherwise would not be possible given the GPU's specs.
My problem is that with opt-level O0 (FP32 training) the batch losses are fine (not NaN), but when I change to mixed-precision training with O2, the batch losses start to come out as NaN. I also tried smaller models (T5-base and T5-small), where the same thing happened: training proceeded totally normally with O0 (normal loss values) but produced NaN losses with O2.
I've checked all the inputs; there are no NaNs or Infs.
Does anyone have an idea what the problem is here? Thank you in advance!
The loss should not contain NaN values, while the gradients might encounter NaNs, which will reduce the loss scaling.
Could you check the model's operations for custom ops, which might require FP32 precision due to overflow in FP16?
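To see where things blow up, you could iterate over the parameters after the backward pass and flag any non-finite gradients. A minimal sketch with a toy model (the model and shapes here are placeholders, not your T5 setup):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the real network
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1))

x = torch.randn(4, 8)
loss = model(x).sum()
loss.backward()

# Report any parameter whose gradient contains NaN/Inf
for name, param in model.named_parameters():
    if param.grad is not None and not torch.isfinite(param.grad).all():
        print(f"non-finite gradient in {name}")
```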
Also, if you just want to save memory, you could try torch.utils.checkpoint to trade compute for memory.
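The basic usage looks roughly like this (a minimal sketch with a hypothetical two-stage model; the checkpointed segment does not store its activations and recomputes them during backward):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Hypothetical two-stage model; stage1's activations are recomputed in backward
stage1 = nn.Sequential(nn.Linear(16, 16), nn.ReLU())
stage2 = nn.Linear(16, 1)

# The input must require grad so the checkpointed segment can attach to the graph
x = torch.randn(4, 16, requires_grad=True)

h = checkpoint(stage1, x)       # activations inside stage1 are not stored
loss = stage2(h).sum()
loss.backward()                 # stage1's forward is recomputed here
```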
Hi, thanks for your help. Yes, there was overflow in the gradients with FP16, as you mentioned. I ended up using torch.utils.checkpoint and was able to fit a larger model. Could I trouble you one more time?
I ran into this error while using torch.utils.checkpoint:
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
It happened during the backward call; the loss itself prints a normal value.
My summarised code snippet looks like this:
outputs = checkpoint(self.t5, kwargs['input_ids'], kwargs['attention_mask'], None,
                     decoder_input_ids, kwargs['decoder_attention_mask'])
loss_fct = CrossEntropyLoss(ignore_index=-100)
loss = loss_fct(outputs.view(-1, outputs.size(-1)), lm_labels.view(-1))
I tried following the checkpoint tutorial you mentioned in another thread, but I couldn't figure out what went wrong. Sorry to bother you again!
GitHub is unfortunately down at the moment, but if I remember correctly, this issue can be solved by passing a dummy input with requires_grad=True to the checkpointed model.
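For completeness, the workaround looks roughly like this. Since all of your real inputs are integer id tensors (which cannot require grad), checkpoint sees no grad-requiring input and returns an output without a grad_fn; an unused dummy tensor fixes that. This is a sketch with a hypothetical toy module, not your T5 call:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class ToyModel(nn.Module):
    # Hypothetical stand-in: like T5, its real inputs are integer ids,
    # so none of the actual inputs require grad
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(100, 16)
        self.head = nn.Linear(16, 100)

    def forward(self, input_ids, dummy):
        # dummy is unused; it only gives checkpoint a grad-requiring input
        return self.head(self.embed(input_ids))

model = ToyModel()
input_ids = torch.randint(0, 100, (2, 5))
dummy = torch.zeros(1, requires_grad=True)

out = checkpoint(model, input_ids, dummy)
loss = out.sum()
loss.backward()  # no RuntimeError: out now has a grad_fn
```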
PS: You could also check out the nightly binaries with native amp support.
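With native amp, the training step would look roughly like this (a sketch of the torch.cuda.amp API; the model, optimizer, and data are placeholders, and the amp pieces are disabled automatically on a CPU-only machine):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical toy model and optimizer
model = nn.Linear(8, 1).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.randn(4, 8, device=device)
y = torch.randn(4, 1, device=device)

with torch.cuda.amp.autocast(enabled=(device == "cuda")):
    loss = nn.functional.mse_loss(model(x), y)

scaler.scale(loss).backward()  # scales the loss to avoid FP16 gradient underflow
scaler.step(opt)               # unscales grads and skips the step if they overflowed
scaler.update()                # adjusts the scale factor for the next iteration
```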
Thanks! I found this link and followed the same method.