Hi, I am using apex to try to fit a larger model (T5-large) on a single K40 GPU. I understand the K40 is a Kepler-generation GPU, so most of apex's benefits are not available on it. However, I read that it is still possible to use fp16 to fit a larger model, so I went ahead with that. I did manage to start training T5-large, which would otherwise not fit given the GPU's specs.
My problem is that with opt-level O0 (fp32 training) the batch losses are fine (not nan), but as soon as I switch to mixed precision (O1 or O2) the batch losses come out as nan. I also tried smaller models (T5-base and T5-small) and the same thing happened: training proceeded completely normally with O0 (normal loss values), but produced nan losses with O1 and O2.
I’ve checked all the inputs, no nans or infs.
Does anyone have an idea what the problem is here? Thank you in advance!
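For reference, my setup looks roughly like this (a minimal sketch, not my actual training script; the optimizer choice and the random dummy batch are just placeholders):

```python
import torch
from apex import amp
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-large").cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# "O0" is pure fp32 (losses fine); "O1"/"O2" enable mixed precision,
# which is where the nan losses show up.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

# dummy batch of token ids, just to illustrate the training step
input_ids = torch.randint(0, 32000, (2, 16)).cuda()
labels = torch.randint(0, 32000, (2, 16)).cuda()

loss = model(input_ids=input_ids, labels=labels)[0]  # loss is returned when labels are passed
with amp.scale_loss(loss, optimizer) as scaled_loss:  # loss scaling for fp16
    scaled_loss.backward()
optimizer.step()
optimizer.zero_grad()
```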
Hi, thanks for your help. Yes, there was gradient overflow in fp16, as you mentioned. I ended up using torch.utils.checkpoint and was able to fit a larger model. Could I trouble you one more time?
I ran into this error while using utils.checkpoint:
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
It happens during the loss.backward() call. The loss prints as tensor(0.1663, device='cuda:0').
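The failing pattern looks roughly like this (a simplified sketch; my real code wraps the model a bit differently, and the names are placeholders):

```python
import torch
from torch.utils.checkpoint import checkpoint
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-large").cuda()
input_ids = torch.randint(0, 32000, (2, 16)).cuda()
labels = torch.randint(0, 32000, (2, 16)).cuda()

def run_model(input_ids, labels):
    return model(input_ids=input_ids, labels=labels)[0]  # returns the loss

# input_ids/labels are integer tensors and carry no gradient, so the output of
# checkpoint() has no grad_fn and loss.backward() raises the RuntimeError above.
loss = checkpoint(run_model, input_ids, labels)
loss.backward()
```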
GitHub is unfortunately down at the moment, but if I remember correctly, this issue can be solved by passing a dummy input with requires_grad=True to the checkpointed function.
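Something along these lines (untested sketch, reusing the names from your snippet; `dummy` is just an illustrative name):

```python
import torch
from torch.utils.checkpoint import checkpoint

# A float tensor that requires grad; it is never used inside the function,
# it only gives checkpoint() an input that is attached to the autograd graph.
dummy = torch.zeros(1, device="cuda", requires_grad=True)

def run_model(dummy, input_ids, labels):
    return model(input_ids=input_ids, labels=labels)[0]

loss = checkpoint(run_model, dummy, input_ids, labels)
loss.backward()  # now has a grad_fn, and gradients reach the model parameters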
PS: You could also check out the nightly binaries with native amp support
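With native amp the training loop would look roughly like this (sketch only; model, optimizer, and dataloader are placeholders):

```python
import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
for input_ids, labels in dataloader:           # placeholder dataloader
    optimizer.zero_grad()
    with autocast():                           # runs the forward pass in mixed precision
        loss = model(input_ids=input_ids, labels=labels)[0]
    scaler.scale(loss).backward()              # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)                     # unscales grads; skips the step on inf/nan
    scaler.update()
```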