Hi! I’m running into a very strange problem: my model trains fine without mixed precision, but as soon as I turn on fp16 mixed precision, DDP training breaks because none of the model’s weights receive gradients, except for the norm layers and manually registered nn.Parameters.
The error I get:
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameters which did not receive grad for rank 1: model.blocks.5.patches_cross_attn_ff.net.3.bias, model.blocks.5.patches_cross_attn_ff.net.3.weight,
... omitted for space
I also tried the same training script on a single GPU (no DDP) and hit the same missing-gradient problem: training doesn’t crash since there’s no DDP, but the loss clearly shows that only part of the model is being updated.
Why would mixed precision training cause this? Any suggestions?
Thanks!
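For context, my training step follows roughly the standard AMP recipe sketched below (a simplified stand-in: the model, data, and loss here are placeholders for my actual setup, and on CPU it falls back to bf16 just so the sketch runs anywhere):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
# fp16 autocast on GPU; bf16 on CPU so the sketch is runnable without CUDA
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16

# Placeholder model/data standing in for my real ones
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
# GradScaler is only needed for fp16 on CUDA; disabled it is a pass-through
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.randn(4, 8, device=device)
y = torch.randn(4, 1, device=device)

optimizer.zero_grad(set_to_none=True)
with torch.autocast(device_type=device, dtype=amp_dtype):
    pred = model(x)
    loss = nn.functional.mse_loss(pred, y)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```

In this plain form (no target computed under no_grad) every parameter receives a gradient as expected.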
UPDATE: it seems to be related to torch.no_grad(). In each iteration I have a torch.no_grad() context where the model computes a target, which is then fed into the model outside that context. If I remove this block and feed dummy values as the target instead, the model receives gradients properly under mixed precision training. I don’t know what to make of this; any suggestions?
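My current guess (an assumption, not something I’ve confirmed): autocast caches the low-precision casts of the weights within an autocast region. If the no_grad() forward runs first inside that region, those cached casts carry no autograd history, and the later grad-enabled forward reuses them, so gradients never flow back to the fp32 weights. If that’s right, computing the target outside autocast, or passing cache_enabled=False, should restore the gradients. A minimal sketch of the workaround (model/loss are placeholders again, bf16 on CPU so it runs anywhere):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16

model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8)).to(device)
x = torch.randn(4, 8, device=device)

# cache_enabled=False stops autocast from reusing weight casts created under
# no_grad(), so the grad-enabled forward builds a proper autograd graph.
with torch.autocast(device_type=device, dtype=amp_dtype, cache_enabled=False):
    with torch.no_grad():
        target = model(x)   # target computed without grad, as in my loop
    pred = model(x)         # grad-enabled forward reusing the same weights
    loss = nn.functional.mse_loss(pred, target)
loss.backward()
```

An alternative with the same intent would be to wrap only the target computation in `torch.autocast(device_type=device, enabled=False)` so the no_grad() forward never touches the autocast cache at all.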