No gradient received in mixed precision training

Hi! I’m running into a very strange problem: my model trains fine without mixed precision, but as soon as I turn on fp16 mixed precision, DDP training breaks because none of the model’s weights receive gradients, except for the norm layers and manually registered nn.Parameters.

The error I got:

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameters which did not receive grad for rank 1: model.blocks.5.patches_cross_attn_ff.net.3.bias, model.blocks.5.patches_cross_attn_ff.net.3.weight,

... omit for space

I also tried the same training script on a single GPU, i.e. without DDP, and hit the same no-gradient problem (the training doesn’t crash there since there’s no DDP, but the loss clearly reflects that only part of the model is being updated).

Why would mixed precision training cause this? Any suggestions?

Thanks!

UPDATE: it seems to be related to torch.no_grad(). In each iteration I have a torch.no_grad() context in which the model computes a target that is then fed back into the model outside the context. If I remove this block of code and feed dummy values as the target instead, the model receives gradients properly under mixed precision training. I don’t know what to make of this, though. Any suggestions?
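
For context, here is a minimal, self-contained sketch of the pattern I’m describing (the layer names and shapes are just placeholders, not my actual model):

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(16, 16)
        self.head = nn.Linear(16, 16)

    def forward(self, x):
        # The target is computed without tracking gradients...
        with torch.no_grad():
            target = self.encoder(x)
        # ...and the same layers are reused outside the no_grad() context.
        pred = self.head(self.encoder(x))
        return pred, target

model = ToyModel().cuda()
x = torch.randn(4, 16, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    pred, target = model(x)
    loss = nn.functional.mse_loss(pred, target)
loss.backward()

# With fp16 autocast enabled, encoder.weight.grad can come out as None here;
# without autocast the same code gives it a normal gradient.
print(model.encoder.weight.grad)
```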

Could you post the code that’s causing the issue?

After some digging, I found this thread:

which describes exactly the same issue. I applied the solution from that thread and training works properly now! Although I’d argue this is more of a bug than a usage mistake, since a torch.no_grad() block can appear anywhere inside the model, while we typically wrap the entire forward pass with autocast only at the outermost level. Also, I was using Hugging Face’s accelerate launch tool, which enables mixed precision by “preparing” the entire model, so it took me a long time to pinpoint the torch.no_grad() block buried inside my model.
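
For anyone who runs into this later, here is a sketch of the kind of fix that works for this pattern, continuing the toy example above. It assumes the culprit is autocast’s weight-cast cache interacting with torch.no_grad(), and it is an illustration rather than my exact code: locally disable autocast around the no_grad() block so the cached fp16 weight copies are only ever created with autograd history.

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(16, 16)
        self.head = nn.Linear(16, 16)

    def forward(self, x):
        # Compute the target with autocast locally disabled, so no fp16 weight
        # copies created under no_grad() end up in autocast's cast cache.
        # x.float() avoids a dtype mismatch if the input is already fp16.
        with torch.no_grad(), torch.autocast(device_type="cuda", enabled=False):
            target = self.encoder(x.float())
        # The grad-enabled part of the forward now casts the weights itself,
        # with autograd history, and gradients flow to encoder as expected.
        pred = self.head(self.encoder(x))
        return pred, target
```

Alternatively, calling torch.clear_autocast_cache() right after the no_grad() block, or constructing the outer autocast with cache_enabled=False, should have the same effect, at the cost of re-casting the weights.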