Gradient failure in torch.nn.parallel.DistributedDataParallel

I got this error when using DDP with 2 trainers on 2 machines. Training ran for a few batches and then raised the message below. I haven’t been able to fully figure out what happened during training. My understanding is that one worker didn’t produce gradients for some parameters while the other worker had already started the next iteration, i.e. the gradient updates were not synchronized correctly.

I tried making all of the forward outputs participate in the loss function and re-ran the distributed training, but I still received the same error message.

Could anyone help me figure out where the potential problem is?

Note: the problem goes away when the number of workers is 1.

 File "/home/tiger/usr_name/simgnn-dgl-torch/train_ray.py", line 117, in train_epoch
    adv_out, feature_out, feature_loss = model(block)
  File "/home/tiger/anaconda3/envs/usr_name/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tiger/anaconda3/envs/usr_name/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 787, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by 
making sure all `forward` function outputs participate in calculating loss. 
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameters which did not receive grad for rank 0: _adv_encoder._ff.ln.2.beta, _adv_encoder._ff.ln.2.alpha, _adv_encoder._ff.layers.0.weight, _adv_encoder._ff.layers.0.bias, _adv_encoder._ff.layers.1.weight, _adv_encoder._ff.layers.1.bias, _adv_encoder._ff.layers.2.weight, _adv_encoder._ff.layers.2.bias, _adv_encoder._ff.ln.0.alpha, _adv_encoder._ff.ln.0.beta, _adv_encoder._ff.ln.1.alpha, _adv_encoder._ff.ln.1.beta
Parameter indices which did not receive grad for rank 0: 11 12 13 14 15 16 19 20 21 22 23 24

Hey @jwyao, have you tried setting find_unused_parameters=True in the DistributedDataParallel (DDP) constructor?
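For reference, a minimal sketch of enabling it (assuming rank is the local device index and the process group has already been initialized):

from torch.nn.parallel import DistributedDataParallel

model = DistributedDataParallel(
    model,
    device_ids=[rank],            # omit device_ids when training on CPU
    find_unused_parameters=True,  # tolerate params that skip some iterations
)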

The error message says that the DDP instance didn’t see gradients for parameters 11 12 13 14 15 16 19 20 21 22 23 24.

You can verify this by running one forward + backward and then looping over all parameters to check whether their .grad field is available/updated. Something like:

from torch.nn.parallel import DistributedDataParallel

# Wrap the model, run one forward + backward, then inspect the .grad fields.
model = DistributedDataParallel(model, device_ids=[rank])
loss_fn(model(inputs)).backward()

for name, p in model.named_parameters():
    if p.grad is None:
        print(f"found unused param: {name}")

If the above doesn’t work, could you please share a minimal repro?

Hi @mrshenli, thanks for your reply. Setting find_unused_parameters=True generated another error.

Actually, I figured out the issue later. The problem was that the computational graphs on different workers were different. I was training a graph neural network on heterogeneous graphs. At each iteration, the graph batch sampled on each machine is different and may be missing some edge types due to random sampling. This leads to the stated problem: the weights associated with an edge type that wasn’t sampled receive no gradient in the backward pass, so the gradient updates across machines get out of sync.
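To make the failure mode concrete, here is a toy sketch (not my actual model; the edge-type names are made up) of the kind of conditional parameter use that random sampling produces:

import torch
import torch.nn as nn

class ToyHeteroLayer(nn.Module):
    """Toy sketch: one linear layer per edge type (hypothetical names)."""
    def __init__(self, dim, edge_types=("cites", "writes", "follows")):
        super().__init__()
        self.per_type = nn.ModuleDict({t: nn.Linear(dim, dim) for t in edge_types})

    def forward(self, feats_by_type):
        # Only the edge types present in this sampled batch are used, so the
        # weights of the missing types never enter the autograd graph.
        outs = [self.per_type[t](x) for t, x in feats_by_type.items()]
        return torch.stack(outs).sum(dim=0)

layer = ToyHeteroLayer(dim=16)
# A sampled batch that happens to contain only two of the three edge types:
out = layer({"cites": torch.randn(4, 16), "writes": torch.randn(4, 16)})
out.sum().backward()
print(layer.per_type["follows"].weight.grad)  # None: unused this iteration

Under DDP with find_unused_parameters=False, two ranks sampling different edge types is exactly the mismatch that triggers the error above.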

This problem can happen with any network whose computational graph is stochastic. To me, it would be a good idea for DDP to handle this case without strictly matching gradient updates from every worker: if a parameter doesn’t appear in the graph on one worker, that worker’s contribution to the parameter’s update is simply zero.

find_unused_parameters was supposed to handle this case. Which error did you see after setting find_unused_parameters=True?

Thanks for your suggestion. When setting find_unused_parameters=True, the following warning was returned:
Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())

I think the problem is that the set of parameters used in each trainer changes from iteration to iteration. In some batches, certain parameters don’t appear in the computational graph due to random graph sampling, while in other batches every parameter appears in all trainers if the sampling happens to be balanced.
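One way to confirm this would be to run the same sampling pipeline in a single process (no DDP) and log which parameters receive no gradient on each batch; dataloader, compute_loss, and optimizer below are placeholders for your own pipeline:

# Sketch: log locally unused parameters per batch. If the printed set
# changes between steps, random graph sampling is indeed the cause.
for step, batch in enumerate(dataloader):
    optimizer.zero_grad(set_to_none=True)  # reset grads to None each step
    loss = compute_loss(model, batch)
    loss.backward()
    unused = [name for name, p in model.named_parameters()
              if p.requires_grad and p.grad is None]
    print(f"step {step}: {len(unused)} unused parameters: {unused}")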

I see. In that case, is it OK to just ignore the above warning?

@mrshenli Thanks for your suggestion! I found an interesting behavior. Previously the model was trained on CPU, and training generated the messages above. But once training was moved to GPU, using find_unused_parameters=True works fine without any error or warning.

This is quite strange, as we don’t expect a CPU → GPU device change, with no model- or training-side changes, to introduce unused parameters. The absence of the warning implies there are unused parameters on the GPU.

If you’re curious to dig into it, you can set find_unused_parameters=False and TORCH_DISTRIBUTED_DEBUG=DETAIL (requires PyTorch 1.9), which will log the unused parameter names in the crash. Then you can check whether those parameters are actually used or unused when running DDP on CPU.
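For example, something along these lines (just one way to wire it in; exporting the variable in the shell that launches each trainer works equally well):

import os

# Enable detailed distributed debug logging (requires PyTorch >= 1.9).
# Must be set in each trainer process before torch.distributed is initialized.
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"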