Hi @mrshenli , thanks for your reply. Setting find_unused_parameters=True generated another error.
Actually, I figured out the issue later. The problem was that the computational graphs on different workers were different. I was training a graph neural network on heterogeneous graphs. At each iteration, the graph batch sampled on each machine is different and may be missing some edge types due to random sampling. This leads to the stated problem: the weights associated with an edge type that wasn't sampled receive no gradient in the backward pass on that machine, so the gradient updates across the different machines no longer match.
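To make the failure mode concrete, here is a minimal sketch (not my actual model; the edge-type names and dimensions are made up) of how a randomly sampled heterogeneous batch can leave some per-edge-type weights out of the autograd graph on one worker:

```python
import torch
import torch.nn as nn

class ToyHeteroLayer(nn.Module):
    def __init__(self, edge_types, dim):
        super().__init__()
        # one weight matrix per edge type
        self.weights = nn.ModuleDict({et: nn.Linear(dim, dim) for et in edge_types})

    def forward(self, feats_by_edge_type):
        # only the edge types present in this batch touch their weights;
        # missing edge types leave their parameters out of the autograd graph
        outputs = [self.weights[et](x) for et, x in feats_by_edge_type.items()]
        return torch.stack(outputs).sum(dim=0)

model = ToyHeteroLayer(["cites", "writes", "likes"], dim=16)

# Worker A's sampled batch happens to contain all edge types ...
batch_a = {et: torch.randn(8, 16) for et in ["cites", "writes", "likes"]}
# ... while worker B's batch misses "likes" due to random sampling.
batch_b = {et: torch.randn(8, 16) for et in ["cites", "writes"]}

loss_b = model(batch_b).sum()
loss_b.backward()
# On worker B, the "likes" weights never receive a gradient:
print(model.weights["likes"].weight.grad)  # None
```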
This problem can happen with any network whose computational graph is random. To me, it would be a good idea for DDP to handle this case without requiring a hard match between the gradient updates from each worker: if a parameter doesn't appear in the graph on one worker, that worker's contribution to the parameter's update is simply zero (see the sketch below).
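For anyone hitting the same issue, a workaround along those lines that I believe should work (a sketch only, assuming a standard DDP training loop; `model`, `batch`, `target`, and `criterion` here are placeholders, not my actual code) is to add a zero-valued term that touches every parameter, so each weight participates in every backward pass and contributes a zero gradient when its edge type wasn't sampled:

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def loss_with_full_graph(model, output, target, criterion):
    loss = criterion(output, target)
    # 0 * sum over all parameters: contributes nothing to the loss value,
    # but registers every parameter in the autograd graph, so each one gets
    # a (possibly zero) gradient and DDP's synchronization stays consistent
    dummy = sum(p.sum() for p in model.parameters()) * 0.0
    return loss + dummy

# usage inside the training loop (sketch):
# ddp_model = DDP(model.to(device), device_ids=[local_rank])
# output = ddp_model(batch)
# loss = loss_with_full_graph(ddp_model, output, target, nn.CrossEntropyLoss())
# loss.backward()
```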