Hi all! I'm new to PyTorch and am using it to do distributed training. I know that `DistributedDataParallel` averages gradients across processes. Is there a way to print each process's local gradients before they are averaged?
Hi! This is possible. DDP relies on `torch.autograd.backward` to accumulate the gradients into the `grad` tensor of each model parameter. There is a functional alternative, `torch.autograd.grad`, that doesn't accumulate at all. If you're interested in the local gradients, instead of running `loss.backward()` you can run `torch.autograd.grad(loss, model.parameters())` and get back a tuple of gradient tensors, one for every model parameter. This doesn't accumulate them into the `grad` tensor of the parameters, so it doesn't trigger DDP's gradient averaging. If you want to run the regular backward pass afterwards anyway, make sure to pass the `retain_graph=True` kwarg to `torch.autograd.grad`. I haven't tried any of this out, but in theory it should work.
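To make this concrete, here is a minimal sketch of the idea. It uses a single-process `gloo` group and a toy `nn.Linear` model purely for illustration (in real training you would launch multiple processes with `torchrun`, and the local gradients would differ across ranks before the allreduce):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process group just so DDP can be constructed (assumption: in real
# use this runs under torchrun with world_size > 1).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(torch.nn.Linear(4, 2))
loss = model(torch.randn(8, 4)).sum()

# Local (pre-allreduce) gradients: torch.autograd.grad returns the gradients
# directly instead of accumulating them into .grad, so DDP's averaging
# hooks never fire. retain_graph=True keeps the graph alive for the
# regular backward pass below.
local_grads = torch.autograd.grad(loss, model.parameters(), retain_graph=True)
for (name, _), g in zip(model.named_parameters(), local_grads):
    print(f"local grad for {name}: shape {tuple(g.shape)}")

# The usual backward pass: DDP now allreduces and averages into .grad.
loss.backward()

dist.destroy_process_group()
```

With `world_size == 1` the averaged `.grad` equals the local gradient, which makes the sketch easy to sanity-check; with more ranks, `local_grads` on each process would show that rank's own gradients while `.grad` holds the average.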
Thanks a lot! Your advice is really helpful! I got the local gradients and it seems that DDP is not affected at all! It's very cool that I can get professional guidance on PyTorch's forum.
I’m glad you were able to continue!
Hi @pietern, `DistributedDataParallel` automatically averages the gradients when `loss.backward()` is called, but I couldn't find the code in the PyTorch source that collects the gradients from all nodes and averages them during the backward pass. Do you know where it is?