DistributedDataParallel gradient print

Hi all! I'm new to PyTorch and I'm using it for distributed training. I know that `DistributedDataParallel` averages gradients across processes. Is there a way to print each process's local gradient before the averaging happens?

Hi! This is possible. DDP relies on torch.autograd.backward to accumulate gradients into the grad attribute of the model parameters. There is a functional alternative, torch.autograd.grad, that doesn't accumulate at all. If you're interested in the local gradients, instead of running loss.backward() you can run torch.autograd.grad(loss, model.parameters()) and get back a tuple of gradient tensors, one for every model parameter. Since nothing is accumulated into the parameters' grad attributes, DDP's gradient averaging is not triggered. If you still want to run the regular DDP backward afterwards, make sure to pass retain_graph=True to torch.autograd.grad. I haven't tried any of this out, but in theory it should work.
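For what it's worth, here is a minimal sketch of that approach. It assumes the process group is already initialized and that `model`, `inputs`, `targets`, and `criterion` already exist; those names are placeholders, not code from this thread.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes dist.init_process_group(...) has already been called and that
# `model`, `inputs`, `targets`, and `criterion` are defined elsewhere.
ddp_model = DDP(model)

outputs = ddp_model(inputs)
loss = criterion(outputs, targets)

# Compute the *local* gradients without writing into param.grad, so DDP's
# averaging hooks are not triggered. retain_graph=True keeps the graph
# alive so loss.backward() can still be called afterwards.
local_grads = torch.autograd.grad(
    loss, ddp_model.parameters(), retain_graph=True
)

rank = dist.get_rank()
for param, grad in zip(ddp_model.parameters(), local_grads):
    print(f"rank {rank}: local grad norm {grad.norm().item():.6f} "
          f"for param of shape {tuple(param.shape)}")

# Now run the usual backward pass; DDP averages the gradients across
# processes and accumulates the result into param.grad as normal.
loss.backward()
```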

Thanks a lot! Your advice was really helpful. I got the local gradients and DDP doesn't seem to be affected at all. It's very cool that I can get professional guidance on PyTorch's forum.


I’m glad you were able to continue! :smiley:

Hi @pietern, `DistributedDataParallel` automatically averages the gradients when loss.backward() is called, but I couldn't find the part of the PyTorch source code that gathers the gradients from all nodes and averages them during the backward pass. Do you know where it is?