I have a network where I'm trying to optimize a tensor x as well as the gradient of that tensor with respect to another tensor r, i.e. dx/dr:
```python
y = torch.autograd.grad(x, r, create_graph=True)[0]  # grad() returns a tuple; create_graph so y can be optimized
```
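For context, here's a minimal single-process toy of what I mean (the names and the x = r² relationship are made up for illustration; the point is that `create_graph=True` keeps the graph so the gradient tensor can itself be part of a loss that gets backpropagated):

```python
import torch

# Toy setup: x depends on r; we want to penalize both x and dx/dr.
r = torch.tensor([2.0], requires_grad=True)
x = (r ** 2).sum()  # x = r^2, so dx/dr = 2r

# create_graph=True keeps the graph so y can be differentiated again;
# grad() returns a tuple of gradients, hence the [0].
y = torch.autograd.grad(x, r, create_graph=True)[0]

loss = x + y.sum()  # optimize x and its gradient together
loss.backward()     # d(loss)/dr = 2r + 2 = 6 at r = 2
print(r.grad)       # tensor([6.])
```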
When I run this without DDP, I can minimize both quantities as part of the loss. When I use DDP on 2 GPUs, the tensor x shows the same loss value on both GPUs, but the gradient tensor y has different values:
```python
# training ...
Nx = 0
Ny = 0
x_err = 0
y_err = 0
training_sampler.set_epoch(epoch)
for batch in training:
    x_loss, y_loss, nx, ny = batch.loss(model, cuda)  # runs through network
    loss = x_loss / nx + y_loss / ny
    Nx += nx
    Ny += ny
    optimizer.zero_grad()
    loss.backward()  # will sync?
    optimizer.step()
    x_err += reduce(x_loss).item()
    y_err += reduce(y_loss).item()
print(x_err / Nx, y_err / Ny)
```
```python
def reduce(T):
    dist.all_reduce(T, op=dist.ReduceOp.SUM)
    T /= float(dist.get_world_size())
    return T
```
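For reference, here's a self-contained sketch of what I expect `reduce` to do: `all_reduce` sums the tensor in place across all ranks, then dividing by the world size gives the average. (The process-group init boilerplate is just to make the example runnable as a single gloo process, so world_size is 1 and the average is a no-op.)

```python
import os
import torch
import torch.distributed as dist

# Single-process setup purely for demonstration.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

def reduce(T):
    # Sum T in place across all ranks, then divide by the number
    # of ranks to get the cross-GPU average.
    dist.all_reduce(T, op=dist.ReduceOp.SUM)
    T /= float(dist.get_world_size())
    return T

t = torch.tensor([4.0])
print(reduce(t))  # with world_size=1 the average equals the input: tensor([4.])

dist.destroy_process_group()
```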
```
27140.88438, 35.89582  # gpu 0
27140.88438, 37.94447  # gpu 1
```
I am extremely new to parallel computing, and honestly the behind-the-scenes machinery still seems like magic to me, but my understanding is that `loss.backward()` handles the dirty work of syncing the GPUs. I'm assuming my problem is due to the manual `torch.autograd.grad` call. Is there an obvious explanation for how this works and how I might fix it? Thank you.