Trying to optimize the gradient as part of the loss

I have a network that is trying to optimize a tensor x as well as the gradient of that tensor with respect to another tensor r, i.e. dx/dr.

        y, = torch.autograd.grad(x, r, create_graph=True)  # create_graph so dx/dr can itself be optimized
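
For context, a stripped-down version of the pattern looks roughly like this (a toy sketch, not my actual network; the squared penalties are just placeholders):

    import torch

    # toy stand-ins for the real network and inputs
    f = torch.nn.Linear(3, 1)
    r = torch.randn(8, 3, requires_grad=True)

    x = f(r).sum()

    # create_graph=True keeps the gradient differentiable so it can be
    # penalized in the loss (double backward)
    y, = torch.autograd.grad(x, r, create_graph=True)

    loss = x.pow(2) + y.pow(2).mean()  # placeholder penalties on x and dx/dr
    loss.backward()                    # backprops through both x and dx/dr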

When I run this without DDP, I can minimize both quantities as part of the loss. When I use DDP on 2 GPUs, the tensor x shows the same loss value on both GPUs, but the gradient tensor y shows different values:

    # training
    ...
    Nx = 0
    Ny = 0
    x_err = 0
    y_err = 0

    training_sampler.set_epoch(epoch)
    for batch in training:
        x_loss, y_loss, nx, ny = batch.loss(model, cuda)  # runs through network
        loss = x_loss / nx + y_loss / ny
        Nx += nx
        Ny += ny

        optimizer.zero_grad()
        loss.backward()  # will sync?

        optimizer.step()
        x_err += reduce(x_loss).item()
        y_err += reduce(y_loss).item()

    print(x_err / Nx, y_err / Ny)

where:

    def reduce(T):
        # average T across all ranks (all_reduce modifies T in place)
        dist.all_reduce(T, op=dist.ReduceOp.SUM)
        T /= float(dist.get_world_size())
        return T

output:

    27140.88438,  35.89582 # gpu 0
    27140.88438,  37.94447 # gpu 1

I am extremely new to parallel computing, and honestly what happens behind the scenes seems like magic to me, but my understanding is that loss.backward() handles the dirty work of syncing the GPUs. I'm assuming my problem is due to the manual torch.autograd.grad call. Is there an obvious explanation of how this works and how I might fix it? Thank you.

Can you provide a minimal code example with which we can reproduce the result? In particular, how do you wrap your model with DDP? Thanks!
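
Even a stripped-down skeleton would be enough to go on, e.g. something along these lines (toy model and data standing in for your real ones):

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, TensorDataset
    from torch.utils.data.distributed import DistributedSampler

    dist.init_process_group(backend="nccl")          # launched with torchrun
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(3, 1).cuda(local_rank)   # stand-in for the real model
    model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(64, 3))      # stand-in for the real data
    training_sampler = DistributedSampler(dataset)
    training = DataLoader(dataset, batch_size=8, sampler=training_sampler)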

Thanks. I will try to make a simple example. It’s very likely I’m doing something stupid or just don’t understand how this works.

After much debugging I realized that this is not a bug. The issue was that my nx, ny values are different per sample, so the GPUs were receiving different information (different per-rank counts), and the printed averages legitimately differ. My bad!
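
In case it helps anyone else: DDP's backward only synchronizes the model's parameter gradients; locally computed quantities such as the losses and the nx/ny counts stay per-rank, so to get identical printed averages the counts have to be reduced too. A rough sketch, assuming Nx/Ny are plain Python numbers and the process group is already set up:

    # reduce the sample counts the same way as the losses so the
    # printed averages come out identical on every rank
    counts = torch.tensor([float(Nx), float(Ny)], device="cuda")
    dist.all_reduce(counts, op=dist.ReduceOp.SUM)
    counts /= dist.get_world_size()   # mean count per rank, same on every rank

    print(x_err / counts[0].item(), y_err / counts[1].item())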