Manually modify gradients of two models, average them and put them back in both models!

I am trying to train two models on two mutually exclusive portions of a dataset.
While training both models, I want to manually extract the gradients from model A and model B after the forward and backward passes, then, before updating the weights, average the two models' gradients, write that average back into both models, and only then update the weights.

I am trying it in the following way:

def mergeGrad(modelA, modelB):
    # Average the gradients of the two models parameter by parameter
    # and write the average back into both models.
    for pA, pB in zip(modelA.parameters(), modelB.parameters()):
        avg = (pA.grad + pB.grad) / 2
        pA.grad = avg
        pB.grad = avg

In the optimization portion, I am doing:

optimizerA.zero_grad()
lossA.backward()

optimizerB.zero_grad()
lossB.backward()

# inspect the gradients before and after merging
print(next(modelA.parameters()).grad, next(modelB.parameters()).grad)
mergeGrad(modelA, modelB)
print(next(modelA.parameters()).grad, next(modelB.parameters()).grad)

Now, I am printing the gradients before and after averaging and writing them back. Only in the very first iteration does the first print statement output separate gradients for modelA and modelB, and right after merging them I get the averaged gradients as I want.

However, from the second iteration onward, both models always produce the same gradients after the forward and backward passes (and hence the same averaged values). How is this possible?
The models are seeing different data!

Is the manual gradient modification that I am doing wrong?
Please do suggest if there is a better way to manually modify the gradients and put them back into the models!
Thanks a lot!

Hi,

I would bet the issue is with:

        pA.grad = avg
        pB.grad = avg

Here you set the same Tensor to be the gradient for both parameters, so the backward passes will accumulate into that same Tensor.
If you don't want that, you need to add a .clone() for at least one of them.

Note that this can also be a neat way to implement what you want: make the .grad fields share the same Tensor so that the two backward passes actually accumulate into it, and divide your learning rate by 2 so that you still perform the same update (assuming plain SGD).
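
For reference, here is a minimal sketch of that shared-gradient setup, assuming plain SGD; the model and optimizer names below are illustrative, not from the original post:

import torch

# Illustrative models with identical architecture.
modelA = torch.nn.Linear(10, 1)
modelB = torch.nn.Linear(10, 1)

# Point each pB.grad at the very same Tensor object as pA.grad, so that
# both backward passes accumulate into one shared buffer.
for pA, pB in zip(modelA.parameters(), modelB.parameters()):
    pA.grad = torch.zeros_like(pA)
    pB.grad = pA.grad

# The shared buffer then holds the *sum* of the two gradients, so halving
# the learning rate gives the same update as averaging would.
lr = 0.1
optimizerA = torch.optim.SGD(modelA.parameters(), lr=lr / 2)
optimizerB = torch.optim.SGD(modelB.parameters(), lr=lr / 2)

# Per iteration (sketch): zero the shared buffers once, in place, before
# both backward passes, then step both optimizers.
#   optimizerA.zero_grad(set_to_none=False)
#   lossA.backward()
#   lossB.backward()
#   optimizerA.step()
#   optimizerB.step()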

Thanks a lot. grad.clone() solved the issue. Also appreciate the neat trick you mentioned!

Hi,
Do you have a more efficient averaging approach?
As the number of models increases, the for loop does not perform that well time-wise…

At the moment, not much, no. If your weights are large enough, that shouldn't be a problem.
But we are working on improving that; see the work-in-progress (so subject to heavy change!) optimizers: https://github.com/pytorch/pytorch/blob/master/torch/optim/_multi_tensor/sgd.py
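
In the meantime, one way to trim some of the per-model overhead is a rough sketch like the following, assuming all models share the same architecture and their .grad fields are already populated (models is a hypothetical list of them):

import torch

def merge_grads(models):
    # Walk all models' parameters in lockstep; `params` holds the
    # corresponding parameter from every model.
    for params in zip(*(m.parameters() for m in models)):
        # Average the stacked gradients of this parameter in one op.
        avg = torch.stack([p.grad for p in params]).mean(dim=0)
        for p in params:
            # Clone so the models never share a gradient Tensor.
            p.grad = avg.clone()

This still loops over parameters, but the cross-model averaging for each parameter happens in a single stacked operation instead of pairwise additions.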

Can we use clone in the line:
avg = (pA.grad + pB.grad)/2

Like:
avg = (pA.grad.clone() + pB.grad.clone())/2

and then simply use that average like:
pA.grad = avg
pB.grad = avg

That doesn’t change the fact that pA.grad is pB.grad and so the same issue will appear.
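
A quick way to see the aliasing, using standalone tensors in place of the real gradients:

import torch

gA = torch.tensor([1.0, 2.0])
gB = torch.tensor([3.0, 4.0])

avg = (gA.clone() + gB.clone()) / 2  # cloning the inputs builds a fresh Tensor...
a = avg
b = avg                              # ...but both names still refer to that one Tensor

print(a is b)   # True: a and b alias each other
a.add_(10.0)    # accumulating into one in place...
print(b)        # ...is visible through the other as well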

Okay, thank you so much.

And do we have a way other than manually zipping it for, say, 10 clients?

What do you mean by zip?
The suggestion above was just to replace this with:

pA.grad = avg
pB.grad = avg.clone()
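
For completeness, putting that fix back into the helper from the original post gives roughly:

def mergeGrad(modelA, modelB):
    # Average the gradients parameter by parameter and write the result
    # back, cloning one side so the two models never share a Tensor.
    for pA, pB in zip(modelA.parameters(), modelB.parameters()):
        avg = (pA.grad + pB.grad) / 2
        pA.grad = avg
        pB.grad = avg.clone()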