Manually manipulating model gradients and updating parameters

Currently I have two instances of the same model, modelA and modelB, with different optimizers.

I have a training loop where a forward pass happens on each model and gradients are accumulated (no optim.step() yet).

Now, after the backward passes but before calling the optimizer steps, I want to sum up the gradients of the two models:

sum_grads = modelA.gradients + modelB.gradients
modelA.gradients = sum_grads
modelB.gradients = sum_grads

optimA.step()
optimB.step()

How can I do this in PyTorch?

Thanks

More details

    # Zero grads
    optim1.zero_grad()
    optim2.zero_grad()

    # Model 1
    for batch_idx, (X, y_true) in enumerate(dl1):
        X = X.to(device)
        y_true = y_true.to(device)
        y_pred = model_p1(X)
        loss1 = loss_1(y_pred, y_true)
        loss1.backward()

    # Model 2
    for batch_idx, (X, y_true) in enumerate(dl2):
        X = X.to(device)
        y_true = y_true.to(device)
        y_pred = model_p2(X)
        loss2 = loss_2(y_pred, y_true)
        loss2.backward()

    # Combine (SUM gradients)
    for pA, pB in zip(model_p1.parameters(), model_p2.parameters()):
        sum_grads = pA.grad + pB.grad
        pA.grad = sum_grads
        pB.grad = sum_grads.clone()

    # Update model parameters using summed gradients
    optim1.step()
    optim2.step()

This is what I currently have, but it doesn't seem to be working:

    for pA, pB in zip(model_p1.parameters(), model_p2.parameters()):
        sum_grads = pA.grad + pB.grad
        pA.grad = sum_grads
        pB.grad = sum_grads.clone()
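For completeness, a slightly more defensive version of the same loop (just a sketch: it wraps the arithmetic in torch.no_grad() and skips any parameter whose .grad is still None):

    import torch

    with torch.no_grad():
        for pA, pB in zip(model_p1.parameters(), model_p2.parameters()):
            if pA.grad is None or pB.grad is None:
                continue  # skip parameters that received no gradient
            summed = pA.grad + pB.grad
            # Write the sum back into both .grad tensors in place
            pA.grad.copy_(summed)
            pB.grad.copy_(summed)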

I want to update parameters based on summed gradients.

Thanks

What's the issue you are seeing when using this approach?

The model isn't training. My training loss stays constant.

Could you check (some) parameter values before the optimizer.step() is called and compare them to the values afterwards? Do you see any changes or are they static?
In the latter case, are you sure you’ve passed these parameters to the optimizer?
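For example, something along these lines (a quick sketch; it snapshots model_p1's parameters, calls optim1.step(), and compares):

    import torch

    # Snapshot the parameters, take a step, then check whether anything moved
    before = [p.detach().clone() for p in model_p1.parameters()]
    optim1.step()
    changed = any(
        not torch.equal(b, p.detach())
        for b, p in zip(before, model_p1.parameters())
    )
    print("parameters changed:", changed)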

Are the models initialized with the same weights?
Also, what is the motivation to follow this approach?

The models are initialized with random weights. The motivation behind this approach is that I'm using it in a federated learning environment; in this case, the datasets for the models are subsets of a bigger dataset.

I might be completely wrong, as I have not worked in this domain, but I think you should initialize the models with the same weights. When the models are initialized differently/randomly, the order of filters/weights doesn't align between them.
Then, when the gradients of non-aligned weights are added at the end, the resulting summed gradients might just be a corrupted version of the original gradients (i.e. noise) and may not provide any useful signal to the weight update step.
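If you want to try that, copying one model's initial state into the other should be enough (assuming both models have identical architectures):

    # Make model_p2 start from exactly the same weights as model_p1
    model_p2.load_state_dict(model_p1.state_dict())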

Nah, I believe we should get similar convergence to the optimum. However, I can see that the parameter values are changing, so I will try training for a lot longer.