I am currently trying to implement a Federated Learning algorithm, and have a question about gradient accumulation for model parameters.
To simulate the server model receiving gradients from all client models (akin to FedSGD) before each update, I have implemented the following code:
```python
optim = torch.optim.SGD(network.parameters(), lr=0.001)
optim.zero_grad()

for client in client_list:
    # compute g_client as an output of torch.autograd.grad
    # ...
    for p, g in zip(network.parameters(), g_client):
        if p.grad is None:
            # clone so later in-place additions don't mutate g_client's tensors
            p.grad = g.clone()
        else:
            p.grad += g

# average the accumulated gradients
for p in network.parameters():
    p.grad /= len(client_list)

optim.step()
```
I am wondering whether this is a proper way to accumulate the gradients from the different clients before updating my network module. This is my first time doing backpropagation and weight updates without relying on `loss.backward(); optim.step()`, so I am wondering if there are some under-the-hood mechanisms I am missing.
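For what it's worth, here is a minimal self-contained sketch I used to sanity-check the idea. It uses a toy linear model and made-up per-client batches (`client_batches` is illustrative, not part of my real setup), and verifies that manually accumulating `torch.autograd.grad` outputs into `.grad` and dividing by the client count gives the same gradients as calling `backward()` on the mean of the client losses:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy "server" network and fake per-client batches (names are illustrative).
network = nn.Linear(4, 1)
client_batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(3)]

optim = torch.optim.SGD(network.parameters(), lr=0.001)
optim.zero_grad()

for x, y in client_batches:
    loss = nn.functional.mse_loss(network(x), y)
    # torch.autograd.grad returns gradients without touching .grad
    g_client = torch.autograd.grad(loss, list(network.parameters()))
    for p, g in zip(network.parameters(), g_client):
        if p.grad is None:
            p.grad = g.clone()
        else:
            p.grad += g

# average the accumulated gradients
for p in network.parameters():
    p.grad /= len(client_batches)

# Cross-check: the averaged .grad should equal backward() on the mean loss.
grads_accum = [p.grad.clone() for p in network.parameters()]
optim.zero_grad()
mean_loss = sum(
    nn.functional.mse_loss(network(x), y) for x, y in client_batches
) / len(client_batches)
mean_loss.backward()
for g_acc, p in zip(grads_accum, network.parameters()):
    assert torch.allclose(g_acc, p.grad, atol=1e-6)
print("accumulated average matches backward() on the mean loss")
```

The two paths agree on this toy example, which at least suggests the accumulation itself is sound; whether the surrounding FL logic is correct is a separate question.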