I am currently trying to implement a Federated Learning algorithm, and I have a question about gradient accumulation for model parameters.
To simulate the server model receiving gradients from all client models (akin to FedSGD) before each update, I have implemented the following code:
optim = torch.optim.SGD(network.parameters(), lr=0.001)
optim.zero_grad()
for client in client_list:
    # compute g_client as an output of torch.autograd.grad
    # ...
    for p, g in zip(network.parameters(), g_client):
        if p.grad is None:
            p.grad = g.clone()  # clone so the in-place += below does not mutate g
        else:
            p.grad += g
for p in network.parameters():  # average the accumulated gradients
    p.grad /= len(client_list)
optim.step()
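For reference, here is a minimal self-contained sketch of this pattern. The `torch.nn.Linear` model and the synthetic per-client `(x, y)` batches are hypothetical stand-ins for the real network and clients; the accumulation loop itself mirrors the code above:

```python
import torch

torch.manual_seed(0)
network = torch.nn.Linear(3, 1)  # hypothetical stand-in for the real server model
optim = torch.optim.SGD(network.parameters(), lr=0.001)

# synthetic per-client batches standing in for the real clients
client_list = [(torch.randn(8, 3), torch.randn(8, 1)) for _ in range(4)]

optim.zero_grad()
for x, y in client_list:
    loss = torch.nn.functional.mse_loss(network(x), y)
    # per-client gradients, analogous to g_client above
    g_client = torch.autograd.grad(loss, network.parameters())
    for p, g in zip(network.parameters(), g_client):
        if p.grad is None:
            p.grad = g.clone()  # clone so += does not mutate the autograd output
        else:
            p.grad += g
for p in network.parameters():  # average over clients
    p.grad /= len(client_list)
optim.step()
```

Because gradients are linear in the loss, this update is equivalent to calling `backward()` on the mean of the per-client losses and then stepping, which is one way to sanity-check the accumulation.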
I am wondering if this is a proper way to accumulate the gradients from the different clients before updating my network module? This is my first time doing model backpropagation and weight updates without relying on “loss.backward(); optim.step()”, so I am wondering if there are some under-the-hood mechanisms that I am missing for model training.