Model weights do not update when gradients are computed on distributed machines

I am trying to train the global model by receiving gradients from distributed devices, but the model's weights are not updated. I have no idea why… Thanks in advance.

import torch
import torch.optim as optim
from torch.autograd import Variable

mod = Model()
cp = torch.load(rep)  # rep: path to the saved checkpoint
mod.load_state_dict(cp['model_state_dict'])
optimizer = optim.Adam(mod.parameters(), lr=0.001)
optimizer.load_state_dict(cp['optimizer_state_dict'])

G = ComputeGradient()  # gradients received from the distributed devices
mod.train()
for u in range(len(G)):
    optimizer.zero_grad()
    loss = Variable(G[u], requires_grad=True)
    loss.backward()
    optimizer.step()

Hi,

The gradient history is recorded through Variables. So if you only wrap the loss in a Variable at the very last moment, the only history it has is empty, and backward() has nothing to propagate back to the model's parameters.
At the moment the autograd engine only works with Tensors that live within the current process. Support for distributed autograd is currently being added. You can find the PR with all the doc here.
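To make that first point concrete, here is a minimal sketch (the nn.Linear model and the 0.42 value are just placeholders): a scalar wrapped as a fresh tensor at the last moment carries no graph back to the model, so backward() leaves the parameters' gradients untouched.

import torch
import torch.nn as nn

model = nn.Linear(4, 1)

# A value computed somewhere else (e.g. received from another machine),
# only now wrapped as a new tensor that requires grad.
loss = torch.tensor(0.42, requires_grad=True)

loss.backward()           # the graph behind `loss` is empty
print(model.weight.grad)  # None: no gradient ever reached the model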

Hi,
Thanks for the reply. I am not sure I totally got it. Instead of the PR, is there any cheap trick to solve this problem? Thanks in advance.

If there was, we wouldn’t have made a whole package to support this :slight_smile:
The PR I linked corresponds to the documentation for functions that are already in PyTorch master.

You can try to emulate something similar for your particular use case, but that would require calling backward yourself on each node, one after the other, after having transferred the gradients for the shared Tensors.
This can be tricky to do.
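For what it's worth, one rough way to emulate this on the receiving side, assuming the workers have already called backward() and sent their p.grad tensors over (apply_remote_gradients and received_grads below are hypothetical names, not part of any PyTorch API), is to copy the transferred gradients into each parameter's .grad field and step the optimizer directly, without calling backward() on the global model at all:

import torch

def apply_remote_gradients(model, optimizer, received_grads):
    # received_grads: assumed dict {parameter name: gradient tensor},
    # produced on a worker by running backward() and collecting p.grad.
    optimizer.zero_grad()
    for name, param in model.named_parameters():
        grad = received_grads.get(name)
        if grad is not None:
            param.grad = grad.to(param.device)
    optimizer.step()  # update the weights with the copied gradients

Whether this is correct for your setup depends on how the gradients are produced and combined across workers; the distributed autograd work linked above handles those details for you.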