Model weights do not update when gradients are computed on distributed machines

I am trying to train the global model by receiving gradients from distributed devices, but the model's weights are not updated. I have no idea why… Thanks in advance.

import torch
import torch.optim as optim
from torch.autograd import Variable

mod = Model()
cp = torch.load(rep)  # rep: path to the saved checkpoint
mod.load_state_dict(cp['model_state_dict'])
optimizer = optim.Adam(mod.parameters(), lr=0.001)
optimizer.load_state_dict(cp['optimizer_state_dict'])

G = ComputeGradient()  # gradients received from the distributed devices
mod.train()
for u in range(len(G)):
    optimizer.zero_grad()
    loss = Variable(G[u], requires_grad=True)
    loss.backward()
    optimizer.step()

Hi,

The gradient history is recorded through Variables. So if you only wrap the loss in a Variable at the very last moment, the only history it has is empty, and backward() has nothing to propagate back to the model's parameters.
At the moment the autograd engine only works with Tensors that live within the current process. Support for distributed autograd is currently being added. You can find the PR with all the doc here.
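To make that first point concrete, here is a minimal sketch (the nn.Linear model and the 0.42 value are just placeholders): a scalar wrapped as a fresh tensor at the last moment carries no graph back to the model, so backward() leaves the parameters' gradients untouched.

import torch
import torch.nn as nn

model = nn.Linear(4, 1)

# A value computed somewhere else (e.g. received from another machine),
# only now wrapped as a new tensor that requires grad.
loss = torch.tensor(0.42, requires_grad=True)

loss.backward()           # the graph behind `loss` is empty
print(model.weight.grad)  # None: no gradient ever reached the model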

Hi,
Thanks for the reply. I am not sure I totally got it. Instead of the PR, is there any cheap trick to solve this problem? Thanks in advance.

If there was, we wouldn’t have made a whole package to support this :slight_smile:
The PR I linked corresponds to the documentation for functions that are already in PyTorch master.

You can try to emulate something similar for your particular use case, but that would require calling backward yourself on each node, one after the other, after having transferred the gradients for the shared Tensors.
This can be tricky to do.
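For what it's worth, one rough way to emulate this on the receiving side, assuming the workers have already called backward() and sent their p.grad tensors over (apply_remote_gradients and received_grads below are hypothetical names, not part of any PyTorch API), is to copy the transferred gradients into each parameter's .grad field and step the optimizer directly, without calling backward() on the global model at all:

import torch

def apply_remote_gradients(model, optimizer, received_grads):
    # received_grads: assumed dict {parameter name: gradient tensor},
    # produced on a worker by running backward() and collecting p.grad.
    optimizer.zero_grad()
    for name, param in model.named_parameters():
        grad = received_grads.get(name)
        if grad is not None:
            param.grad = grad.to(param.device)
    optimizer.step()  # update the weights with the copied gradients

Whether this is correct for your setup depends on how the gradients are produced and combined across workers; the distributed autograd work linked above handles those details for you.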