Yes.
What confuses me here is that the forward function of TrainerNet seems to be just a dummy, and all it does is call that of ParameterServer.
Yes, you are right. In this specific case, as x does not require grad, there is no need to link it to the distributed autograd graph. So there are only two pieces of the autograd graph, one on the PS and one on the trainer. (I was wrong when saying there were three pieces in previous comments.)
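For reference, a minimal sketch of what such a thin wrapper can look like. This is an assumption of mine, not the tutorial's exact code (the tutorial uses its own RPC helper); here the RRef's `rpc_sync()` proxy is used instead:

```python
import torch.nn as nn

class TrainerNet(nn.Module):
    """Thin wrapper: holds no parameters itself, only an RRef to the model on the ps."""

    def __init__(self, param_server_rref):
        super().__init__()
        self.param_server_rref = param_server_rref

    def forward(self, x):
        # The "dummy" forward: ship x to the ps over RPC and return whatever
        # ParameterServer.forward computes there.
        return self.param_server_rref.rpc_sync().forward(x)
```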
As far as I understand, in the PS architecture data never leaves the trainer machine, and the whole gradient computation should be done entirely locally on the trainer machine.
There are different ways to implement this. Imagine there is a super large embedding table and the trainer only holds several lookup indices in each iteration. One solution is to do the training entirely on the trainer, but then the application will need to implement update functions that convert the indices and gradients on the trainer back into embedding table gradients. Another option is to let the autograd engine take care of this, so that simply calling loss.backward() on the trainer is sufficient to update the embedding table on the ps.
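A rough sketch of that second option, under assumptions of mine (the worker name "ps", the table size, and the helper names are illustrative, not from the tutorial). The table stays on the ps, the trainer only sends indices, and distributed autograd plus the distributed optimizer handle the update on the ps:

```python
import torch
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc
from torch.distributed.optim import DistributedOptimizer
from torch.optim import SGD

def _param_rrefs(module_rref):
    # Runs on the ps (the owner of module_rref): wrap each parameter in an
    # RRef so the trainer-side DistributedOptimizer can address it remotely.
    return [rpc.RRef(p) for p in module_rref.local_value().parameters()]

def make_remote_embedding():
    # The big table lives on the ps; the trainer only keeps an RRef to it.
    emb_rref = rpc.remote("ps", torch.nn.EmbeddingBag, args=(1_000_000, 64))
    param_rrefs = rpc.rpc_sync("ps", _param_rrefs, args=(emb_rref,))
    dist_optim = DistributedOptimizer(SGD, param_rrefs, lr=0.05)
    return emb_rref, dist_optim

def train_step(emb_rref, dist_optim, indices, offsets, target, loss_fn):
    with dist_autograd.context() as context_id:
        # Only the lookup indices travel to the ps; the looked-up rows come back.
        out = emb_rref.rpc_sync().forward(indices, offsets)
        loss = loss_fn(out, target)
        # Distributed autograd sends the output grads back to the ps, which then
        # computes the embedding-table grads locally -- no index/grad conversion
        # code needed on the trainer side.
        dist_autograd.backward(context_id, [loss])
        dist_optim.step(context_id)
```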
If you can describe all the communication that happens in one training iteration, that would be great. For instance, assume that we have one PS machine and two trainer machines. PS has the model and each trainer has a few data samples. What is sent to whom?
Sure. Since trainers are independent in that tutorial IIUC, I will only describe what happens between a PS-trainer pair.
In the forward pass, there are two comms: 1) trainer -> ps to send the input sample, and 2) ps -> trainer to send the output.
In the backward pass, there is one comm: trainer -> ps to send the gradients for the model outputs, which will then trigger the local autograd engine on the ps to compute gradients for the model parameters.
In the optimizer step, there is one comm: trainer -> ps to tell the local optimizer on the ps to update the model parameters. (It is possible to pack this into the comm during the backward pass using hooks.)
Since there are two trainers accessing the same model, instead of storing the grads in param.grad, the ps keeps those grads in a dedicated store associated with each distributed autograd context, and those grads are later consumed by the distributed optimizer (sketched below).
More details about dist autograd can be found here: https://pytorch.org/docs/stable/rpc/distributed_autograd.html#distributed-autograd-design
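Putting that together, a hedged sketch of one trainer iteration, with the comms above marked in comments (net is a TrainerNet-style wrapper as sketched earlier; the names are illustrative):

```python
import torch.distributed.autograd as dist_autograd

def run_iteration(net, dist_optim, data, target, loss_fn):
    # Each trainer opens its own distributed autograd context; on the ps the
    # resulting grads are keyed by this context id instead of param.grad, so
    # the two trainers never clobber each other.
    with dist_autograd.context() as context_id:
        # Comm 1 (trainer -> ps): the input sample travels with the RPC.
        # Comm 2 (ps -> trainer): the model output travels back in the reply.
        output = net(data)
        loss = loss_fn(output, target)
        # Comm 3 (trainer -> ps): grads of the model outputs are sent, which
        # triggers the local autograd engine on the ps to compute the
        # parameter grads for this context.
        dist_autograd.backward(context_id, [loss])
        # Comm 4 (trainer -> ps): ask the local optimizer on the ps to apply
        # this context's grads.
        dist_optim.step(context_id)
```

On the ps side, dist_autograd.get_gradients(context_id) returns the gradients accumulated for that context, which is what the distributed optimizer consumes.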
Hogwild! assumes shared memory, so the setup is inherently different from that of the PS, right? I cannot entirely digest how/why you blend these two setups. Would you please clarify?
Right, the original paper was mainly focusing on shm, but the lock-free spirit can be applied to distributed training as well. This is especially useful for training on large datasets with large embedding tables.