How to Do Semi-Asynchronous or Asynchronous Training with PyTorch

When PyTorch is used for distributed training, DDP is usually good enough. However, when the performance of the nodes differs, the speed of the whole training job is determined by the slowest node. For example, if worker 0 needs 1 second for a forward and backward pass while worker 1 needs 2 seconds, one step takes 2 seconds.
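A toy illustration of the straggler effect described above (the worker names and timings are made up, and this is plain Python, not a PyTorch API): because DDP all-reduces gradients every step, every worker waits for the slowest one.

```python
# Hypothetical per-step compute times for two DDP workers (seconds).
step_times = {"worker0": 1.0, "worker1": 2.0}

# With synchronous gradient all-reduce, the step time is the maximum
# over all workers, not the average.
sync_step_time = max(step_times.values())
print(sync_step_time)  # 2.0
```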

So I am wondering: is there a way to do semi-asynchronous training with PyTorch?

There is a similar library called hivemind, but it is designed for training over the Internet, while we prefer to run the training job in our own cluster.

Hi Siyuan, thanks for your question.

The PyTorch distributed package offers a Remote Procedure Call (RPC) API that you could use to implement your asynchronous training algorithm.

Here’s an example of how to implement a parameter server using RPC: Implementing a Parameter Server Using Distributed RPC Framework — PyTorch Tutorials 1.12.0+cu102 documentation

Let us know if that makes sense for you.