When PyTorch is used for distributed training, DDP is usually good enough. However, when the nodes differ in performance, the throughput of the whole job is determined by the slowest node, because DDP's gradient all-reduce is synchronous. For example, if worker 0 needs 1 second for a forward and backward pass while worker 1 needs 2 seconds, one step takes 2 seconds.
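To show what I mean, here is the standard synchronous DDP loop (a minimal sketch with a toy model and synthetic data, assuming one GPU per process and a NCCL backend; launched with torchrun). The all-reduce fired inside `backward()` is the point where fast ranks wait for slow ones:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")  # reads rank/world size from torchrun env vars
device = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(device)

model = DDP(torch.nn.Linear(128, 10).to(device), device_ids=[device])
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

for step in range(100):
    x = torch.randn(32, 128, device=device)          # synthetic batch
    y = torch.randint(0, 10, (32,), device=device)
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()   # gradients are all-reduced here: fast ranks block on slow ones
    optimizer.step()
```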
So I am wondering: is there a way to do semi-asynchronous training with PyTorch?
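To make "semi-asynchronous" concrete, the behavior I have in mind is roughly local SGD: each worker runs several local steps on a plain (non-DDP) module, and the ranks only average parameters every few steps, so a slow node stalls the others once per sync interval instead of on every step. This is just my own sketch of the idea, not an existing PyTorch API; `sync_every` and `average_parameters` are names I made up, and it reuses the process group and `device`/`loss_fn` setup from the snippet above:

```python
sync_every = 8  # hypothetical knob: how many local steps between syncs

def average_parameters(module):
    # Blocking parameter average across all ranks.
    world_size = dist.get_world_size()
    for p in module.parameters():
        dist.all_reduce(p.data, op=dist.ReduceOp.SUM)
        p.data /= world_size

local_model = torch.nn.Linear(128, 10).to(device)   # plain module, no DDP wrapper
opt = torch.optim.SGD(local_model.parameters(), lr=0.01)

for step in range(100):
    x = torch.randn(32, 128, device=device)
    y = torch.randint(0, 10, (32,), device=device)
    opt.zero_grad()
    loss_fn(local_model(x), y).backward()            # no per-step all-reduce
    opt.step()
    if (step + 1) % sync_every == 0:
        average_parameters(local_model)              # only sync point, every k steps
```

Something like this (or a proper parameter-server / bounded-staleness scheme) is what I mean by semi-asynchronous.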
There is a similar library called hivemind, but it is designed for training over the Internet, while we would prefer to run the training job inside our own cluster.