Controlling epochs for DistributedDataParallel

When using DistributedDataParallel, is it possible to run processes with different numbers of epochs? Say I would like to run the process on machine one for 20 epochs, syncing with master, and after those 20 epochs continue training on master only. Is there a workaround for this? I used one of the samples from the tutorials, but when the epoch counts differ, the master keeps waiting to sync even though the process on the other machine has already finished.

All DDP instances need to participate in every backward pass; otherwise the ones that do will hang waiting for the rest. But there is a workaround. If you know that master will run, say, 100 epochs and the other nodes 80, you can call forward-backward on the DDP instance for the first 80 epochs. After that, delete the DDP instance, which removes the DDP grad hooks. Then, on master, run forward-backward on the original local module for the remaining epochs (once the DDP instance is deleted you can no longer reach it through DDP.module, so the program should keep a separate reference to the local module), and it will no longer trigger communication.
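A minimal sketch of that workaround, assuming a CPU model and the gloo backend. The names `run`, `train_one_epoch`, `SHARED_EPOCHS`, and `MASTER_EPOCHS`, along with the toy model and random data, are made up for illustration and are not from the tutorial; the epoch counts are kept tiny so it finishes quickly.

```python
import gc

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

SHARED_EPOCHS = 2  # hypothetical: epochs every rank runs together (80 above)
MASTER_EPOCHS = 3  # hypothetical: total epochs on rank 0 (100 above)


def train_one_epoch(model, optimizer):
    # Toy "epoch": a single batch of random data, just to drive forward-backward.
    inputs, targets = torch.randn(8, 4), torch.randn(8, 1)
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(inputs), targets)
    loss.backward()  # with a DDP model this all-reduces gradients across ranks
    optimizer.step()


def run(rank, world_size, init_method):
    dist.init_process_group(
        "gloo", init_method=init_method, rank=rank, world_size=world_size
    )

    # Keep a separate reference to the local module; DDP wraps the same parameters.
    local_model = nn.Linear(4, 1)
    optimizer = torch.optim.SGD(local_model.parameters(), lr=0.01)
    ddp_model = DDP(local_model)

    epochs_done = 0

    # Phase 1: every rank participates, so gradients stay in sync.
    for _ in range(SHARED_EPOCHS):
        train_one_epoch(ddp_model, optimizer)
        epochs_done += 1

    # Per the answer above, deleting the DDP instance removes its grad hooks.
    # gc.collect() is only a precaution in case a reference cycle keeps it alive.
    del ddp_model
    gc.collect()

    # Phase 2: only rank 0 keeps training, on the plain local module.
    if rank == 0:
        for _ in range(MASTER_EPOCHS - SHARED_EPOCHS):
            train_one_epoch(local_model, optimizer)
            epochs_done += 1

    dist.destroy_process_group()
    return epochs_done
```

Each process would call `run` with its own rank and a shared `init_method` (for example a `tcp://host:port` address reachable from all machines). The other ranks simply return after the shared phase, and rank 0 continues without waiting for them.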
