I have a single model that I would like to train on a single node with N GPUs. Currently I am using DDP; however, I suspect I may not be using it in an ideal way, or perhaps a different method of distributing the computation would be better for my use case.
Each process gets a subset of a fixed task set. For each task, the process uses the model (inside a no_grad context manager) to generate a variable number of training examples, which are then used to compute a loss and run backprop. Each process accumulates gradients over B tasks (recall that each task generates a variable number of examples) before calling optimizer.step().
To reduce syncing overhead I am currently computing the losses with model.require_backward_grad_sync = False (the flag that the no_sync() context manager toggles) for all examples except the last example generated from every Bth task.
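To make the setup concrete, here is a minimal single-process sketch of the loop described above. The names make_tasks and generate_examples are hypothetical stand-ins for my task set and example-generation step, and I've used no_sync() rather than setting the flag directly, since that is the documented interface:

```python
import os
from contextlib import nullcontext

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)  # single process for illustration

model = DDP(torch.nn.Linear(4, 4))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def make_tasks(n):
    # Hypothetical fixed task set; each task is just a random vector here.
    return [torch.randn(4) for _ in range(n)]

def generate_examples(model, task):
    # The model generates a variable number of training examples under no_grad.
    with torch.no_grad():
        k = 2 + int(task.abs().sum().item()) % 3  # variable example count
        return [(task + 0.1 * i, model(task)) for i in range(k)]

def train(model, optimizer, tasks, B):
    steps = 0
    for t, task in enumerate(tasks):
        examples = generate_examples(model, task)
        last_task_in_group = (t + 1) % B == 0
        for e, (x, y) in enumerate(examples):
            # Sync only on the final backward of every Bth task; no_sync()
            # suppresses the all-reduce while gradients still accumulate
            # locally in param.grad.
            sync_now = last_task_in_group and e == len(examples) - 1
            ctx = nullcontext() if sync_now else model.no_sync()
            with ctx:  # the forward must run inside the same context
                loss = torch.nn.functional.mse_loss(model(x), y)
                loss.backward()
        if last_task_in_group:
            optimizer.step()
            optimizer.zero_grad()
            steps += 1
    return steps

n_steps = train(model, optimizer, make_tasks(4), B=2)
dist.destroy_process_group()
```

With 4 tasks and B = 2, this takes an optimizer step after tasks 2 and 4; only the very last backward of each group of B tasks triggers the all-reduce.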
First, I am wondering whether this is even a good use case for DDP, given that I am explicitly turning off syncing and only need it, on average, once every B * (average number of examples generated per task) calls to backward().
Second, do I actually need to call forward() AND backward() (with require_backward_grad_sync=True) on a submodule to have that particular submodule synced?
Third, is it okay to set require_backward_grad_sync=True, call forward() on a submodule, set require_backward_grad_sync=False immediately afterward, call forward() and backward() on some tensors that depend on the submodule's outputs, then set require_backward_grad_sync=True and call forward() and backward() once more on tensors that depend on the submodule's outputs? Will this submodule be properly synced, with the gradients accumulated over all of these backward() calls?
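Here is the exact sequence from the third question as a minimal single-process repro (sub stands in for the submodule; x is an arbitrary input). Part of my confusion is that, as far as I can tell, DDP consults this flag during forward(), so I'm not sure toggling it between forward and backward takes effect at all:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)  # single process for illustration

sub = DDP(torch.nn.Linear(3, 1))  # the "submodule" in question
x = torch.randn(3)

sub.require_backward_grad_sync = True
h = sub(x)  # forward while the sync flag is on

sub.require_backward_grad_sync = False
(2 * h).sum().backward(retain_graph=True)  # backward #1, flag off

sub.require_backward_grad_sync = True
(3 * h).sum().backward()  # backward #2, flag back on

# The question: does weight.grad now hold the accumulated gradients from
# both backward passes, properly all-reduced across ranks?
grad = sub.module.weight.grad
dist.destroy_process_group()
```

(2 * h) and (3 * h) are just placeholder computations standing in for the tensors that depend on the submodule's outputs.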