Question about single-node multi-GPU setup / DDP questions

hexken · June 1, 2022, 7:39pm

I have a single model that I would like to train on a single node with N gpus. Currently I am using DDP, however, I suspect I may not be using it in an ideal way, or, perhaps a different method of distributing the computation would be better for my use case.

Each process gets a subset of a fixed task set. For each of these tasks, the process uses the model (inside a no_grad context manager) to generate a variable number of training examples which are then used to compute a loss and run a backprop. Each process accumulates gradients for B tasks (recall each task generates a variable number of training examples) before calling optimizer.step().

To reduce synching overhead I am currently computing the losses with model.requires_background_grad_sync=False for all examples except the last example generated from every Bth task.

First, I am wondering if this is even a good use case for DDP, given that I am explicitly turning off the synching and only require it, on average, once every B*(average # of examples generated for each task) calls to backward()?

Second, do I actually need to call forward() AND backward() (with requires_background_grad_sync=True) on each submodule to have that particular submodule synch’d?

Third, is it okay to have requires_background_grad_sync=True, call forward() on a submodule, set requires_background_grad_sync=False immediately after, then call forward() and backward() on some tensors depending on the outputs of this submodule, then set requires_background_grad_sync=True and call forward() and backward() once more on tensors depending on the outputs of the submodule? Will this submodule be properly synch’d, with the gradients accumulated over all of these backward() calls?
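For readers following along, the accumulation pattern described in this post can be sketched roughly as below. This is only an illustration under assumptions: the flag in the post appears to correspond to DDP's `require_backward_grad_sync` attribute, which the public `no_sync()` context manager toggles, and the sketch uses `no_sync()` rather than setting the attribute by hand. The function name `accumulate_then_sync` and the `(inputs, targets)` layout of `task_batches` are hypothetical. The DDP documentation notes that the forward pass should be inside `no_sync()` as well, otherwise gradients are still synchronized.

```python
import contextlib

import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def accumulate_then_sync(ddp_model, optimizer, task_batches, loss_fn):
    """Accumulate gradients locally and all-reduce only on the last example.

    task_batches is a hypothetical list of (inputs, targets) pairs holding
    every example generated from the B tasks of one optimizer step.
    """
    optimizer.zero_grad()
    last = len(task_batches) - 1
    for i, (x, y) in enumerate(task_batches):
        # Skip the all-reduce for every example except the last one.
        # The forward pass is placed inside the context on purpose.
        ctx = ddp_model.no_sync() if i < last else contextlib.nullcontext()
        with ctx:
            loss = loss_fn(ddp_model(x), y)
            loss.backward()
    # By now the locally accumulated gradients have been averaged across ranks.
    optimizer.step()
```

Each rank may pass a different number of examples, but every rank has to reach exactly one synchronized backward() per step, otherwise the collective call will block.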

wanchaol · June 7, 2022, 5:15am

Thanks for posting the question, @hexken. After reading your question, I got lost on what you are trying to do here. What do B and requires_background_grad_sync mean?

In general, I think if your program does not match the DDP pattern (i.e. there’s no data parallelism, since it seems you are generating variable-length data instead of batched data, and you don’t do a gradient sync after every backward), you probably don’t want to go with DDP unless there’s a specific reason. In your case you can just write a multiprocess training script where you manage the data and the gradient sync/update yourself by calling allreduce whenever you need to.
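A minimal sketch of that manual approach, assuming a process group has already been initialized with `torch.distributed.init_process_group` and that every rank holds gradients for the same set of parameters. The helper names (`broadcast_parameters`, `average_gradients`, `train_step`) are hypothetical and not part of PyTorch; only `dist.broadcast`, `dist.all_reduce`, and `dist.get_world_size` are library calls.

```python
import torch
import torch.distributed as dist


def broadcast_parameters(model, src=0):
    """Copy rank `src`'s weights to all ranks so training starts in sync."""
    for p in model.parameters():
        dist.broadcast(p.data, src=src)


def average_gradients(model):
    """Sum each parameter's gradient across ranks, then divide by world size.

    Assumes every rank has a gradient for the same parameters; a mismatch
    would make the collective calls hang.
    """
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is None:
            continue
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad.div_(world_size)


def train_step(model, optimizer, examples, loss_fn):
    """One optimizer step with a single explicit gradient sync."""
    optimizer.zero_grad()
    for x, y in examples:
        # Any number of local backward() calls; gradients accumulate in .grad.
        loss_fn(model(x), y).backward()
    average_gradients(model)  # every rank must reach this point once per step
    optimizer.step()
```

Unlike DDP, nothing here overlaps communication with the backward pass, so this trades some throughput for full control over when the sync happens.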