However, my model runs in a distributed environment (DDP) and has several sub-modules. Holding all of them on the GPU for a forward-backward cycle reduces the batch size I can fit for an update. For any given forward/backward pass, not all of these sub-modules are active, so they do not all need to stay on the GPU: I could load only the ones needed for a single batch and leave the rest in non-GPU memory.
In other words, I have parameters Theta + theta[t] (for t=1…T), where t is a particular task. I want to load only a single theta[t] onto the GPU for a forward and backward pass, so that I can fit larger batches. Currently I’m holding all theta[t] on the GPU.
Is it possible to use the same semantics if it’s the same (sub)-module (theta[t]) to achieve the intention described above?
Hey @jerinphilip, I believe this is possible. You can use Tensor.to(device) to move the parameters to the GPU in the forward pass. The to (i.e., copy) operator should be added to the autograd graph, so the backward pass will compute gradients for the original on-CPU parameters properly. Let me know if it doesn’t work.
Note that, although this can reduce the GPU memory footprint, DDP would still need to communicate the same number of parameters, as that is determined at DDP construction time. And as those parameters are on CPU, you won’t be able to use NCCL, which might cause a considerable slowdown.
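To make the Tensor.to(device) suggestion concrete, here is a minimal single-process sketch (no DDP involved). The OffloadedLinear class, the heads list, and the shapes are hypothetical names made up for illustration; the only thing it relies on is that Tensor.to is differentiable, so gradients land on the CPU-resident parameters. It falls back to CPU when no GPU is available, in which case the copy is a no-op.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class OffloadedLinear(nn.Module):
    """Hypothetical task head (theta[t]) whose parameters live on the CPU."""

    def __init__(self, in_features, out_features):
        super().__init__()
        # Created on the CPU and never moved permanently.
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        # Copy to the input's device only for this pass. The copy is recorded
        # in the autograd graph, so backward sends gradients to the CPU tensors.
        weight = self.weight.to(x.device)
        bias = self.bias.to(x.device)
        return F.linear(x, weight, bias)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
heads = nn.ModuleList([OffloadedLinear(8, 4) for _ in range(3)])  # theta[0..2]
x = torch.randn(16, 8, device=device)

task = 1  # only theta[1] is copied to the device in this step
loss = heads[task](x).sum()
loss.backward()

for t, head in enumerate(heads):
    grad = head.weight.grad
    print(t, head.weight.device.type, None if grad is None else grad.device.type)
```

Only the active head ends up with a gradient, and both the parameter and its gradient stay on the CPU. This says nothing about how DDP buckets or communicates these parameters; it only shows the autograd side of the idea.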
Where do I obtain details corresponding to this particular information? Isn’t only .grad meant to be communicated, with the workers applying the updates individually? If only the theta[t] parameters for the particular task have gradients, would this help the case? I’m reading the Forward Pass section of Internal Design; with find_unused_parameters, it is possible to operate on a subgraph, correct(?). I already have this enabled.
Where do I obtain details corresponding to this particular information?
We need to go through an internal approval process to publicly disclose that paper. It will take some time. For now, https://pytorch.org/docs/master/notes/ddp.html is the best place for an overall intro. The implementation of DDP is linked below:
Isn’t only .grad meant to be communicated and the workers applying the updates individually?
No. Currently, at construction time, DDP creates a mapping from parameters to buckets, and it always communicates all buckets even if some gradients are not used in an iteration. The reason is that process 1 might compute only grad A while process 2 computes only grad B, but the AllReduce operation requires all processes to provide the same set of input tensors. So in this case, both process 1 and process 2 need to communicate grad A and grad B. DDP could use another communication step to first figure out which grads are used globally. However, if it blocks waiting for this signal, there will be no overlap between communication and computation, which could result in >30% slowdown in some cases.
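A toy simulation of the grad A / grad B situation may help. This is plain Python, not DDP's implementation: all_reduce_sum and build_bucket are hypothetical helpers, gradients are single floats, and filling unused slots with zeros is an assumption of this sketch. It only illustrates why a fixed bucket layout forces every process to contribute every slot.

```python
def all_reduce_sum(per_rank_buckets):
    """Toy all-reduce: every rank must supply a bucket with the same layout."""
    lengths = {len(bucket) for bucket in per_rank_buckets}
    if len(lengths) != 1:
        raise ValueError("all ranks must contribute the same set of tensors")
    # Element-wise sum across ranks.
    return [sum(values) for values in zip(*per_rank_buckets)]


def build_bucket(param_names, local_grads):
    """Fill the fixed layout; parameters unused locally contribute zeros."""
    return [local_grads.get(name, 0.0) for name in param_names]


# Fixed once, "at construction time": the layout covers every parameter.
param_names = ["A", "B"]

rank0_grads = {"A": 1.5}   # process 1 only computed grad A
rank1_grads = {"B": -2.0}  # process 2 only computed grad B

buckets = [build_bucket(param_names, g) for g in (rank0_grads, rank1_grads)]
reduced = all_reduce_sum(buckets)
print(dict(zip(param_names, reduced)))  # {'A': 1.5, 'B': -2.0}
```

If each process sent only the gradients it computed locally, the two buckets would describe different parameters and the element-wise reduction would be meaningless, which is why both slots are sent by both processes.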
If my parameters of theta[t] has only gradients for the particular task, would this help the case?
It helps to skip computation but not communication. DDP always communicates all parameters in the model you passed to the DDP constructor.
I’m reading the Forward Pass section of Internal Design; with find_unused_parameters, it is possible to operate on a subgraph, correct(?)
That flag only allows DDP to skip waiting for the grads of those parameters. The communication phase is the same regardless of the value of find_unused_parameters.