Gradient hooks with DataParallel and DDP

Hi everyone,

I’d appreciate some help/suggestions on this issue:
For simplicity, let’s say I have a model with only a Conv2d layer in it. I would like to store the gradient of the Conv2d layer’s weight parameter over time, so I’ve registered a hook on it. When a batch of data is processed and the loss is computed, the backward pass computes the gradients and my hook is triggered. In the hook function, I can store the gradient (for example, in a list).
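Roughly, a minimal version of what I have looks like this (layer sizes and the input shape are just placeholders):

import torch
import torch.nn as nn

# toy model: a single Conv2d layer
model = nn.Conv2d(3, 8, kernel_size=3, padding=1)

stored_grads = []  # gradients of the weight, collected over time

def my_gradient_hook(grad):
    # called during backward with the gradient w.r.t. the weight
    stored_grads.append(grad.detach().clone())

model.weight.register_hook(my_gradient_hook)

x = torch.randn(4, 3, 32, 32)
loss = model(x).sum()
loss.backward()  # triggers the hook once per backward pass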

Now my problem is to do the same thing, but with a DataParallel model. If my hook looks like this:

def my_gradient_hook(grad):
    # print which device this gradient was computed on
    print(grad.device)

the only print I see is cuda:0. What happened to the prints/hooks on the other GPUs, given that each GPU should run its backward pass in parallel and compute gradients?
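For reference, this is roughly how I run it (layer sizes and input shape are again placeholders; I assume several visible GPUs):

import torch
import torch.nn as nn

def my_gradient_hook(grad):
    print(grad.device)

# the original model lives on cuda:0; DataParallel replicates it to the other GPUs on each forward
model = nn.Conv2d(3, 8, kernel_size=3, padding=1).cuda()
model.weight.register_hook(my_gradient_hook)

dp_model = nn.DataParallel(model)

x = torch.randn(8, 3, 32, 32).cuda()
dp_model(x).sum().backward()  # in my runs this prints only "cuda:0"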

If I try the same thing with DDP, I get the desired behavior: the prints I see are cuda:0, cuda:1, cuda:2.
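The DDP version is roughly this (the spawning/setup is simplified; the address, port, and backend are placeholders for my actual setup):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = nn.Conv2d(3, 8, kernel_size=3, padding=1).to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    # each process has its own replica, so the hook fires once per process
    ddp_model.module.weight.register_hook(lambda grad: print(grad.device))

    x = torch.randn(4, 3, 32, 32, device=rank)
    ddp_model(x).sum().backward()  # one print per rank: cuda:0, cuda:1, cuda:2, ...

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)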

I guess it has something to do with multiprocessing, since DDP spawns multiple processes while DataParallel uses a single process with multiple threads. Should I somehow gather the gradients from the different threads in the hook function?

Hi,

It seems this question has come up before, and the limitation is intrinsic to DataParallel; see e.g. Yanli_Zhao’s answer here:

Also see the warnings here.

Hi, thanks a lot for the link. I have already seen that warning and discussion, and (if I understood correctly) they focus on updating parameters/buffers on the non-master GPUs. I understand that all such modifications are discarded when the replicas are destroyed, and I’m fine with that. My goal is not to update any state; I just want to print the gradients on each device.