Gradient hooks with DataParallel and DDP

Hi everyone,

I’d appreciate some help/suggestions on this issue:
For simplicity, let’s say I have a model with only a Conv2d layer in it. I would like to store the gradient of the Conv2d layer’s weight parameter over time, so I’ve registered a hook on it. When a batch of data is processed and the loss is computed, the backward pass starts the gradient computation and my hook is triggered. In the hook function, I can store the gradient (for example, in a list).
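Concretely, the single-GPU setup I have in mind looks roughly like this (a minimal sketch; the names are mine):

```python
import torch
import torch.nn as nn

grads = []  # gradients collected over time

model = nn.Conv2d(3, 8, kernel_size=3)

def my_gradient_hook(grad):
    # called during backward with the gradient of model.weight;
    # prints e.g. cuda:0 on GPU, cpu on CPU
    print(grad.device)
    grads.append(grad.detach().clone())

model.weight.register_hook(my_gradient_hook)

# one step: the forward builds the graph, backward fires the hook
x = torch.randn(2, 3, 16, 16)
model(x).sum().backward()
```

After the backward pass, `grads` holds one tensor with the same shape as the weight.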

Now my problem is doing the same thing with a DataParallel model. If my hook looks like this:

def my_gradient_hook(grad):
    print(grad.device)

the only print that I see is cuda:0. What happened to the prints/hooks on the other GPUs? Each GPU should run a backward pass in parallel and compute its own gradients.

If I try the same thing with DDP, I get the desired behavior. The prints I see are: cuda:0, cuda:1, cuda:2.
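For reference, a minimal sketch of the DDP setup I mean (all names are mine; it uses the gloo backend so it also runs on CPU, where each rank reports cpu instead of cuda:N):

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank, world_size, queue):
    # One process per rank; gloo so the sketch works without GPUs.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = DDP(nn.Conv2d(3, 8, kernel_size=3))
    seen = []
    # The hook fires once in every process, so every rank reports its device.
    model.module.weight.register_hook(
        lambda grad: seen.append((rank, str(grad.device))))

    x = torch.randn(2, 3, 16, 16)
    model(x).sum().backward()

    queue.put(seen)
    dist.destroy_process_group()


def run(world_size=2):
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    mp.spawn(worker, args=(world_size, queue), nprocs=world_size)
    # One report per rank, e.g. [(0, 'cpu'), (1, 'cpu')] on CPU.
    reports = [item for _ in range(world_size) for item in queue.get()]
    return sorted(reports)


if __name__ == "__main__":
    print(run())
```

With nccl and one GPU per rank, the same pattern reports cuda:0, cuda:1, and so on.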

I guess it has something to do with multiprocessing, since DDP spawns multiple processes while DataParallel uses a single process with multiple threads. Should I somehow gather the gradients from the different threads in the hook function?


It seems this question has come up before, and the limitation is intrinsic to DataParallel; see, e.g., Yanli_Zhao’s answer here:

Also see the warnings here:

Hi, thanks a lot for the links. I had already seen that warning and discussion, and (if I understood correctly) they focus on updating parameters/buffers on the non-master GPUs. I understand that all such modifications are discarded when the replicas are destroyed, and I’m fine with that. My goal is not to update any state; I just want to print the gradients on each device.