I’d appreciate some help/suggestions on this issue:
For simplicity, let’s say I have a model with only a Conv2d layer in it. I would like to store the gradient of the Conv2d layer’s weight parameter over time, so I’ve registered a hook on it. When one batch of data is processed and the loss is computed, the backward pass starts the gradient computation and my hook is triggered. In the hook function, I can store the gradient (for example, in a list).
Now my problem is to do the same thing, but with a DataParallel model. If my hook looks like this:
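Something like this sketch (reconstructed from the description, since the exact snippet wasn’t preserved; variable names are assumed):

```python
import torch
import torch.nn as nn

# Minimal model: a single Conv2d layer.
model = nn.Conv2d(3, 8, kernel_size=3)

stored_grads = []  # gradients collected over time

def hook(grad):
    # Fires during the backward pass, once the weight's gradient is ready.
    # On a single GPU this prints e.g. cuda:0; on CPU it prints cpu.
    print(grad.device)
    stored_grads.append(grad.detach().clone())

model.weight.register_hook(hook)

out = model(torch.randn(4, 3, 16, 16))
out.sum().backward()
```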
the only print that I see is cuda:0. What happened to the prints/hooks on the other GPUs? Each GPU should run its backward pass in parallel and compute gradients.
If I try the same thing with DDP, I get the desired behavior: the prints I see are cuda:0, cuda:1, cuda:2, …
I guess it has something to do with multiprocessing, since DDP spawns multiple processes while DataParallel uses a single process with multiple threads. Should I somehow gather the gradients from the different threads in the hook function?
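For reference, the DDP case looks roughly like this sketch (using the gloo backend on CPU so it runs anywhere; the launch details are my assumption). Because each rank is a separate process with its own replica, the hook fires once per rank:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size):
    # Each rank is a separate process holding its own model replica.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = nn.Conv2d(3, 8, kernel_size=3)
    ddp_model = DDP(model)

    grads = []

    def hook(grad):
        # Fires in every process, so every rank prints its own device.
        print(f"rank {rank}: grad on {grad.device}")
        grads.append(grad.detach().clone())

    ddp_model.module.weight.register_hook(hook)

    out = ddp_model(torch.randn(4, 3, 16, 16))
    out.sum().backward()

    dist.destroy_process_group()
    return grads

# Launch with e.g.:
# torch.multiprocessing.spawn(run, args=(world_size,), nprocs=world_size)
```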
Hi, thanks a lot for the link. I have already seen that warning and discussion, and (if I understood correctly) they focus on updating parameters/buffers on the non-master GPUs. I understand that all modifications will be discarded when those replicas are destroyed, and I’m fine with that. My goal is not to update any state; I just want to print the gradients on each device.