How to collect all the gradients from multiple GPUs



I am trying to run this code:

to visualize saliency maps from a ResNet. The code was written for the CPU; I modified it slightly to move the model and all tensors to CUDA so it runs on GPUs. The main result I care about is on line 65 (self.gradients).

The code works fine on a single GPU. However, when I run it on multiple GPUs with an input of size 64x3x32x32 (CIFAR-10 images), the result I get is 16x3x32x32 (it should be 64x3x32x32).

To me, the problem seems to be on line 35: the register_backward_hook function collects the gradients from only the last GPU instead of from all of them.

Am I doing something wrong, or is this a known PyTorch bug? If so, is there a workaround?

Thank you very much!


Here in your code you’re setting

def hook_function(module, grad_in, grad_out):
    self.gradients = grad_in[0]  # plain assignment: each call overwrites the previous value

I think this hook fires once on each GPU's replica, so in the end you only keep the last replica's piece, i.e. one-fourth of what you should have gotten (assuming 4 GPUs).

You can try defining self.gradients as a Python list and appending to it, then concatenating the pieces after the backward pass:

def hook_function(module, grad_in, grad_out):
    self.gradients.append(grad_in[0])
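To make the fix concrete, here is a minimal self-contained sketch. The model, class name, and shapes are made up for illustration, and it uses register_full_backward_hook (the non-deprecated variant of the hook) on a single device; the point is the list-based collection, which under nn.DataParallel would receive one append per replica.

```python
import torch
import torch.nn as nn

class SaliencyHook:
    """Collect the gradient w.r.t. the hooked layer's input.

    Uses a list so that, under nn.DataParallel, the hook firing once
    per GPU replica appends each piece instead of overwriting it.
    """
    def __init__(self, model):
        self.model = model
        self.gradients = []  # one entry per hook invocation (per replica)
        first_layer = model[0]  # hypothetical: hook the first conv layer
        first_layer.register_full_backward_hook(self.hook_function)

    def hook_function(self, module, grad_in, grad_out):
        self.gradients.append(grad_in[0])

# stand-in model with CIFAR-10-sized input
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8 * 32 * 32, 10),
)
saliency = SaliencyHook(model)

x = torch.randn(64, 3, 32, 32, requires_grad=True)
model(x).sum().backward()

# On N GPUs the list would hold N chunks; concatenate to recover the batch.
grads = torch.cat(saliency.gradients, dim=0)
print(grads.shape)  # torch.Size([64, 3, 32, 32])
```

On a single device the list has one entry, so the concatenation is a no-op; on 4 GPUs it would stitch four 16x3x32x32 chunks back into 64x3x32x32.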

Hi @richard, Thanks a lot for your help!

Yes, I tested your method and it is working perfectly! Thanks a lot!

I have another question. When I run the code on images with batch size 128, the GPU memory blows up and I get an out-of-memory error. Do you have any suggestions?

Thanks again!


Other than shrinking the batch size, not really, sorry. Maybe someone else can weigh in here about how to better work with OOMs.
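One mitigation that sometimes helps here, sketched below with a hypothetical saliency_in_chunks helper and a stand-in model: compute the saliency in smaller sub-batches and move each result to the CPU immediately, so the GPU only holds one chunk's activations and gradients at a time rather than the whole batch's.

```python
import torch
import torch.nn as nn

def saliency_in_chunks(model, images, chunk_size=32):
    """Hypothetical helper: compute input gradients sub-batch by
    sub-batch, offloading each result to the CPU so peak GPU memory
    is bounded by one chunk rather than the full batch."""
    grads = []
    for start in range(0, images.size(0), chunk_size):
        chunk = images[start:start + chunk_size].clone().requires_grad_(True)
        model(chunk).sum().backward()
        # detach + move to CPU releases the GPU copy (and its graph) promptly
        grads.append(chunk.grad.detach().cpu())
    return torch.cat(grads, dim=0)

# stand-in model; on real hardware you would move model and chunks to CUDA
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
images = torch.randn(128, 3, 32, 32)
sal = saliency_in_chunks(model, images)
print(sal.shape)  # torch.Size([128, 3, 32, 32])
```

Since saliency maps only need gradients w.r.t. the input, each sub-batch's backward pass is independent, so chunking changes peak memory but not the result.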