Multi GPU backwards hook on wrong device

I have a backward hook registered with newLayer.register_backward_hook(hook_function), and I do not know how to control the inputs to it. The function contains a line like this:

def hook_function(self, grad_input, grad_output):
    self.average = self.average * 0.99 + grad_output[0].sum((0,2,3)) * 0.01

This results in:

RuntimeError: expected device cuda:0 but got device cuda:1
(Pdb) self.average
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0',
       dtype=torch.float64)
(Pdb) grad_output[0].sum((0,2,3))
tensor([ 0.0508,  0.0492,  0.0512,  0.0487,  0.0517,  0.0483,  0.0522,  0.0479,
        -0.1974, -0.2026], device='cuda:1', dtype=torch.float64)

I need to know either how to define self.average better, or whether I am doing something wrong somewhere else and it is grad_output that ends up on the wrong device.

The above is with:

self.average = torch.Tensor(out_channels).zero_().to(gf.device).double()

I have also tried

self.average = nn.Parameter(torch.Tensor(out_channels).zero_().to(gf.device).double())

which results in

TypeError: cannot assign 'torch.cuda.DoubleTensor' as parameter 'normalPassAverageD' (torch.nn.Parameter or None expected)

at the same location.

I don’t know how self.average is initialized, but would assume this should work:

self.average = self.average.to(grad_output[0].device) * 0.99 + grad_output[0].sum((0,2,3)) * 0.01

Could you try it and see whether you still get an error?
In that case, could you post a small code snippet to reproduce this issue?
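
The suggested line can be tried in isolation as a minimal sketch. The helper name update_running_average and the tensor shapes here are made up for illustration, and the snippet runs on CPU, so it only shows the mechanics of aligning devices before mixing the two tensors:

```python
import torch

def update_running_average(average, grad, momentum=0.99):
    # Hypothetical helper, not from the original post.
    # Sum over batch, height and width, leaving one value per channel.
    per_channel = grad.sum((0, 2, 3))
    # Move the stored average to wherever the gradient lives before mixing them.
    average = average.to(per_channel.device)
    return average * momentum + per_channel * (1.0 - momentum)

average = torch.zeros(10, dtype=torch.float64)
grad = torch.ones(4, 10, 3, 3, dtype=torch.float64)
average = update_running_average(average, grad)
```

Note that .to() returns a copy whenever the devices differ, so each replica may end up working on its own copy of the average.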

That worked! Thanks!

But this is all within a DataParallel. Isn’t that supposed to make it play nice with multiple devices? I would have thought calling .to(a specific device) would limit it to only using that device. Maybe I don’t understand how DataParallel works entirely.

It’s hard to tell what’s causing the issue without seeing the code, but generally you are right. nn.DataParallel should take care of pushing the parameters and buffers to the right devices. However, if self.average was defined as a plain tensor attribute (neither a parameter nor a buffer), you would run into this error.

It was in fact defined as a plain tensor. Is there another way I should define tensors on self so that they are handled by DataParallel?

If this tensor does not need to be trained (i.e. no gradients should be calculated for it), you should define it as a buffer via: self.register_buffer('average', torch.tensor(...)). This makes sure that nn.DataParallel will push the buffer to the corresponding device. In your forward method you can access it via self.average.
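
As a rough sketch of the difference, the module below registers one tensor as a buffer and keeps another as a plain attribute. The class name TrackedConv and the attribute name plain are hypothetical, chosen only for this example:

```python
import torch
import torch.nn as nn

class TrackedConv(nn.Module):  # hypothetical module, for illustration only
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        # Buffer: moved by .to()/.cuda(), saved in state_dict, not a parameter.
        self.register_buffer("average", torch.zeros(out_channels, dtype=torch.float64))
        # Plain attribute: the module machinery does not track it at all.
        self.plain = torch.zeros(out_channels, dtype=torch.float64)

    def forward(self, x):
        return self.conv(x)

layer = TrackedConv(3, 10)
buffer_names = [name for name, _ in layer.named_buffers()]
param_names = [name for name, _ in layer.named_parameters()]
state_keys = list(layer.state_dict().keys())
```

Here buffer_names contains only average, while plain shows up in neither the buffers nor the state dict, which is why nothing moves it to another device.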

Cool! That worked for all of the tensors whose values I modify, without needing the .to() call. However, later in the same function there is one more tensor that I just save directly with:

self.current = grad_output[0].detach()

This works fine on one GPU, but when running on multiple GPUs it tells me: AttributeError: (module) object has no attribute 'current'. I imagine I could do a .copy_() or something, but I won’t actually know the shape until the network has been run on an image. Thoughts? I can also start a new thread if that would be appropriate.

Manually manipulating model attributes when running a data parallel approach is a bit tricky, since each device uses a replica of the model and your changes could be lost.
Feel free to create a new topic and describe your use case further.
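
One pattern that may help, sketched here on CPU only: update the registered buffer in place instead of rebinding the attribute. As far as I understand the nn.DataParallel documentation, only the replica on device[0] shares storage with the wrapped module, so in-place updates made on the other replicas would still be dropped. The sketch uses register_full_backward_hook, the newer replacement for register_backward_hook, which is an assumption about your PyTorch version:

```python
import torch
import torch.nn as nn

def hook_function(module, grad_input, grad_output):
    # One value per channel, cast to the buffer's dtype.
    per_channel = grad_output[0].detach().sum((0, 2, 3)).to(module.average.dtype)
    # In-place ops mutate the existing buffer's storage instead of
    # rebinding the attribute to a brand-new tensor.
    module.average.mul_(0.99).add_(per_channel, alpha=0.01)

layer = nn.Conv2d(3, 10, kernel_size=3, padding=1)
layer.register_buffer("average", torch.zeros(10, dtype=torch.float64))
layer.register_full_backward_hook(hook_function)

buffer_before = layer.average  # keep a handle to check identity afterwards
x = torch.randn(4, 3, 8, 8, requires_grad=True)
layer(x).sum().backward()
```

After the backward pass, layer.average is still the same tensor object as buffer_before, just with updated values.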

Finally back to this after debugging other things. Apparently the .to() call made it run, but it doesn’t produce the same values, which I imagine is because the updated numbers are not actually being kept. I’m using a registered buffer as well. I started a thread with all my current code, please take a look when you get a chance: Multi GPU Hook not correctly filling buffer

I now have another thread that includes a full code sample.