Question about backward() in forward() on multi-gpus

I have a network like this:

class Net(nn.Module):
    def __init__(self, subnet):
        super().__init__()
        self.subnet = subnet

    def forward(self, input):
        x = self.subnet(input)
        loss = x.mean()
        loss.backward()  # backward() called inside forward()
        return loss

If I run such code on multiple GPUs via DataParallel, are the backward() calls and the subnet optimizers on different GPUs independent, or will they influence each other?
For example, backward() will be called four times if the code runs on 4 GPUs. If they are not independent, I might get gradients that are 4 times too large.
I think they are independent, but I'm not sure. I would appreciate any reply.


If you don’t run this inside DataParallel, it will work fine.
The thing is that DataParallel has a lot of logic for getting the different copies onto the different devices and for syncing the gradients.
In particular, I think it will accumulate the gradients of your subnet automatically.
You can check this by using simple functions with known gradients (like addition, which just passes the incoming gradient through unchanged).
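A minimal single-process sketch of that check, using a function whose gradient is known exactly (the `Probe` module and its shapes here are my own illustration, not code from the original post). Since `loss = (w * inp).sum()` has `d(loss)/dw = inp`, we can see both that the gradient is correct after one forward, and that a second forward accumulates into `.grad` — which is the kind of doubling you would get if replicas were not independent:

```python
import torch

class Probe(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.w = torch.nn.Parameter(torch.ones(3))

    def forward(self, inp):
        loss = (self.w * inp).sum()  # d(loss)/dw = inp, a known gradient
        loss.backward()              # backward() inside forward()
        return loss.detach()

m = Probe()
m(torch.tensor([1.0, 2.0, 3.0]))
print(m.w.grad)  # tensor([1., 2., 3.]) -> the known gradient is recovered

# A second forward/backward accumulates into .grad rather than
# replacing it, so the values double:
m(torch.tensor([1.0, 2.0, 3.0]))
print(m.w.grad)  # tensor([2., 4., 6.])
```

If DataParallel were summing the per-replica gradients on top of this, you would see a similar scaling in the `.grad` of the original parameters.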

According to my test, gradients on different GPUs don’t seem to affect each other.
But I found another strange phenomenon: the backward hooks registered on tensors of the subnet are sometimes missed (not called). I checked this with the following code:

class Subnet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(10, 10)  # placeholder layer; not shown in the original post
        self.hook_flags = [0, 0, 0, 0]   # run on 4 GPUs

    def forward(self, input):
        device_id = input.device.index
        # The flag is decreased by 1 in the hook and increased by 1 in
        # forward, so it should be back to 0 here if the hook ran.
        assert self.hook_flags[device_id] == 0
        self.hook_flags[device_id] += 1
        x = self.layer1(input)
        x.register_hook(lambda grad: self.hook(grad, device_id))
        return x

    def hook(self, grad, device_id):
        self.hook_flags[device_id] -= 1

The assertion fails occasionally, so I think the hooks are sometimes missed. I can’t understand it. Is my code wrong, or are the hooks really missed? I would appreciate your help.
Note that backward() inside forward() is called 4 times (once per GPU), as my first post showed.
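For reference, here is a single-device sanity check of the hook pattern itself (my own minimal example, not taken from the post): a hook registered on an intermediate tensor before backward() should fire exactly once per backward pass, receiving the gradient of the loss with respect to that tensor:

```python
import torch

calls = []
x = torch.ones(3, requires_grad=True)
y = x * 2
# Register the hook on the intermediate tensor before backward().
y.register_hook(lambda grad: calls.append(grad.clone()))
y.sum().backward()

print(len(calls))  # 1 -> the hook fired exactly once
print(calls[0])    # tensor([1., 1., 1.]) -> d(sum)/dy is all ones
```

On a single device this behaves deterministically, so any occasional misses would point at something specific to the DataParallel replication path rather than at `register_hook` itself.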

You do register the hook before calling .backward(), right?
Do you have a full code sample I can run locally to reproduce this please?

Hi all,

I also have a network that performs a backward pass inside the forward pass, although it doesn’t take an optimizer step.
My issue is somewhat unrelated, but this is the closest example I have found online. I’m glad to open a new discussion if you think that would be best.

My issue is the following: I seem to have noticed an inefficiency in my forward pass when running on multiple GPUs. The GPUs suddenly run at low utilization, which doesn’t happen on a single GPU and doesn’t happen for models that don’t use a backward pass inside the forward pass.

I would like to know whether some syncing could be going on when using backward(), making multi-GPU runs less efficient.

I am currently trying to put together a reproducible example, but it’s not that easy.
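As a starting point for a repro, a hedged sketch of the backward-inside-forward pattern that wraps in DataParallel when multiple GPUs are present and falls back to a single device otherwise (the module, layer sizes, and batch shape here are all placeholders I chose for illustration):

```python
import torch
import torch.nn as nn

class BackwardInForward(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(4, 4)

    def forward(self, inp):
        loss = self.layer(inp).mean()
        loss.backward()       # backward() inside forward()
        return loss.detach()

model = BackwardInForward()
if torch.cuda.is_available() and torch.cuda.device_count() > 1:
    # Each replica runs its own backward; DataParallel handles
    # the replication and gradient syncing.
    model = nn.DataParallel(model.cuda())

inp = torch.randn(8, 4)
if torch.cuda.is_available() and torch.cuda.device_count() > 1:
    inp = inp.cuda()
out = model(inp)
print(out)  # on a single device: a scalar loss tensor
```

Profiling a loop over this forward on multi-GPU versus single-GPU (e.g. with `torch.profiler`) should show whether the backward calls introduce extra synchronization.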