I have a network like this (simplified; `x = self.subnet(input)` runs in `forward`):

```python
def __init__(self, subnet):
    self.subnet = subnet
    ...

def forward(self, input):
    x = self.subnet(input)
    loss = x.mean()
    ...
```
If I run such code on multiple GPUs with DataParallel, are the `optimizer_of_subnet` instances on the different GPUs independent? Will they influence each other?
For example, backward() will be called four times if the code runs on 4 GPUs. If the calls are not independent, I might get gradients that are 4 times larger.
I think they are independent, but I'm not sure. I would appreciate any reply.
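For reference, the 4x worry is easy to demonstrate outside DataParallel: repeated backward() calls into the same leaf tensor do accumulate in `.grad`. A minimal single-device sketch:

```python
import torch

x = torch.ones(2, requires_grad=True)

# Calling backward() four times without zeroing the gradient
# accumulates into x.grad -- the 4x scaling the question worries about.
for _ in range(4):
    (x * 3).sum().backward()

print(x.grad)  # tensor([12., 12.])
```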
If you don’t run inside DataParallel, this will work fine.
The thing is that DataParallel has a lot of logic for getting the different copies onto different devices and for syncing the gradients.
In particular, I think it will accumulate the gradients of your subnet automatically.
You can check this by using simple functions with known gradients (like an addition that just passes the original gradient backward).
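As a sketch of that check (single-device here, not run inside DataParallel): a custom autograd Function whose backward just passes the incoming gradient through gives you a known gradient, so any accumulation across replicas would show up as a scaled `x.grad`.

```python
import torch

class AddOne(torch.autograd.Function):
    """y = x + 1, so the gradient of y w.r.t. x is exactly the
    incoming gradient -- a known value that is easy to check."""

    @staticmethod
    def forward(ctx, x):
        return x + 1

    @staticmethod
    def backward(ctx, grad_output):
        # just pass the original gradient backward, unchanged
        return grad_output

x = torch.ones(4, requires_grad=True)
AddOne.apply(x).sum().backward()
print(x.grad)  # tensor([1., 1., 1., 1.]) -- no unexpected scaling
```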
According to my test, gradients on different GPUs don’t seem to affect each other.
But I found another strange phenomenon: the backward hooks registered on tensors of the subnet are sometimes missed (not called). I checked this with the following code:
```python
self.hook_flags = [0, 0, 0, 0]  # run on 4 GPUs

def forward(self, input):
    device_id = input.device.index
    # the flag is decreased by 1 in the hook and increased by 1 in
    # forward, so it should be back to 0 here
    assert self.hook_flags[device_id] == 0
    self.hook_flags[device_id] += 1
    x = self.layer1(input)
    x.register_hook(lambda grad: self.hook(grad, device_id))
    ...

def hook(self, grad, device_id):
    self.hook_flags[device_id] -= 1
```
The assertion fails occasionally, so I think hooks are missed in some cases. I can’t understand why. Is my code wrong, or are hooks really missed? I would appreciate your help.
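For comparison, outside DataParallel a hook registered during the forward pass reliably fires once per backward. A minimal single-device version of the flag check (hypothetical names, no DataParallel involved):

```python
import torch

calls = []

x = torch.ones(3, requires_grad=True)
y = x * 2
# the hook fires when y's gradient is computed during backward()
y.register_hook(lambda grad: calls.append(grad.clone()))
y.sum().backward()

print(len(calls))  # 1 -- the hook was called exactly once
print(x.grad)      # tensor([2., 2., 2.])
```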
forward() is called 4 times (once per GPU), as my first post showed.
You do register the hook before the backward runs, so it should be called.
Do you have a full code sample I can run locally to reproduce this, please?
I also have a network which performs a backward in the forward pass, although it doesn’t use an optimizer step.
My issue is a bit unrelated, but this is the closest example I have found online. I’m glad to open a new discussion if you think that would be best.
My issue is the following: I seem to have noticed an inefficiency in my forward pass when running on multiple GPUs. The GPUs suddenly run at low utilization, which doesn’t happen on a single GPU and doesn’t happen for models which don’t use a backward in the forward pass.
I would like to know if some syncing could be going on when calling backward, therefore making multi-GPU less efficient.
I am currently trying to find a reproducible example, but it’s not that easy.
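In case it helps narrow things down, here is a single-device skeleton of the backward-in-forward pattern being discussed (a sketch with made-up module names, not the actual model):

```python
import torch
import torch.nn as nn

class BackwardInForward(nn.Module):
    """Runs backward() on an internal loss during its own forward pass.
    Under DataParallel this backward runs once per replica, which is
    where extra synchronization could plausibly appear."""

    def __init__(self, subnet):
        super().__init__()
        self.subnet = subnet

    def forward(self, inp):
        x = self.subnet(inp)
        loss = x.mean()
        loss.backward()  # backward inside the forward pass
        return x.detach()

net = BackwardInForward(nn.Linear(4, 2))
out = net(torch.randn(8, 4))
print(out.shape)  # torch.Size([8, 2])
```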