Coming from "Multi GPU backwards hook on wrong device": that fix stopped the error, but I am still not getting the same value with 1 GPU and with 2 GPUs. I was told in "Initializing a member tensor after creation with DataParallel (repost)" that I shouldn't be using register_backward_hook at all, so now I am doing it with a custom function. I think this is all the relevant code.
My value tracker looks like this:
class valueTracker(nn.Module):
    def __init__(self, out_channels):
        super(valueTracker, self).__init__()
        self.register_buffer('average', torch.zeros(out_channels, device=gf.device, dtype=torch.double))
Then in my module:
# in __init__
self.values = nn.ModuleList([])
self.values.append(valueTracker(self.out_channels))
...
def forward(self, x):
    ...
    out = saveAverageD(out, self.values)
    return out
And the original backward hook, which is now a function:
def saveAverageD(inp, Values):
    class Saver(torch.autograd.Function):
        @staticmethod
        def forward(ctx, inp):
            return inp

        @staticmethod
        def backward(ctx, grad_out):
            # During n phase only one set of values needs to be tracked, so save in 0 even if there are multiple candidates
            with torch.no_grad():
                Values[0].average = Values[0].average * 0.99 + grad_out.sum((0,2,3)) * 0.01
                # I also tried: Values[0].average = Values[0].average * 0.99 + grad_out.to(Values[0].average.device).sum((0,2,3)) * 0.01
            # Pass the gradient through unchanged so the layers before this one still receive it
            return grad_out

    return Saver.apply(inp)
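For anyone who wants to try the pattern in isolation, here is a minimal single-device sketch of the same idea that runs on CPU. It is not the full module: `ValueTracker` and `save_average` are simplified stand-ins for the code above, the buffer is created without `gf.device`, and `forward` returns `x.view_as(x)` instead of the input itself, which is an assumption made only to keep the example unambiguous for autograd.

```python
import torch
import torch.nn as nn


class ValueTracker(nn.Module):
    """Simplified stand-in: holds a per-channel running average."""

    def __init__(self, out_channels):
        super().__init__()
        self.register_buffer("average", torch.zeros(out_channels, dtype=torch.double))


def save_average(inp, tracker):
    class Saver(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            # Identity in the forward pass (returned as a view of the input).
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_out):
            # Side effect: fold this gradient into the running average.
            with torch.no_grad():
                tracker.average = tracker.average * 0.99 + grad_out.sum((0, 2, 3)) * 0.01
            # Identity in the backward pass as well.
            return grad_out

    return Saver.apply(inp)


tracker = ValueTracker(out_channels=3)
x = torch.randn(4, 3, 5, 5, requires_grad=True)

out = save_average(x, tracker)
out.sum().backward()  # grad_out is all ones here

print(tracker.average)
print(x.grad.shape)
```

With `out.sum()` as the loss, every element of `grad_out` is 1, so each channel sums to 4 * 5 * 5 = 100 and the average should come out at roughly 1.0 per channel, while `x.grad` is passed through untouched.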
I've written a thorough value checker that runs over 10 epochs and prints every value I can think of. With 1 GPU this works exactly the same way as the register_backward_hook method. But with 2 GPUs the one value that changes is this average, which does not stay consistent with the 1 GPU run. My value checker prints the weights at every batch and they stay the same, which means the gradients being calculated must be the same.
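One piece of arithmetic that may be worth ruling out, sketched on CPU only. It assumes (unverified here) that with 2 GPUs the backward runs once per replica, each on its own half of the batch; it does not reproduce what DataParallel actually does with the buffer. The helper `ema_update` is a hypothetical name for the update line from the post. The point is that gradient sums over two half batches add up to the full-batch sum, but applying the 0.99 / 0.01 update twice is not the same as applying it once.

```python
import torch


def ema_update(average, grad_out):
    # Same update rule as in the backward above.
    return average * 0.99 + grad_out.sum((0, 2, 3)) * 0.01


torch.manual_seed(0)
grad = torch.randn(8, 3, 4, 4, dtype=torch.double)  # stand-in for grad_out
start = torch.ones(3, dtype=torch.double)           # stand-in for the current average

# One update with the full batch (the 1 GPU case).
once = ema_update(start, grad)

# Two updates, one per half batch (the assumed 2 GPU case).
first_half, second_half = grad.chunk(2, dim=0)
twice = ema_update(ema_update(start, first_half), second_half)

# The per-half sums still add up to the full sum...
halves_sum = first_half.sum((0, 2, 3)) + second_half.sum((0, 2, 3))
print(torch.allclose(halves_sum, grad.sum((0, 2, 3))))

# ...but the running averages differ, because the old value is decayed twice.
print(once)
print(twice)
```

If something like this is what happens, identical weights and a differing average would not contradict each other, since the weight gradients are a plain sum while the running average depends on how many times the update is applied.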