I was hoping to print and manually verify the gradient of intermediate layer parameters when using DataParallel. An example is below:
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.func = nn.Linear(3, 3, bias=False)
        self.func2 = nn.Linear(3, 3, bias=False)

    def forward(self, x):
        z = self.func(x)
        # I want the gradient of self.func2.weight here. I can
        # get it when using a single GPU, but not in a multi-GPU setting...
        z = self.func2(z)
        return z

net = Model()
para_net = nn.DataParallel(net)
xx = torch.randn(2, 3).requires_grad_()
yy = para_net(xx)
loss = yy.mean()  # just to produce a scalar
loss.backward()
Everything works fine when I'm using a single GPU: for example, I can intercept with a torch.autograd.Function and manually modify the contents of self.func2.weight.grad. However, once I use multiple GPUs by setting CUDA_VISIBLE_DEVICES=0,1, I can no longer access or modify it — if I intercept at the same point, the printed self.func2.weight.grad is None.
It'd be great if someone could help me resolve this issue or point me to a solution!