Torch.no_grad() with DDP

Hi,

I tried to use torch.no_grad() with DDP, but it throws the following error:

This error indicates that your module has parameters that were not used in producing its output (the return value of forward). You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel

The pseudocode is as follows:


import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Linear(256, 256)  # frozen part, only ever run under no_grad

    def forward(self, x):
        with torch.no_grad():
            x = self.layers(x)
        return x

class WholeFlow(nn.Module):
    def __init__(self):
        super().__init__()
        self.f = MyModel()
        self.g = nn.Linear(256, 256)  # trainable part

    def forward(self, x):
        x = self.f(x)
        x = self.g(x)
        return x

model = WholeFlow()
optimizer = torch.optim.SGD(model.g.parameters(), ...)  # only g is optimized
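
For completeness, this is roughly how I wrap and train it (a minimal sketch; the process-group setup is omitted, and rank here stands for the local GPU index):

from torch.nn.parallel import DistributedDataParallel as DDP

device = f"cuda:{rank}"
model = WholeFlow().to(device)
ddp_model = DDP(model, device_ids=[rank])
optimizer = torch.optim.SGD(model.g.parameters(), lr=0.01)  # placeholder lr

for _ in range(2):
    out = ddp_model(torch.randn(8, 256, device=device))
    loss = out.sum()
    # f's parameters are registered with DDP but never enter the autograd graph
    # (f runs under no_grad), so the reducer flags them as unused and raises
    # the error quoted above.
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()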

There is a similar issue here: DDP does not work well with `torch.no_grad()` in 1.2 · Issue #6087 · PyTorchLightning/pytorch-lightning · GitHub

It works with DataParallel, but not with DDP.

Any idea?


Did you try to add the suggested find_unused_parameters=True argument and if so, did you get any other error?

Adding find_unused_parameters=True works, but is this a bug in DDP?
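
For reference, I just pass it in the DDP constructor (sketch of the relevant change, same setup as above):

ddp_model = torch.nn.parallel.DistributedDataParallel(
    model,
    device_ids=[rank],
    find_unused_parameters=True,  # DDP now detects f's parameters as unused and no longer errors out
)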


@ptrblck I ran into a similar problem. I have code like the following, where only the last frame runs backprop and computes the loss, but all frames use the same parameters.
It works with find_unused_parameters=True, but the weights are not updated. I need the weights to be updated by the last frame's backward pass. So is using detach the only option for this case (see the sketch after the code below)?

import numpy as np
import torch
import torch.nn as nn


class DummyDataset(torch.utils.data.Dataset):

    def __init__(self, seq_len=5):
        self.seq_len = seq_len

    def __len__(self):
        return 200

    def __getitem__(self, idx):
        # a random sequence of `seq_len` frames with 10 features each, plus a scalar target
        return np.random.rand(self.seq_len, 10).astype(np.float32), np.random.rand(1).astype(np.float32)


class DummyNet(nn.Module):

    def __init__(self):
        super().__init__()
        self.mlp_fea = nn.Linear(10, 10)
        self.mlp_out = nn.Linear(10, 1)

    def forward(self, inputs, labels, training=False):
        inputs = inputs.cuda()
        labels = labels.cuda()

        B, SEQ, C = inputs.shape
        print(self.mlp_fea.weight)
        for i in range(SEQ):
            x = inputs[:, i]
            if i == 0:
                fea_prev = torch.zeros_like(x)

            if i < SEQ - 1:
                # intermediate frames: run the shared layer without building a graph
                self.eval()
                with torch.no_grad():
                    fea_prev = self.mlp_fea(x + fea_prev)
                self.train()
            else:
                # last frame: normal forward, the only path that should backprop
                fea_prev = self.mlp_fea(x + fea_prev)
                out = self.mlp_out(fea_prev)
            if i == SEQ - 1:
                loss = nn.L1Loss(reduction='mean')(out, labels)
                return loss
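
For clarity, the detach variant I have in mind would only change the per-frame branch (a sketch):

if i < SEQ - 1:
    # intermediate frames: cut the graph instead of disabling it with no_grad
    fea_prev = self.mlp_fea(x + fea_prev).detach()
else:
    fea_prev = self.mlp_fea(x + fea_prev)
    out = self.mlp_out(fea_prev)

As far as I can tell, the gradients end up the same as with the no_grad version, since only the last frame's call contributes to the loss.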

Could you explain how your use case is related to DDP, please? I don't see the connection.

@ptrblck I use DDP to wrap the DummyNet above and get the same error in the log:

'This error indicates that your module has parameters that were not used in producing its output (the return value of forward)'
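
Roughly, the wrapping looks like this (a sketch; the distributed launch, DistributedSampler, and device handling are omitted, and rank stands for the local GPU index):

model = DummyNet().cuda(rank)
ddp_model = torch.nn.parallel.DistributedDataParallel(
    model, device_ids=[rank]
    # adding find_unused_parameters=True here silences the error, but then the weights don't update
)
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

loader = torch.utils.data.DataLoader(DummyDataset(), batch_size=4)
for inputs, labels in loader:
    loss = ddp_model(inputs, labels)  # forward returns the last-frame loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()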

If you don’t want to synchronize the gradients, wrap the forward pass into a no_sync() context.
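
A minimal sketch of what that looks like (reusing ddp_model, inputs, labels, and optimizer from the example above; backward calls inside the context accumulate gradients locally, and the first backward outside the context all-reduces them):

with ddp_model.no_sync():
    loss = ddp_model(inputs, labels)
    loss.backward()   # gradients accumulate locally, no all-reduce

loss = ddp_model(inputs, labels)
loss.backward()       # this backward synchronizes the accumulated gradients
optimizer.step()
optimizer.zero_grad()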