The question is below the code block. My module has two parts, partA and partB, which are triggered conditionally in the forward function:
class Network(nn.Module):
    def __init__(self):
        super(Network, self).__init__()
        self.partA = nn.Sequential(
            # convolutions
        )
        self.partB = nn.Sequential(
            # convolutions,
            nn.Flatten(),
            nn.Linear()
        )

    def forward(self, input_tensor):
        if conditionB:
            return self.partB(input_tensor)
        elif conditionA:  # for clarity; could just use 'else' here
            return self.partB(self.partA(input_tensor))
## initialize and train
model = Network().to(device)
opt = optim.SGD(model.parameters(), lr=0.0001, momentum=0.9)
for epoch in range(num_epochs):
    for i, data in enumerate(trainloader):
        # condition A step
        model.zero_grad()
        outputA = model(input_tensor)
        err = criterion(outputA, labels)
        err.backward()
        opt.step()

        # condition B step
        model.zero_grad()
        outputB = model(input_tensor)
        err = criterion(outputB, labels)
        err.backward()
        opt.step()
During training, when condition B is triggered, gradients flow through partB of the network but not partA. However, will the momentum term in SGD still cause an update to partA when opt.step() is called, even though partA was not part of the computational graph for the condition B step? If so, how can I avoid this? Should I give partA and partB separate optimizers?
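For reference, here is a minimal, self-contained experiment I could run to check this. The toy Linear layers, the class name Toy, and the use_partA flag are placeholders of mine standing in for the conv blocks and conditions above. The outcome appears to hinge on whether zero_grad leaves the unused gradients as zero tensors or sets them to None (the set_to_none argument), since SGD skips parameters whose .grad is None:

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Toy two-part model mirroring the structure in the question.
class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.partA = nn.Linear(4, 4, bias=False)
        self.partB = nn.Linear(4, 1, bias=False)

    def forward(self, x, use_partA):
        if use_partA:
            return self.partB(self.partA(x))
        return self.partB(x)

torch.manual_seed(0)
model = Toy()
opt = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
x = torch.randn(8, 4)

# Step 1: condition A -> both parts get gradients,
# so partA's momentum buffer becomes nonzero.
opt.zero_grad()
model(x, use_partA=True).sum().backward()
opt.step()

# Step 2: condition B, gradients zeroed IN PLACE (kept as zero tensors).
# partA.grad is an all-zero tensor, so SGD still applies its momentum buffer.
opt.zero_grad(set_to_none=False)
before = model.partA.weight.clone()
model(x, use_partA=False).sum().backward()
opt.step()
moved_with_zero_grads = not torch.equal(before, model.partA.weight)

# Step 3: condition B again, but gradients set to None.
# SGD skips parameters whose .grad is None, so partA stays put.
opt.zero_grad(set_to_none=True)
before = model.partA.weight.clone()
model(x, use_partA=False).sum().backward()
opt.step()
moved_with_none_grads = not torch.equal(before, model.partA.weight)

print("moved with zero grads:", moved_with_zero_grads)
print("moved with None grads:", moved_with_none_grads)
```

So the momentum question seems to depend on the zero_grad behavior, which is why I am unsure whether separate optimizers are actually needed.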