I want to alternately update two networks, say `m1` and `m2`. Both networks are part of another `nn.Module`. For simplicity, we can assume that both `m1` and `m2` are linear layers.
Currently, I have implemented this as shown below; however, my loss remains nearly constant, which suggests that the gradient updates are not being applied correctly.
```python
optimize_m1 = True
for x1, x2 in train_loader:
    m1_optimizer.zero_grad()
    m2_optimizer.zero_grad()
    loss = model(x1, x2, optimize_m1=optimize_m1)
    loss.backward()
    if optimize_m1:
        m1_optimizer.step()
    else:
        m2_optimizer.step()
    optimize_m1 = not optimize_m1
```
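To check whether the gradients actually reach both layers, a small inspection could be dropped into the loop right after `loss.backward()` (this is not in my current code, just a debugging idea):

```python
# Hypothetical debugging snippet: print the gradient norm of every parameter
# right after loss.backward() to confirm gradients reach both m1 and m2.
for name, p in model.named_parameters():
    grad_norm = p.grad.norm().item() if p.grad is not None else None
    print(name, grad_norm)
```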
The model is defined as below:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Model(nn.Module):
    def __init__(self, in_dim, out_dim) -> None:
        super().__init__()
        self.m1 = nn.Linear(in_dim, out_dim)
        self.m2 = nn.Linear(in_dim, out_dim)
        self.act = Activation()  # elementwise activation, e.g. nn.ReLU()

    def forward(self, x1, x2, optimize_m1=True):  # flag is passed in but not used inside forward
        out1 = self.m1(x1)
        out2 = self.m2(x2)
        out1 = self.act(out1)
        out2 = self.act(out2)
        loss = F.mse_loss(out1, out2)
        return loss
```
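For completeness, this is roughly how I instantiate and call it (the dimensions, batch size, and the `Activation = nn.ReLU` stand-in are made up for illustration; this continues from the definitions above):

```python
Activation = nn.ReLU  # hypothetical stand-in for the activation I actually use

model = Model(in_dim=8, out_dim=4)   # made-up dimensions
x1 = torch.randn(16, 8)              # made-up batch of inputs for m1
x2 = torch.randn(16, 8)              # made-up batch of inputs for m2
loss = model(x1, x2, optimize_m1=True)
print(loss.item())                   # scalar MSE between the two activated outputs
```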
For the optimizers, I am using SGD, defined as:

```python
m1_optimizer = torch.optim.SGD(model.m1.parameters(), lr=config.lr)
m2_optimizer = torch.optim.SGD(model.m2.parameters(), lr=config.lr)
```
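To confirm that the two optimizers hold disjoint parameter sets, a quick sanity check could be added (again, just a debugging idea, not part of my current code):

```python
# Hypothetical sanity check: each optimizer should hold exactly its own layer's
# parameters, and the two sets should not overlap.
m1_params = {id(p) for p in model.m1.parameters()}
m2_params = {id(p) for p in model.m2.parameters()}
opt1_params = {id(p) for g in m1_optimizer.param_groups for p in g["params"]}
opt2_params = {id(p) for g in m2_optimizer.param_groups for p in g["params"]}
print(opt1_params == m1_params, opt2_params == m2_params)  # expect: True True
print(opt1_params & opt2_params)                           # expect: empty set
```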
Also, for the loss, do I need to use `out1` and `out2` alternately while optimizing the model? I am pretty sure that this simple architecture works, because I replicated it with linear layers in NumPy and that code trains correctly. I am trying to figure out where I am going wrong with the PyTorch implementation.
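In case it helps clarify that last question: my assumption is that "using them alternately" would mean detaching the branch whose network is not being optimized in the current iteration, so that gradients only flow into the other one. A sketch of what that `forward` might look like (my assumption, not something I have verified):

```python
# Sketch (assumption): detach whichever branch is NOT being optimized this
# iteration, so gradients only flow into the branch that is.
def forward(self, x1, x2, optimize_m1=True):
    out1 = self.act(self.m1(x1))
    out2 = self.act(self.m2(x2))
    if optimize_m1:
        loss = F.mse_loss(out1, out2.detach())  # gradients flow only into m1
    else:
        loss = F.mse_loss(out1.detach(), out2)  # gradients flow only into m2
    return loss
```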