I am a rookie in PyTorch. I'm trying to build a simple NMT model with attention applied to a bidirectional LSTM.
def forward(self, x, h, c, s, loss_fn, y_true):
    loss_val = 0
    self.batch_size = x.shape[0]
    activations = self.encoder(x, h, c)
    for tx in range(self.max_tx):
        alphas = self.attention(activations, s)
        shape = activations.shape
        activations_dot = activations.reshape(shape[1], shape[0], shape[2])
        context = torch.tensordot(alphas, activations_dot, dims=2).unsqueeze(dim=1)
        output, s = self.rnn(context, s)
        s = self.tanh(s)
        y = self.dec_fc2(s)
        y = self.dec_softmax(y)
        loss_val = loss_val + loss_fn(y.squeeze(), y_true[:, tx])
    return loss_val
Here’s my train block
for epoch in range(epochs):
    loss_fn = 0
    h0, c0 = model.init_hidden()
    s = model.init_hidden(1)
    for xt, yt in train_loader:
        loss_fn = model.forward(xt, h0, c0, s, loss, yt)
        optimizer.zero_grad()
        loss_fn.backward()
        optimizer.step()
    print(loss_fn)
loss is NLLLoss.
The loss is the same or almost the same at every epoch (the differences are minute, on the order of 1e-6). I also tried printing the parameters, and they are identical before and after backward. I need help solving this issue.
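For reference, this is roughly how I check the parameters, inside the batch loop of the training code above (a minimal sketch, the snapshot logic is just illustrative):

import torch

# Snapshot the parameters before the update
before = {name: p.detach().clone() for name, p in model.named_parameters()}

loss_val = model.forward(xt, h0, c0, s, loss, yt)
optimizer.zero_grad()
loss_val.backward()
optimizer.step()

# Compare every parameter with its snapshot and check that it actually received a gradient
for name, p in model.named_parameters():
    changed = not torch.equal(before[name], p.detach())
    grad_norm = p.grad.norm().item() if p.grad is not None else None
    print(name, "changed:", changed, "grad norm:", grad_norm)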
Thanks in advance.
Without seeing the full model, and without any comments, it's difficult to guess what's going on here. Maybe you want to have a look at the working notebook implementing NMT using an RNN-based encoder-decoder architecture.
Thank you for the reply, Dr Chris. I was going through the notebook, which is exactly what I wanted. Now I have a doubt: why are the encoder and decoder optimized with separate optimizers in every epoch?
Here's my understanding. Consider a sentence of length 4, say "My name is X" (assume translation from English to some other language). The encoder produces four hidden states, and H_4, the final state, is a condensed representation of all of them. I pass H_4 to another RNN block and use it to generate the first word of the translation, with the target being the word "My" (a 100x1 vector, for a vocabulary of 100 words). I then feed the probabilities obtained at the first step back in as input to generate the next word of the sequence. From my understanding, isn't this all one sequential process? Isn't this the same as having one optimizer and calling optimizer.step() once?
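To make my question concrete, here is a toy sketch of the kind of loop I have in mind (the modules, sizes and names are made up, not the actual notebook code):

import torch
import torch.nn as nn
import torch.optim as optim

# Toy sizes, just for illustration
vocab_size, hidden_size, target_len = 100, 32, 4

encoder_rnn = nn.GRU(vocab_size, hidden_size, batch_first=True)
decoder_rnn = nn.GRU(vocab_size, hidden_size, batch_first=True)
out_proj = nn.Linear(hidden_size, vocab_size)
loss_fn = nn.NLLLoss()

params = list(encoder_rnn.parameters()) + list(decoder_rnn.parameters()) + list(out_proj.parameters())
optimizer = optim.Adam(params, lr=0.0005)

x = torch.randn(1, 4, vocab_size)                       # source sentence of length 4
y_true = torch.randint(0, vocab_size, (1, target_len))  # target word indices

_, h = encoder_rnn(x)                     # h plays the role of H_4
dec_in = torch.zeros(1, 1, vocab_size)    # start-of-sequence input
loss_val = 0
for t in range(target_len):
    dec_out, h = decoder_rnn(dec_in, h)
    log_probs = torch.log_softmax(out_proj(dec_out.squeeze(1)), dim=-1)
    loss_val = loss_val + loss_fn(log_probs, y_true[:, t])
    dec_in = log_probs.exp().unsqueeze(1)  # feed step-t probabilities back in as the next input

optimizer.zero_grad()
loss_val.backward()    # one backward pass through the whole unrolled sequence
optimizer.step()       # one optimizer step updating encoder and decoder together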
Thanks
# Create model
model = RnnAttentionSeq2Seq(params, nn.CrossEntropyLoss()).to(device)
# Define optimizers for encoder and decoder
encoder_optimizer = optim.Adam(model.encoder.parameters(), lr=0.0005)
decoder_optimizer = optim.Adam(model.decoder.parameters(), lr=0.0005)
Couldn't this be replaced with:
# Create model
model = RnnAttentionSeq2Seq(params, nn.CrossEntropyLoss()).to(device)
# Define optimizers for encoder and decoder
optimizer = optim.Adam(model.parameters(), lr=0.0005)
As long as the learning rates are the same, both alternatives should be equivalent. You are definitely right, the whole training is end-to-end.
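To see this concretely, here is a minimal self-contained sketch (with a toy two-module model standing in for the actual seq2seq) showing that the two setups produce identical updates:

import copy
import torch
import torch.nn as nn
import torch.optim as optim

# A toy "model" with an encoder and a decoder sub-module, just for illustration
class ToySeq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 8)
        self.decoder = nn.Linear(8, 8)

    def forward(self, x):
        return self.decoder(self.encoder(x)).sum()

torch.manual_seed(0)
model_a = ToySeq2Seq()
model_b = copy.deepcopy(model_a)   # identical initial weights
x = torch.randn(4, 8)

# Variant 1: separate optimizers for encoder and decoder
enc_opt = optim.Adam(model_a.encoder.parameters(), lr=0.0005)
dec_opt = optim.Adam(model_a.decoder.parameters(), lr=0.0005)
loss_a = model_a(x)
enc_opt.zero_grad()
dec_opt.zero_grad()
loss_a.backward()
enc_opt.step()
dec_opt.step()

# Variant 2: one optimizer over all parameters
opt = optim.Adam(model_b.parameters(), lr=0.0005)
loss_b = model_b(x)
opt.zero_grad()
loss_b.backward()
opt.step()

# The resulting parameters match, since every parameter is updated by exactly one Adam state
for (name, pa), (_, pb) in zip(model_a.named_parameters(), model_b.named_parameters()):
    print(name, torch.allclose(pa, pb))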