I am a rookie in PyTorch. I’m trying to build a simple NMT model with attention applied to a bidirectional LSTM.

def forward(self, x, h, c, s, loss_fn, y_true):
    loss_val = 0
    self.batch_size = x.shape[0]
    activations = self.encoder(x, h, c)
    for tx in range(self.max_tx):
        alphas = self.attention(activations, s)
        shape = activations.shape
        activations_dot = activations.reshape(shape[1], shape[0], shape[2])
        context = torch.tensordot(alphas, activations_dot, dims=2).unsqueeze(dim=1)
        output, s = self.rnn(context, s)
        s = self.tanh(s)
        y = self.dec_fc2(s)
        y = self.dec_softmax(y)
        loss_val = loss_val + loss_fn(y.squeeze(), y_true[:, tx])
    return loss_val
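
A note on the line `activations.reshape(shape[1], shape[0], shape[2])`: `reshape` keeps the elements in memory order and only reinterprets the shape, so it is not the same as swapping the first two axes. If the intent is to exchange the batch and time axes (an assumption, since the encoder's output layout is not shown), `transpose(0, 1)` or `permute(1, 0, 2)` is the usual choice. A small sketch with made-up sizes:

```python
import torch

# Hypothetical sizes: 2 sequences, 3 time steps, 4 features.
activations = torch.arange(2 * 3 * 4).reshape(2, 3, 4)

reshaped = activations.reshape(3, 2, 4)   # same memory order, new shape
swapped = activations.transpose(0, 1)     # axes 0 and 1 actually exchanged

# Both results have shape (3, 2, 4) ...
same_shape = reshaped.shape == swapped.shape
# ... but the elements end up in different positions.
same_values = torch.equal(reshaped, swapped)

# After a true swap, swapped[t, b] is activations[b, t].
row_matches = torch.equal(swapped[1, 0], activations[0, 1])

print(same_shape, same_values, row_matches)
```

If the attention weights are indexed per time step, a silent mix-up like this could pair each weight with the wrong activation vector without raising any shape error.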

Here’s my training loop:

for epoch in range(epochs):
    loss_fn = 0
    h0, c0 = model.init_hidden()
    s = model.init_hidden(1)
    for xt, yt in train_loader:
        loss_fn = model.forward(xt, h0, c0, s, loss, yt)
        optimizer.zero_grad()
        loss_fn.backward()
        optimizer.step()
    print(loss_fn)

`loss` is `nn.NLLLoss`.
The loss is the same or almost the same at every epoch (differences on the order of 1e-6), and when I print the parameters they are the same before and after backward. I need help solving this issue.
Thanks in advance.
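
Note that `backward()` only fills the `.grad` fields; parameters change only in `optimizer.step()`, so comparing them before and after `backward()` will always show no difference. A minimal sketch of a more informative check, using a stand-in `nn.Linear` model and random data (both hypothetical, not the model from the post):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in model and data, only for illustration.
model = nn.Linear(4, 3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(8, 4)
y = torch.randint(0, 3, (8,))

# Snapshot the parameters before the update.
before = {name: p.detach().clone() for name, p in model.named_parameters()}

optimizer.zero_grad()
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()

# backward() alone leaves the parameters untouched.
unchanged_after_backward = all(
    torch.equal(before[name], p.detach()) for name, p in model.named_parameters()
)

# Gradient norms: None or 0.0 means no gradient reached that parameter.
grad_norms = {
    name: (None if p.grad is None else p.grad.norm().item())
    for name, p in model.named_parameters()
}

optimizer.step()

# Only now should the parameters differ from the snapshot.
changed_after_step = any(
    not torch.equal(before[name], p.detach()) for name, p in model.named_parameters()
)

print(grad_norms)
```

Running the same check on the real model would show whether some parameters receive no gradient at all (`None`), a vanishing gradient, or a normal gradient that the optimizer is simply not applying.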

Without seeing the model and any comments, it’s difficult to guess what’s going on here. Maybe you want to have a look at the working notebook implementing NMT using an RNN-based encoder-decoder architecture.
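
One thing that can be checked without the full model: `nn.NLLLoss` expects log-probabilities, so it is normally paired with `nn.LogSoftmax` rather than `nn.Softmax`. Whether `dec_softmax` in the post is one or the other is not shown, so this is only a possible cause. A small sketch with random logits (sizes are made up):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

logits = torch.randn(5, 10)            # hypothetical: 5 samples, 10 classes
target = torch.randint(0, 10, (5,))

nll = nn.NLLLoss()

# Intended usage: log-probabilities in; this matches CrossEntropyLoss on raw logits.
loss_log_softmax = nll(torch.log_softmax(logits, dim=1), target)
loss_cross_entropy = nn.CrossEntropyLoss()(logits, target)

# Plain probabilities in: the result is just -mean(p[target]), which is
# bounded in [-1, 0] and is not a proper negative log-likelihood.
loss_softmax = nll(torch.softmax(logits, dim=1), target)

print(loss_log_softmax.item(), loss_cross_entropy.item(), loss_softmax.item())
```

A training loss that is negative and barely moves would be consistent with the second case.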

Thank you for the reply, Dr. Chris. I went through the notebook, and it is exactly what I wanted. Now I have a question: why are the encoder and decoder optimized separately in every epoch?
Here’s my understanding. Consider a sentence of length 4, “My name is X” (assume translation from English to some other language). There are four hidden states. H_4, the final state, is a condensed representation of all the hidden states. I pass it to another RNN block and use H_4 to generate the first word of the translation, with the word “My” as the target (a 100x1 vector for a 100-word vocabulary). I then pass the probabilities obtained for the first word as input to generate the next word in the sequence. From my understanding, isn’t this all one sequential process? Is this the same as having one optimizer and calling optimizer.step() once?
Thanks

# Create model
model = RnnAttentionSeq2Seq(params, nn.CrossEntropyLoss()).to(device)
# Define optimizers for encoder and decoder
encoder_optimizer = optim.Adam(model.encoder.parameters(), lr=0.0005)
decoder_optimizer = optim.Adam(model.decoder.parameters(), lr=0.0005)

Could this not be replaced with the following?

# Create model
model = RnnAttentionSeq2Seq(params, nn.CrossEntropyLoss()).to(device)
# Define a single optimizer for the whole model
optimizer = optim.Adam(model.parameters(), lr=0.0005)

As long as the learning rates are the same, both alternatives should be equivalent. You are definitely right: the whole training is end-to-end.
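
The equivalence can be checked on a toy model. The sketch below uses a hypothetical `TinySeq2Seq` class with `encoder` and `decoder` sub-modules (a stand-in, not the notebook's `RnnAttentionSeq2Seq`) and trains two identical copies, one with two Adam optimizers and one with a single Adam optimizer:

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for an encoder-decoder model.
class TinySeq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(6, 8)
        self.decoder = nn.Linear(8, 4)

    def forward(self, x):
        return self.decoder(torch.tanh(self.encoder(x)))

model_a = TinySeq2Seq()
model_b = copy.deepcopy(model_a)   # identical initial weights

# Variant A: one optimizer per sub-module.
enc_opt = torch.optim.Adam(model_a.encoder.parameters(), lr=0.0005)
dec_opt = torch.optim.Adam(model_a.decoder.parameters(), lr=0.0005)
# Variant B: a single optimizer over all parameters.
opt = torch.optim.Adam(model_b.parameters(), lr=0.0005)

loss_fn = nn.CrossEntropyLoss()
x = torch.randn(16, 6)
y = torch.randint(0, 4, (16,))

for _ in range(5):
    enc_opt.zero_grad()
    dec_opt.zero_grad()
    loss_fn(model_a(x), y).backward()
    enc_opt.step()
    dec_opt.step()

    opt.zero_grad()
    loss_fn(model_b(x), y).backward()
    opt.step()

# Compare the two models parameter by parameter.
params_match = all(
    torch.allclose(pa, pb)
    for pa, pb in zip(model_a.parameters(), model_b.parameters())
)
print(params_match)
```

This holds because Adam keeps its state per parameter, so how the parameters are grouped into optimizers does not affect the update. It assumes every parameter of the model belongs to either the encoder or the decoder; any parameter outside both would be updated only in the single-optimizer variant.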