RNNs: compute loss on last item in sequence or whole sequence?

I’m looking at implementing a simple character RNN to learn about RNNs and PyTorch at the same time. I’m looking at the example here:

Suppose a training sequence and its targets look like: x => ‘a b c d e’, y => ‘b c d e f’

The forward() method on LSTM returns a tensor of shape (sequence_length, num sequences in batch, hidden size). If the hidden size equals the vocabulary size of 26 (or a linear layer maps each output to 26 scores), and we are using a batch of one, then forward() will return a tensor of shape (5, 1, 26).
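A quick shape check can make this concrete. This is only a sketch: `hidden_size` is set equal to the vocabulary size purely so the numbers match the post, and the input is a dummy tensor rather than real one-hot characters.

```python
import torch
import torch.nn as nn

vocab_size = 26               # assumed: lowercase letters a-z
seq_len, batch_size = 5, 1

# hidden_size == vocab_size is an assumption made only to reproduce (5, 1, 26)
lstm = nn.LSTM(input_size=vocab_size, hidden_size=vocab_size)

x = torch.zeros(seq_len, batch_size, vocab_size)  # dummy input, default layout (seq, batch, feature)
output, (h_n, c_n) = lstm(x)

print(output.shape)  # torch.Size([5, 1, 26]): one vector per time step
print(h_n.shape)     # torch.Size([1, 1, 26]): final hidden state only
```

For a single-layer, unidirectional LSTM, `output[-1]` and `h_n[0]` hold the same values, so "the last element" can be read from either.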

My question is, do we compute the loss of this tensor against all five elements in y, or just the last element? The examples above look like they return just the last predicted element from forward():

Then the loss must be computed against only the last element in y, ‘f’ (if I understand this right!). Is this what we want? Don’t we want to reinforce the predictions made at each item in the sequence?

Thanks for your help and apologies if I’m way off.

On most tasks you compute the loss over all time steps, but it depends on the task.

As you say, the forward() method on LSTM returns one output per time step. If you want to compare against all elements in y, just use the whole output. However, if you only want the last element, you can index the last time step of the output.


It depends on what you want and the task.
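To make the two options concrete, here is a minimal sketch of both losses on the ‘a b c d e’ example. The `decoder` layer, the `encode` helper and `hidden_size = 32` are assumptions for illustration; they are not taken from the example the question refers to.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, hidden_size = 26, 32      # hidden_size is an arbitrary choice

lstm = nn.LSTM(input_size=vocab_size, hidden_size=hidden_size)
decoder = nn.Linear(hidden_size, vocab_size)   # hypothetical projection to character scores
criterion = nn.CrossEntropyLoss()

def encode(text):
    """Hypothetical helper: map lowercase letters to indices 0..25."""
    return torch.tensor([ord(ch) - ord("a") for ch in text])

x_idx = encode("abcde")
y_idx = encode("bcdef")

# One-hot inputs with shape (seq_len, batch=1, vocab_size)
x = nn.functional.one_hot(x_idx, vocab_size).float().unsqueeze(1)

output, _ = lstm(x)                    # (5, 1, hidden_size)
logits = decoder(output).squeeze(1)    # (5, vocab_size)

# Option 1: loss over every time step (mean of the five per-step losses)
loss_all = criterion(logits, y_idx)

# Option 2: loss on the last prediction only, i.e. against 'f'
loss_last = criterion(logits[-1:], y_idx[-1:])
```

With option 1 every position contributes a training signal; with option 2 only the final prediction does, which is the usual choice when the task has a single label per sequence.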

Hi, cross entropy is computed per time step, so a normal way of training (input a batch of sequences and accumulate the loss over each time step of each sequence) can be implemented as follows:

  1. input a batch of sequences, and initialize the hidden state of the model, e.g. an LSTM;
  2. for each sequence, iterate through each time step and calculate its loss using the cross entropy function. Do not clear or re-initialize the hidden state between time steps of a sequence;
  3. accumulate the loss, and call .backward() once the whole sequence has been consumed by the model.
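The steps above can be sketched as follows. The model sizes, the `decoder` layer and the random data are assumptions for illustration only. The last lines also compare the result with a single call on the flattened outputs, since `nn.CrossEntropyLoss` accepts all time steps at once when they are reshaped to (seq_len * batch, vocab_size).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, hidden_size, seq_len, batch_size = 26, 32, 5, 4   # arbitrary sizes

lstm = nn.LSTM(vocab_size, hidden_size)
decoder = nn.Linear(hidden_size, vocab_size)        # hypothetical output layer
criterion = nn.CrossEntropyLoss(reduction="sum")    # sum so per-step losses simply add up

# Random stand-in data with shape (seq_len, batch)
x_idx = torch.randint(vocab_size, (seq_len, batch_size))
y_idx = torch.randint(vocab_size, (seq_len, batch_size))
x = nn.functional.one_hot(x_idx, vocab_size).float()   # (seq_len, batch, vocab_size)

# Step 1: initialize the hidden and cell state once per batch of sequences
h = torch.zeros(1, batch_size, hidden_size)
c = torch.zeros(1, batch_size, hidden_size)

# Step 2: feed one time step at a time, carrying (h, c) forward without resetting
loss_loop = 0.0
for t in range(seq_len):
    out_t, (h, c) = lstm(x[t:t + 1], (h, c))           # out_t: (1, batch, hidden_size)
    loss_loop = loss_loop + criterion(decoder(out_t[0]), y_idx[t])

# Step 3: backpropagate once, after the whole sequence has been consumed
loss_loop.backward()

# For comparison: same loss from one forward pass over the full sequence
output, _ = lstm(x)
loss_flat = criterion(decoder(output).reshape(-1, vocab_size), y_idx.reshape(-1))
print(loss_loop.item(), loss_flat.item())
```

In this sketch the two values agree up to floating-point error, so the explicit loop is mainly useful when something has to happen between time steps (for example sampling the next input).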

I am wondering how the loss is computed in this implementation.
In BPTT, suppose we have 4 time steps per input sequence (hence 4 loss terms, one per time step); BPTT gives us a total loss L = L_4 + L_3 + L_2 + L_1, where L_n corresponds to time step n and has the form L_n = l_n^n + l_(n-1)^(n-1) + … + l_1^1, for n = 1, …, 4.

However, in the implementation, we forward only a single time step at a time and compute the loss for that step. Does that mean the L_n we get with the implementation instead has the form L_n = l_1^1, regardless of n?
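One way to probe this empirically is to check whether the loss at the last time step still produces gradients for earlier inputs. The sketch below is an assumption-laden toy (tiny random LSTM, a plain sum standing in for the loss, hypothetical helper name `grad_per_input_step`). It compares carrying the hidden state forward as-is with detaching it between steps.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size, seq_len = 3, 4, 4    # arbitrary toy sizes
lstm = nn.LSTM(input_size, hidden_size)

def grad_per_input_step(detach_hidden):
    """Gradient magnitude of the final-step 'loss' w.r.t. each input time step."""
    x = torch.randn(seq_len, 1, input_size, requires_grad=True)
    h = torch.zeros(1, 1, hidden_size)
    c = torch.zeros(1, 1, hidden_size)
    for t in range(seq_len):
        out_t, (h, c) = lstm(x[t:t + 1], (h, c))
        if detach_hidden:
            # Cutting the graph here removes the dependence on earlier steps
            h, c = h.detach(), c.detach()
    out_t.sum().backward()                    # stand-in for the loss at the last step only
    return x.grad.abs().sum(dim=(1, 2))       # one number per time step

print(grad_per_input_step(detach_hidden=False))  # every step receives gradient
print(grad_per_input_step(detach_hidden=True))   # only the last step receives gradient
```

In this toy, forwarding one step at a time does not by itself truncate backpropagation: as long as the hidden state is passed along without detaching, autograd keeps the graph through all earlier steps.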