Char_rnn_generation_tutorial: why is a loop used and not a sequence length?

In general, can I pass the whole sequence (via the sequence-length dimension) as input to a normal RNN, instead of calculating the loss in a loop?

Here is the code snippet:

for i in range(input_line_tensor.size(0)):
    output, hidden = rnn(category_tensor, input_line_tensor[i], hidden)
    l = criterion(output, target_line_tensor[i])
    loss += l

To learn the temporal dependencies between characters, the author computes the loss by feeding one character per time step.

My question is: the PyTorch docs for nn.RNN define the input as

input of shape (seq_len, batch, input_size): tensor containing the features of the input sequence.

so would I get the same loss if the entire sequence above were passed as a single tensor rather than in a for loop?

My take: class RNN(nn.Module) implements just an RNN cell, not a complete RNN layer like nn.LSTM or nn.GRU, which hide the loop over the time steps when you feed them the whole sequence. However, for text generation it’s still common to feed letters or words step by step, since you have to calculate the loss at each step anyway.

The PyTorch docs for nn.RNN do not apply here, since class RNN(nn.Module) is custom code that has nothing to do with PyTorch’s nn.RNN.
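To make the distinction concrete, here is a minimal sketch of a tutorial-style cell (the RNNCellLike name and all sizes are made up; this is not the tutorial’s exact module): forward() handles a single time step, so the loop over the sequence has to live in user code, which is exactly the loop nn.RNN or nn.GRU would run internally.

```python
import torch
import torch.nn as nn

class RNNCellLike(nn.Module):
    """Processes ONE time step per forward() call, like the tutorial's custom RNN."""
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)

    def forward(self, x, hidden):
        combined = torch.cat((x, hidden), dim=1)   # (batch, input+hidden)
        hidden = torch.tanh(self.i2h(combined))    # next hidden state
        output = self.i2o(combined)                # per-step output
        return output, hidden

cell = RNNCellLike(input_size=10, hidden_size=16, output_size=10)
seq = torch.randn(7, 1, 10)            # (seq_len, batch=1, input_size)
hidden = torch.zeros(1, 16)

# The loop that nn.RNN / nn.GRU would hide inside a single call:
for t in range(seq.size(0)):
    output, hidden = cell(seq[t], hidden)

print(output.shape)   # torch.Size([1, 10])
```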

But given the full sequence as input, would nn.RNN() compute the loss per sequence step, or would the loss be computed over the whole batch at once?

For example, suppose I have sequences of length 5, batches of size 10, with 20 input features and 20 output features (after a fully connected layer).
For an input tensor of size (5, 10, 20), would both of the approaches below yield the same loss?

  • if I loop over the first dimension 5 times, I get an output of size (1, 10, 20)
  • and compute a cross-entropy loss for the 10 batch elements (sequence length one), calling loss.backward() in each loop step, i.e., 5 times per iteration


  • if I send the entire sequence in a single pass, I get an output of size (5, 10, 20)
  • and compute the loss over all 5*10 flattened tokens, calling loss.backward() only once per iteration?

I’m not sure whether both approaches would yield exactly the same loss for the same network model – and, apart from my last comment below, I’m not even sure it has to be exactly the same loss.

For text generation models using RNNs, I usually see the first approach, where the loss is computed for each word/letter step by step; see this PyTorch Seq2Seq tutorial for example, where the loss is computed between each input and target word of the sequence and summed up. I follow this approach for an RNN autoencoder, which is essentially just a Seq2Seq model.

For a CNN autoencoder, I use the second approach, since there are no time steps, i.e., the whole output sequence is generated at once. Here, after the forward() pass I loop over each input-target pair to calculate the loss between the words of all pairs.

loss.backward() in each loop step

I’ve never seen loss.backward() inside the loop itself. It might even be wrong; I’m not sure it would work at all. I’ve only ever seen the loss aggregated (summed) within the loop, with backward() called once after the loop is finished.
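For what it’s worth, backward() inside the loop does run, but only with retain_graph=True: the carried hidden state keeps earlier steps’ graph alive, so the default would raise an error on the second call. And since gradients are additive, per-step backward() accumulates the same gradients as a single backward() on the summed loss. A small sketch with made-up sizes:

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
cell = nn.RNNCell(4, 8)            # made-up sizes
head = nn.Linear(8, 4)
criterion = nn.CrossEntropyLoss()
x = torch.randn(6, 3, 4)           # (seq_len, batch, input_size)
y = torch.randint(0, 4, (6, 3))

# Usual pattern: aggregate the loss inside the loop, backward() once after it.
hidden = torch.zeros(3, 8)
loss = torch.zeros(())
for t in range(x.size(0)):
    hidden = cell(x[t], hidden)
    loss = loss + criterion(head(hidden), y[t])
loss.backward()                    # one pass through the whole unrolled graph

# Variant: backward() inside the loop, which needs retain_graph=True because
# the carried hidden state keeps earlier steps' graph alive.
cell2, head2 = copy.deepcopy(cell), copy.deepcopy(head)
for p in list(cell2.parameters()) + list(head2.parameters()):
    p.grad = None                  # deepcopy also copies .grad, so clear it
hidden = torch.zeros(3, 8)
for t in range(x.size(0)):
    hidden = cell2(x[t], hidden)
    criterion(head2(hidden), y[t]).backward(retain_graph=True)

# Both variants accumulate the same gradients:
print(torch.allclose(cell.weight_ih.grad, cell2.weight_ih.grad, atol=1e-4))
```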