RNNs: compute loss on last item in sequence or whole sequence?

I’m implementing a simple character RNN to learn about RNNs and PyTorch at the same time. I’m looking at the example here:

Suppose a training sequence and its y values look like: x => ‘a b c d e’, y => ‘b c d e f’
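For concreteness, here is a minimal sketch of how one such (x, y) pair could be built, with y being x shifted by one character; the text, sequence length, and 26-letter vocabulary below are just illustrative assumptions:

```python
# Illustrative sketch: building one (x, y) pair where y is x shifted by one character.
text = "abcdef"
seq_len = 5

x_chars = text[:seq_len]        # 'abcde'
y_chars = text[1:seq_len + 1]   # 'bcdef'

# Map characters to integer indices over an assumed 26-letter vocabulary.
char_to_idx = {chr(ord('a') + i): i for i in range(26)}
x = [char_to_idx[c] for c in x_chars]
y = [char_to_idx[c] for c in y_chars]
print(x, y)  # [0, 1, 2, 3, 4] [1, 2, 3, 4, 5]
```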

The forward() method on LSTM returns an output tensor of shape (sequence_length, batch_size, hidden_size); after projecting to the vocabulary, the last dimension is the vocabulary size. If the vocabulary size is 26 and we are using a batch of one, the model’s forward() will return a tensor of shape (5, 1, 26).
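As a quick sanity check of those shapes, here is a rough sketch; the hidden size of 64 and the final linear projection to the vocabulary are my assumptions, not taken from the example:

```python
import torch
import torch.nn as nn

# Shape check for a 5-step sequence with batch size 1 and a 26-character vocabulary.
vocab_size, hidden_size, seq_len, batch = 26, 64, 5, 1

lstm = nn.LSTM(input_size=vocab_size, hidden_size=hidden_size)
proj = nn.Linear(hidden_size, vocab_size)

x = torch.randn(seq_len, batch, vocab_size)  # stand-in for a one-hot encoded sequence
out, (h_n, c_n) = lstm(x)                    # out: (seq_len, batch, hidden_size) = (5, 1, 64)
logits = proj(out)                           # logits: (seq_len, batch, vocab_size) = (5, 1, 26)
print(out.shape, logits.shape)
```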

My question is: do we compute the loss of this tensor against all five elements in y, or just the last element? The examples above look like they return just the last predicted element from forward():

Then the loss must be computed against only the last element in y, ‘f’ (if I understand this right!). Is this what we want? Don’t we want to reinforce the predictions made at each item in the sequence?

Thanks for your help and apologies if I’m way off.

You usually compute it over all timesteps, but it depends on the task.

As you say, the forward() method on LSTM returns a tensor of shape (sequence_length, batch_size, hidden_size). If you want the loss against all elements in y, just use the whole tensor. However, if you only want the last element in y, you can index it with

fx[-1]

It depends on what you want and the task.
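To make the two options concrete, here is a small sketch with nn.CrossEntropyLoss, continuing the (5, 1, 26) example; the logits and targets below are random stand-ins rather than real model outputs:

```python
import torch
import torch.nn as nn

# Both options on a (5, 1, 26) output; logits and y are random placeholders here.
criterion = nn.CrossEntropyLoss()

logits = torch.randn(5, 1, 26)        # (seq_len, batch, vocab_size)
y = torch.randint(0, 26, (5, 1))      # target indices, (seq_len, batch)

# Option 1: loss over all five time steps -- flatten time and batch together.
loss_all = criterion(logits.view(-1, 26), y.view(-1))

# Option 2: loss on the last time step only, i.e. the prediction for 'f'.
loss_last = criterion(logits[-1], y[-1])
print(loss_all.item(), loss_last.item())
```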

Hi, since cross entropy here is computed one time step at a time, a normal way of training – input a batch of sequences and accumulate the loss over each time step of each sequence – would be implemented as follows (a rough sketch follows the list):

  1. input a batch of sequences, and initialize the hidden state of the model, e.g. an LSTM;
  2. for each sequence, iterate through each single time step and calculate its loss with the cross entropy function. Do not clear out or re-initialize the hidden state in between the time steps of a sequence;
  3. accumulate the loss, and call .backward() once a sequence has been consumed by the model.
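Here is a rough sketch of that loop; the model, sizes, and data below are placeholders I made up, not code from the thread:

```python
import torch
import torch.nn as nn

# Per-time-step training loop: carry the hidden state across steps, accumulate the loss,
# and call backward() once per sequence.
vocab_size, hidden_size = 26, 64
lstm = nn.LSTM(input_size=vocab_size, hidden_size=hidden_size)
proj = nn.Linear(hidden_size, vocab_size)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(list(lstm.parameters()) + list(proj.parameters()), lr=0.1)

x = torch.randn(5, 1, vocab_size)         # one sequence: (seq_len, batch, input_size)
y = torch.randint(0, vocab_size, (5, 1))  # target index for each time step

hidden = None                             # 1. initialize the hidden state
loss = 0.0
for t in range(x.size(0)):                # 2. iterate over single time steps
    out, hidden = lstm(x[t:t+1], hidden)  #    keep the hidden state between steps
    logits = proj(out[0])                 #    (batch, vocab_size)
    loss = loss + criterion(logits, y[t]) #    accumulate the per-step loss

optimizer.zero_grad()
loss.backward()                           # 3. backward once per sequence
optimizer.step()
```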

I am wondering how the loss is computed in this implementation.
In BPTT, suppose we have 4 time steps per input sequence (hence 4 loss terms, one per time step); BPTT gives us a total loss L = L_4 + L_3 + L_2 + L_1, where L_n corresponds to time step n and has the form L_n = l_n^n + l_(n-1)^(n-1) + … + l_1^1, for n = 1…4.

However, in this implementation we only forward a single time step each time we compute the loss for that step. Does that mean the L_n we get from the implementation is instead of the form L_n = l_1^1, regardless of n?