Why LSTM models do not require labels for each step

Hello,

For time-related problems such as stock prediction:

Let’s say we have 300 days of data, 10 features, and one target: the price.

Why, for training, do we only need the price of the 300th day?
I know this is how LSTM models work, but wouldn’t it be useful for the model to take the prices of the other 299 days into account as well?

Hi Nolwen,

Because LSTMs produce an output at each timestep, it’s perfectly reasonable to use all of the price information for training. This then becomes a sequence-to-sequence problem. There are various ways to go about it. Assuming you’re using an MSE loss function, you could compute the MSE at each timestep between the LSTM output h_t (passed through a Linear layer to reduce its dimensionality to the size of your label, i.e. 1) and the correct label y_t, then sum the losses before backpropagating. Alternatively, you could treat all output timesteps as a single vector and compute the loss against the full vector of labels. There are many more complex architectures for sequence-to-sequence models, most notably encoder-decoder models, the type commonly used for machine translation.
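A minimal sketch of the per-timestep loss idea (sizes and names are illustrative, not from your dataset):

```python
import torch
import torch.nn as nn

T, f, h = 300, 10, 32          # timesteps, features, hidden size (illustrative)
lstm = nn.LSTM(input_size=f, hidden_size=h)
proj = nn.Linear(h, 1)         # reduce each hidden state to a scalar prediction

x = torch.randn(T, 1, f)       # (seq_len, batch, features) - nn.LSTM's default layout
y = torch.randn(T, 1)          # one label (price) per timestep

out, _ = lstm(x)               # out: (T, 1, h), one hidden state per timestep
pred = proj(out).squeeze(-1)   # (T, 1), one prediction per timestep
loss = nn.functional.mse_loss(pred, y)   # loss over all timesteps at once
loss.backward()                # gradients flow back through every timestep
```

Computing the loss over the full sequence at once is equivalent (up to averaging) to summing per-timestep MSE terms.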

It is also fine to simply use the final label.

Thanks a lot for your answer.

Just to be sure my problem is well understood.
Let’s say we have this kind of dataset (a 3D array of shape (number of elements, number of days, number of features)):

Day 1

         Feature 1   Feature 2   Feature 3   Label
data 1
data 2
…
data n

Day 2

         Feature 1   Feature 2   Feature 3   Label
data 1
data 2
…
data n

Day 3

         Feature 1   Feature 2   Feature 3   Label
data 1
data 2
…
data n

For the moment, I train my model based on the labels of day 3, using the features of day 1, day 2, and day 3. Thus, the labels of day 1 and day 2 are unused with my classic RNN.
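As a concrete sketch of that layout (random data as a stand-in, shapes only):

```python
import torch

n, T, f = 100, 3, 3            # elements, days, features (matching the example above)
X = torch.randn(n, T, f)       # features: (nb of elements, nb of days, nb of features)
Y = torch.randn(n, T)          # one label (the price) per element per day
```

So far only `Y[:, -1]` (the day-3 prices) is used for training; `Y[:, :-1]` is unused.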

Are you saying that a sequence-to-sequence model would use the labels of day 1 and day 2?

Hey, sorry, I’m new to the forums and never received a notification when you replied.

Blockquote
Are you saying that a model like a sequence to sequence would use the labels of day 1 and day 2??

Yep, I believe you’ve understood correctly. Consider my earlier suggestion of computing MSE loss between a vector of outputs and a vector of labels. An LSTM with T inputs will give T outputs, and you have as many labels as you have input vectors (also T). Instead of taking only the final LSTM output at time T, you could take all of the outputs, pass each one through a fully-connected layer to reduce its dimensionality, and then compute the MSE between the vector of outputs and the vector of labels.

More sophisticated sequence-to-sequence models, such as encoder-decoder models, are better suited to machine translation tasks, where the output sequence can differ in length from the input sequence. In your case both are of length T.

Thanks a lot.

If I understand the idea correctly, keeping my example with 3 days of data:

  • The LSTM model will have 3 outputs (one for each day)
  • I should save all the 3 outputs
  • Then, I should feed these 3 outputs into a fully-connected layer and train it with the 3 labels?

Is it the pipeline you are suggesting?

No problem :slight_smile:

So that’s about right, but you don’t want to save the outputs and train the fully-connected layer separately; rather, use the fully-connected layer (a Linear layer in PyTorch) as the final layer of one full network.

  • You have T timesteps (‘days’).
  • Your features are of dimension f.
  • Your LSTM outputs will be whatever the hidden size of your LSTM is; let’s call that h.
  • Your labels (Y) are of dimension o (which is 1).
  • You need to reduce the dimensionality of each h-dimensional LSTM output to o. For this you use a PyTorch Linear layer with h input dimensions and o output dimensions. In most cases, you can pass a sequence to a Linear layer and it will apply the same linear transformation to every vector in the sequence.

Your LSTM output shape will be [T, h]; passing this through the Linear layer gives a [T, o] output. Your labels are also [T, o], so it should be fine to compute the MSE loss over these vectors. Since the o dimension is just 1, you may have to call .squeeze(-1) on the output before passing it to the MSE loss; I’m not exactly sure what shapes PyTorch’s MSE loss expects.

input vector [T, f]:
x1 x2 x3 … xT - f-dimensional inputs to the LSTM
LSTM output [T, h]:
h1 h2 h3 … hT - h-dimensional outputs of the LSTM
Linear layer output [T, o]:
l1 l2 l3 … lT - o-dimensional Linear layer outputs

loss = MSE loss(linear layer output, labels)
Backpropagate through the whole network (by calling loss.backward() then optimizer.step()).
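A minimal end-to-end sketch of that pipeline (batch size 1, batch_first=True, all sizes and names illustrative):

```python
import torch
import torch.nn as nn

T, f, h, o = 3, 3, 16, 1       # timesteps, feature dim, hidden size, output dim

class SeqModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(f, h, batch_first=True)
        self.linear = nn.Linear(h, o)   # applied to every timestep's hidden state

    def forward(self, x):               # x: (batch, T, f)
        out, _ = self.lstm(x)           # out: (batch, T, h)
        return self.linear(out)         # (batch, T, o)

model = SeqModel()
optimizer = torch.optim.Adam(model.parameters())

x = torch.randn(1, T, f)       # one training element
y = torch.randn(1, T, o)       # all T labels, not just the final one

pred = model(x)
loss = nn.functional.mse_loss(pred, y)  # loss over the whole label sequence
loss.backward()                         # backpropagate through the whole network
optimizer.step()
```

Since pred and y are both shaped [1, T, o] here, no .squeeze(-1) is needed; it would only matter if the labels were stored as [1, T].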

Thanks a lot! Now I totally get your idea. Still, I think there is one point on which I have not been very clear.

What I want is to predict only the last label. So, let’s say I have T timesteps: I want to use the first T-1 labels to help predict the T-th label. However, here, I think what we do is try to predict every label.
What I want is to use the f features of the T days plus the T-1 labels in order to predict the T-th price.

Blockquote
I want to use T-1 labels to help predict the Tth label

In theory, computing the loss over all of the labels during training should help the model predict the final label at inference time. You can train the model as described above, then, when computing accuracy on the validation set, only consider the final label and the final output of the LSTM.
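A sketch of that evaluation step, assuming a trained LSTM and Linear layer (freshly initialised here just for illustration):

```python
import torch
import torch.nn as nn

T, f, h = 3, 3, 16
lstm = nn.LSTM(f, h, batch_first=True)   # stand-ins for your trained layers
linear = nn.Linear(h, 1)

x_valid = torch.randn(8, T, f)           # a small validation batch (illustrative)
y_valid = torch.randn(8)                 # only the final-day price matters here

with torch.no_grad():                    # no gradients needed at evaluation time
    out, _ = lstm(x_valid)               # (8, T, h)
    pred = linear(out)[:, -1, 0]         # keep only the output at the last timestep
    final_mse = nn.functional.mse_loss(pred, y_valid)
```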

Alternatively, you could append the t-1th label to the feature vector for xt, and compute the loss for the final LSTM state (or the mean over all LSTM states), passed through the linear layer, with yT. This would mean at t=0 you wouldn’t have a label yt-1, here you could append a 0 (sort of “padding”), or you could simply ignore this input.