Multi-Step time series LSTM Network


I am having issues with the LSTM module in PyTorch. I am using an LSTM neural network to forecast a certain value. The input is multidimensional (multiple features) and the output should be one-dimensional (only one feature needs to be forecast). I want to forecast 1-6 timesteps ahead, and I want to use multiple timesteps as input as well. I have two different ways of achieving this in mind, but neither of them seems to work.

The first one is to build 6 different many-to-one networks, each forecasting a different timestep 1-6h in advance, but still using the recurrent structure (see picture) when forecasting multiple time steps ahead (so not just shifting the target data by 1-6 hours). Another option would be a many-to-many network, which I think would work fine as well (also see picture).

My input currently looks like this: the different features at the same timestep are grouped first, and then all the timesteps used for the forecast (the look-back window) are grouped together.

tensor([[[-0.2800, -0.6381, -0.1033, -0.4941,  0.0016],
     [-0.3159,  0.1378, -0.1010, -0.4529,  0.0016],
     [-0.2800,  0.1378, -0.0963, -0.4706,  0.1150],
     [-0.5673, -0.2149, -0.0598, -0.4000,  0.2850],
     [-0.3518, -0.4265, -0.0669, -0.3646,  0.3417],
     [-0.2440, -0.0738, -0.0657, -0.3823,  0.2283]],

    [[-0.3159,  0.1378, -0.1010, -0.4529,  0.0016],
     [-0.2800,  0.1378, -0.0963, -0.4706,  0.1150],
     [-0.7469,  0.1731, -0.0845, -0.4176,  0.3417],
     [-0.3518, -0.4265, -0.0669, -0.3646,  0.3417],
     [-0.2440, -0.0738, -0.0657, -0.3823,  0.2283],
     [-0.1722, -0.5323, -0.0610, -0.4117,  0.2283]],

    [[-0.2800,  0.1378, -0.0963, -0.4706,  0.1150],
     [-0.7469,  0.1731, -0.0845, -0.4176,  0.3417],
     [-0.7829, -0.4265, -0.0692, -0.4176,  0.4550],
     [-0.2440, -0.0738, -0.0657, -0.3823,  0.2283],
     [-0.1722, -0.5323, -0.0610, -0.4117,  0.2283],
     [-0.1363, -0.8850, -0.0669, -0.4294,  0.1150]],


    [[-0.3518,  0.2083, -0.1386,  0.8479, -0.1684],
     [-0.3518,  0.4552, -0.1398,  0.9480,  0.0016],
     [-0.2800, -0.4265, -0.1398,  0.9126,  0.0583],
     [-1.0343, -0.1443, -0.1433,  0.8479,  0.0016],
     [-0.8906,  0.3847, -0.1445,  1.0304, -0.2251],
     [-0.7829, -0.0385, -0.1433,  1.0127, -0.1117]],

    [[-0.3518,  0.4552, -0.1398,  0.9480,  0.0016],
     [-0.2800, -0.4265, -0.1398,  0.9126,  0.0583],
     [-0.4596, -0.9202, -0.1410,  0.8479,  0.1150],
     [-0.8906,  0.3847, -0.1445,  1.0304, -0.2251],
     [-0.7829, -0.0385, -0.1433,  1.0127, -0.1117],
     [-0.8547,  0.2436, -0.1422,  0.9715, -0.0550]],

    [[-0.2800, -0.4265, -0.1398,  0.9126,  0.0583],
     [-0.4596, -0.9202, -0.1410,  0.8479,  0.1150],
     [-0.6392, -0.5323, -0.1422,  0.8655,  0.0016],
     [-0.7829, -0.0385, -0.1433,  1.0127, -0.1117],
     [-0.8547,  0.2436, -0.1422,  0.9715, -0.0550],
     [-0.9984, -0.0033, -0.1422,  0.8597,  0.0583]]])
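For reference, windows of this kind can be cut out of a longer multivariate series roughly as follows. This is only a sketch: the helper name make_windows, its look_back and horizon parameters, and the assumption that the target is a separate 1-D series aligned with the features are all made up for illustration and are not taken from the data above.

```python
import torch

def make_windows(features, target, look_back=6, horizon=6):
    """Cut a (T, num_features) series into look-back windows and future targets.

    Returns X with shape (N, look_back, num_features) and
    y with shape (N, horizon), where N = T - look_back - horizon + 1.
    """
    xs, ys = [], []
    for start in range(features.shape[0] - look_back - horizon + 1):
        end = start + look_back
        xs.append(features[start:end])        # past `look_back` steps, all features
        ys.append(target[end:end + horizon])  # next `horizon` values of the target
    return torch.stack(xs), torch.stack(ys)

# Toy series: 20 time steps, 5 features, plus a 1-D target
features = torch.randn(20, 5)
target = torch.arange(20, dtype=torch.float)
X, y = make_windows(features, target)
print(X.shape, y.shape)  # torch.Size([9, 6, 5]) torch.Size([9, 6])
```

Each row of y then holds the 6 future values that belong to the window at the same index in X.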

And the output looks like this, with the timesteps 1-6h ahead grouped together. I can change this easily.

tensor([[[ -7.],
[ -9.],

    [[ -9.],

     [ -9.]],


     [ -9.]],

     [ -9.],
     [ -8.]],

     [ -9.],
     [ -8.],

Now I have no idea how to use the LSTM structure to do multi timesteps forecasting. The output layer should be linear.

The batch size does not really matter to me, I think it can be one for now.

Probably not a very useful answer, but at least some ideas:

You could look into sequence-to-sequence / encoder-decoder models, which are essentially many-to-many solutions, most commonly used in machine translation. Both input and output are sequences, and since your input and output sequences have the same length, you can easily use batches.

Alternatively, you can reorganize your dataset to allow for a (usually) simpler many-to-one model. It’s not clear if your input sequences are of a fixed length, but let’s say that for a single data item the input is [A, B, C, D, E, F] and the output sequence is [1, 2, 3].

Option A: The input sequences keep the fixed length:

 [A, B, C, D, E, F], [1]
 [B, C, D, E, F, 1], [2]
 [C, D, E, F, 1, 2], [3]

The prediction is then done step by step: in each step, you append the last prediction as the last element of the input sequence and kick out the first element.
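A minimal sketch of that loop, assuming a one-dimensional series and some already trained one-step model. Both recursive_forecast and the stand-in predict_next function are hypothetical names used only for illustration:

```python
def recursive_forecast(window, predict_next, steps):
    """Forecast `steps` values by feeding each prediction back into the window."""
    window = list(window)
    predictions = []
    for _ in range(steps):
        y = predict_next(window)   # one-step-ahead prediction
        predictions.append(y)
        window = window[1:] + [y]  # drop the oldest element, append the prediction
    return predictions

# Stand-in for a trained many-to-one model: predicts "last value + 1"
toy_model = lambda w: w[-1] + 1
print(recursive_forecast([1, 2, 3], toy_model, steps=3))  # [4, 5, 6]
```

Keep in mind that prediction errors feed back into the input this way, so they can accumulate over longer horizons.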

Option B: You create sequences of variable length. For training, you might want to make sure that each batch contains only sequences of the same length.

 [A, B, C, D, E, F], [1]
 [A, B, C, D, E, F, 1], [2]
 [A, B, C, D, E, F, 1, 2], [3]

or use padding:

 [A, B, C, D, E, F, 0, 0], [1]
 [A, B, C, D, E, F, 1, 0], [2]
 [A, B, C, D, E, F, 1, 2], [3]
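The padded variant of Option B could be generated along these lines. It is a sketch only; build_padded_pairs is a made-up helper, and using 0 as the padding value assumes that 0 does not carry meaning in the real data:

```python
def build_padded_pairs(inputs, outputs, pad_value=0):
    """Build (padded input sequence, [target]) pairs for each output step."""
    max_len = len(inputs) + len(outputs) - 1
    pairs = []
    for i, target in enumerate(outputs):
        seq = list(inputs) + list(outputs[:i])     # input plus the already known outputs
        seq += [pad_value] * (max_len - len(seq))  # right-pad to a fixed length
        pairs.append((seq, [target]))
    return pairs

for seq, target in build_padded_pairs(["A", "B", "C", "D", "E", "F"], [1, 2, 3]):
    print(seq, target)
```

This prints the same three rows as listed above, each with length 8.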

Sorry if it was not completely clear.

The input is multidimensional (5) so I can not directly use your options, as A-F are five dimensional data and 1-3 are one dimensional. The input consists of not only the thing we want to forecast but multiple features.

Sorry, I see your point now! Technically, you could represent all data as a 6-dimensional vector, where A-F have zeros at the 6th dimension and 1-3 have zeros at positions 1-5. But, yeah, that’s probably too far fetched :).

Then you can probably still try to treat your problem as a sequence-to-sequence task. While 6 many-to-one networks will work in principle, you obviously lose the dependencies between the outputs; I assume that the 6h output depends not only on the input but also on the outputs for hours 1-5.

Yes the 6th output will preferably also depend on the output for 1-5 hours. If not I could just shift my output target data 6 hours, but then as far as I know, I am not using the strength of the recurrent structure.

Would you happen to have an example of a Seq2Seq LSTM in PyTorch? I am having some trouble trying to implement it, and I do not completely understand how to adapt the example here to something more relevant for me.

Well, I’ve tried to come up with some kind of minimal example. I took some code of mine, which is itself essentially derived from the Seq2Seq tutorial you’ve linked, and adjusted it to your use case. While it should run “as is”, I do not give any guarantees that it’s correct! Without any training data I cannot test whether the loss decreases and the predictions get better over time.

I actually never used LSTMs for regression, so I’m not sure I’ve done it properly. I’ve commented the code a bit so it might give at least some pointers what’s going on.

import torch
import torch.nn as nn
import torch.optim as optim

import numpy as np

class Encoder(nn.Module):

    def __init__(self, input_size, hidden_dim, num_layers=1):
        super(Encoder, self).__init__()

        self.input_size = input_size
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        self.lstm = nn.LSTM(self.input_size, self.hidden_dim, num_layers=self.num_layers)
        self.hidden = None

    def init_hidden(self, batch_size):
        return (torch.zeros(self.num_layers, batch_size, self.hidden_dim),
                torch.zeros(self.num_layers, batch_size, self.hidden_dim))

    def forward(self, inputs):
        # Push through RNN layer (the output is irrelevant)
        _, self.hidden = self.lstm(inputs, self.hidden)
        return self.hidden

class Decoder(nn.Module):

    def __init__(self, hidden_dim, num_layers=1):
        super(Decoder, self).__init__()
        # input_size=1 since the output are single values
        self.lstm = nn.LSTM(1, hidden_dim, num_layers=num_layers)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, outputs, hidden, criterion):
        batch_size, num_steps = outputs.shape
        # Create initial start value/token
        input = torch.tensor([[0.0]] * batch_size, dtype=torch.float)
        # Convert (batch_size, output_size) to (seq_len, batch_size, output_size)
        input = input.unsqueeze(0)

        loss = 0
        for i in range(num_steps):
            # Push current input through LSTM: (seq_len=1, batch_size, input_size=1)
            output, hidden = self.lstm(input, hidden)
            # Push the output of last step through linear layer; returns (batch_size, 1)
            output = self.out(output[-1])
            # Generate input for next step by adding seq_len dimension (see above)
            input = output.unsqueeze(0)
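            # Compute loss between predicted value and true value; the target is
            # reshaped to (batch_size, 1) so it matches the prediction without broadcasting
            loss += criterion(output, outputs[:, i].unsqueeze(1))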
        return loss

if __name__ == '__main__':

    # 5 is the number of features of your data points
    encoder = Encoder(5, 128)
    decoder = Decoder(128)
    # Create optimizers for encoder and decoder
    encoder_optimizer = optim.Adam(encoder.parameters(), lr=0.001)
    decoder_optimizer = optim.Adam(decoder.parameters(), lr=0.001)
    criterion = nn.MSELoss()

    # Some toy data: 2 sequences of length 10 with 5 features for each data point
    inputs = [
        [
            [0.5, 0.2, 0.3, 0.4, 0.1],
            [0.5, 0.2, 0.3, 0.4, 0.1],
            [0.5, 0.2, 0.3, 0.4, 0.1],
            [0.5, 0.2, 0.3, 0.4, 0.1],
            [0.5, 0.2, 0.3, 0.4, 0.1],
            [0.5, 0.2, 0.3, 0.4, 0.1],
            [0.5, 0.2, 0.3, 0.4, 0.1],
            [0.5, 0.2, 0.3, 0.4, 0.1],
            [0.5, 0.2, 0.3, 0.4, 0.1],
            [0.5, 0.2, 0.3, 0.4, 0.1],
        ],
        [
            [0.5, 0.2, 0.3, 0.4, 0.1],
            [0.5, 0.2, 0.3, 0.4, 0.1],
            [0.5, 0.2, 0.3, 0.4, 0.1],
            [0.5, 0.2, 0.3, 0.4, 0.1],
            [0.5, 0.2, 0.3, 0.4, 0.1],
            [0.5, 0.2, 0.3, 0.4, 0.1],
            [0.5, 0.2, 0.3, 0.4, 0.1],
            [0.5, 0.2, 0.3, 0.4, 0.1],
            [0.5, 0.2, 0.3, 0.4, 0.1],
            [0.5, 0.2, 0.3, 0.4, 0.1],
        ],
    ]

    inputs = torch.tensor(np.array(inputs), dtype=torch.float)
    # Convert (batch_size, seq_len, input_size) to (seq_len, batch_size, input_size)
    inputs = inputs.transpose(1,0)

    # 2 sequences (to match the batch size) of length 6 (for the 6h into the future)
    outputs = [ [0.1, 0.2, 0.3, 0.1, 0.2, 0.3], [0.3, 0.2, 0.1, 0.3, 0.2, 0.1] ]
    outputs = torch.tensor(np.array(outputs), dtype=torch.float)

    # Do one complete forward & backward pass
    # Zero gradients of both optimizers
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()
    # Reset hidden state of encoder for current batch
    encoder.hidden = encoder.init_hidden(inputs.shape[1])
    # Do forward pass through encoder
    hidden = encoder(inputs)
    # Do forward pass through decoder (decoder gets hidden state from encoder)
    loss = decoder(outputs, hidden, criterion)
    # Backpropagation
    loss.backward()
    # Update parameters
    encoder_optimizer.step()
    decoder_optimizer.step()
    print("Loss:", loss.item())
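The decoder above only returns the loss. To actually get forecasts out of it, the same loop can be run without targets. The following is a sketch under assumptions: rollout is a hypothetical helper, and the lstm, linear, and hidden objects below are small stand-ins for decoder.lstm, decoder.out, and the encoder's final state.

```python
import torch
import torch.nn as nn

def rollout(lstm, linear, hidden, num_steps, start_value=0.0):
    """Generate num_steps predictions, feeding each one back as the next input."""
    batch_size = hidden[0].shape[1]
    # Start value with shape (seq_len=1, batch_size, input_size=1)
    step_input = torch.full((1, batch_size, 1), start_value)
    predictions = []
    with torch.no_grad():
        for _ in range(num_steps):
            output, hidden = lstm(step_input, hidden)
            prediction = linear(output[-1])       # (batch_size, 1)
            predictions.append(prediction)
            step_input = prediction.unsqueeze(0)  # back to (1, batch_size, 1)
    return torch.cat(predictions, dim=1)          # (batch_size, num_steps)

# Stand-ins for decoder.lstm, decoder.out and the encoder's final hidden state
lstm = nn.LSTM(1, 16)
linear = nn.Linear(16, 1)
hidden = (torch.zeros(1, 2, 16), torch.zeros(1, 2, 16))
forecast = rollout(lstm, linear, hidden, num_steps=6)
print(forecast.shape)  # torch.Size([2, 6])
```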

Is this step (inputs = inputs.transpose(1,0)) necessary? Does the network have to read the inputs like this? Can you also feed the network an input of shape (batch_size, seq_len, input_size)?

And why do we do this step (encoder.hidden = encoder.init_hidden(...))? Will this not erase any sort of training we’ve done?

It’s possible, but then you have to tell PyTorch so by using batch_first=True here:

    self.lstm = nn.LSTM(..., batch_first=True)

However, depending where you do it (the encoder, decoder, or both) you need to make other changes to the code. For example, when the decoder LSTM takes batch_first=True you probably have to change the following:

#input = input.unsqueeze(0)
input = input.unsqueeze(1)

#output = self.out(output[-1])
output = self.out(output[:, -1, :])

…and maybe other stuff. I haven’t tested it, but batch_first=True also changes the shape of the output of the LSTM, which naturally affects how to handle it in subsequent steps.
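The effect on the shapes can be checked with a quick sketch like this one (the sizes are arbitrary toy values):

```python
import torch
import torch.nn as nn

batch_size, seq_len, input_size, hidden_dim = 2, 10, 5, 8

seq_first = nn.LSTM(input_size, hidden_dim)                      # default layout
batch_first = nn.LSTM(input_size, hidden_dim, batch_first=True)

out_a, (h_a, c_a) = seq_first(torch.randn(seq_len, batch_size, input_size))
out_b, (h_b, c_b) = batch_first(torch.randn(batch_size, seq_len, input_size))

print(out_a.shape)  # torch.Size([10, 2, 8]) -> (seq_len, batch_size, hidden_dim)
print(out_b.shape)  # torch.Size([2, 10, 8]) -> (batch_size, seq_len, hidden_dim)
# The hidden state keeps the layout (num_layers, batch_size, hidden_dim) in both cases
print(h_a.shape, h_b.shape)  # torch.Size([1, 2, 8]) torch.Size([1, 2, 8])
```

Note that batch_first=True only affects the input and output tensors; the hidden and cell states are not transposed.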

You might want to have a look at these posts: 1, 2, 3

Oh yes, I see now. The hidden state is not the weights but a value that is calculated when we run the neural network. Thank you very much for all the help!
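A small check that illustrates the difference, as a sketch with arbitrary toy sizes: passing a fresh zero state into an LSTM leaves its learned parameters untouched.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(5, 8)
weights_before = [p.detach().clone() for p in lstm.parameters()]

x = torch.randn(10, 2, 5)                                  # (seq_len, batch_size, input_size)
zero_state = (torch.zeros(1, 2, 8), torch.zeros(1, 2, 8))  # fresh hidden state
_, (h_n, c_n) = lstm(x, zero_state)

# Resetting or passing a hidden state does not modify the learned parameters
unchanged = all(torch.equal(a, b) for a, b in zip(weights_before, lstm.parameters()))
print(unchanged)  # True
```

The state (h_n, c_n) is recomputed from the inputs on every forward pass, so resetting it per batch only prevents information from one batch leaking into the next.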

I have implemented it for my problem and I got this training history (the x axis runs through each batch of each epoch, and the loss is plotted on the y axis). Now I will run a grid search over the hyperparameters. Again, thanks for all the help.


Happy to help. It’s a good way to learn more myself. Happy coding!

Is this comment correct? As I saw it, the input size is actually the thing we defined as the hidden size, so the number of input features of the decoder would be equal to the number of features the decoder outputs for one time step.

input_size has nothing to do with hidden_size.

hidden_size specifies the dimension of the internal hidden states of the RNN as defined when doing self.lstm = nn.LSTM(input_size, hidden_size, ...). This can be any value, independent of input_size.

input_size is the number of features for each element in the sequence. In your case, for the encoder input_size=6 and for the decoder input_size=1. The decoder works by taking the output of the previous step as the input for the current step. And since the output of your decoder is just 1-dimensional, its input_size must be as well.

In classic machine translation where both input and target words are represented as vectors with the same dimension (e.g.,), then yes, encoder and decoder have the same input_size. In your case, however, encoder and decoder handle different types of inputs and targets: 6-dim for the encoder, 1-dim for the decoder.

This also means that the current input of the decoder has always the shape (1, batch_size, 1), which I was referring to in that comment.
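As a rough illustration, two LSTMs with different input_size can share a hidden state as long as hidden_size and the number of layers match. The sizes below (6 features, sequence length 50, batch size 2) are just example values:

```python
import torch
import torch.nn as nn

hidden_dim = 128
encoder_lstm = nn.LSTM(input_size=6, hidden_size=hidden_dim)
decoder_lstm = nn.LSTM(input_size=1, hidden_size=hidden_dim)

# Encoder reads a whole sequence: (seq_len=50, batch_size=2, input_size=6)
_, hidden = encoder_lstm(torch.randn(50, 2, 6))

# Decoder reads one value per step: (seq_len=1, batch_size=2, input_size=1)
step_input = torch.zeros(1, 2, 1)
output, hidden = decoder_lstm(step_input, hidden)

print(hidden[0].shape)  # torch.Size([1, 2, 128])
print(output.shape)     # torch.Size([1, 2, 128])
```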

If I understand correctly, then all the information of the sequence length * input size, so in my case 50 data points or so, is stored into just 6 numbers in the encoder? And the decoder takes only the last number to get information from?

Because I thought at first that what the decoder takes from the encoder would be multidimensional, so that there is less loss of information.

Best check this very useful diagram:

  • Since you have sequence of length 50, you have x_1, x_2, ..., x_50; each x_i is one time step
  • For your encoder, each x_i is 6-dimensional; for your decoder, each x_i is 1-dimensional
  • depth reflects num_layers in the code
  • The hidden state of an LSTM is a tuple (h_i, c_i), both have the same shape
  • The decoder gets the complete final hidden state (h_n, c_n) from the encoder (n=50 for you)
  • The shapes of h_i and c_i are defined by hidden_dim

To directly answer your questions:

all the information of the sequence length * input size, so in my case 50 data points or so, is stored into just 6 numbers in the encoder?

No, all the information of the input (seq_len*input_size) is stored in the last hidden state (h_n, c_n), whose shape is independent of input_size. It depends on hidden_dim, num_layers, num_directions, and batch_size.
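This can be verified with a short sketch; final_state_shape is a made-up helper and the sizes are arbitrary:

```python
import torch
import torch.nn as nn

def final_state_shape(seq_len, input_size, hidden_dim, num_layers, bidirectional, batch_size):
    """Return the shape of h_n (c_n has the same shape) for a random input."""
    lstm = nn.LSTM(input_size, hidden_dim, num_layers=num_layers, bidirectional=bidirectional)
    _, (h_n, c_n) = lstm(torch.randn(seq_len, batch_size, input_size))
    assert h_n.shape == c_n.shape
    return tuple(h_n.shape)

# Changing seq_len and input_size does not change the state shape
print(final_state_shape(50, 6, 128, 1, False, 4))  # (1, 4, 128)
print(final_state_shape(10, 3, 128, 1, False, 4))  # (1, 4, 128)
# Changing hidden_dim, num_layers and the number of directions does
print(final_state_shape(50, 6, 64, 2, True, 4))    # (4, 4, 64)
```

The first dimension is num_layers * num_directions, the second is batch_size, and the third is hidden_dim.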

And the decoder takes only the last number to get information from?

No, the decoder takes (h_n, c_n) as its first hidden state. So yes, the hidden state the decoder gets is multidimensional.

Because I thought at first that what the decoder takes from the encoder would be multidimensional, so that there is less loss of information.

Again, I think you confuse the hidden state with the input. Your decoder input/output is 1-dimensional (just a number). In the line output, hidden = self.lstm(input, hidden) of the decoder:

  • input.shape = (seq_len=1, batch_size, input_size=1)
  • hidden[0].shape = hidden[1].shape = (num_layers=1, batch_size, hidden_dim)

For the encoder it’s the same, only with input_size=6 instead of 1 and the whole sequence as seq_len.

Wow thank you, I think I understand it. Just to be sure, is this image then a correct representation? The opaque stuff is what is not given (input) or is not interesting anymore (output).

And so the number of LSTM layers and the layer size should be the same in encoder and decoder.

Yup, that’s pretty much what the code is doing – just to add:

  • The encoder gets the whole sequence at once, while the decoder generates tokens/values time step by time step.
  • The decoder also gets some inputs x_i but x_1 is some fixed start value (0.0 in the code but can technically be anything), and x_(i+1) = output_i – that is, the input for time step i+1 is the last prediction of the decoder at time step i. That’s the input = output.unsqueeze(0) line. So strictly speaking the input for the decoder shouldn’t be opaque, but I get your point.
  • An extension of this model would be to use “teacher forcing” in the decoder, where for some examples in the data, not the predicted output is used as next input but the true values. You can read up on this in the PyTorch Seq2Seq tutorial, but it’s not important right now.
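A rough sketch of what teacher forcing could look like for this kind of decoder loop. The function name, the teacher_forcing_ratio parameter, and the choice to decide per time step for the whole batch are assumptions made for illustration; other variants (for example deciding once per sequence) are possible.

```python
import random
import torch
import torch.nn as nn

def decode_with_teacher_forcing(lstm, linear, hidden, targets, criterion,
                                teacher_forcing_ratio=0.5):
    """Decoder loop that sometimes feeds the true value instead of the prediction."""
    batch_size, num_steps = targets.shape
    step_input = torch.zeros(1, batch_size, 1)  # start value, (1, batch_size, 1)
    loss = 0.0
    for i in range(num_steps):
        output, hidden = lstm(step_input, hidden)
        prediction = linear(output[-1])         # (batch_size, 1)
        truth = targets[:, i].unsqueeze(1)      # (batch_size, 1)
        loss = loss + criterion(prediction, truth)
        # Next input: ground truth (teacher forcing) or the model's own prediction
        use_truth = random.random() < teacher_forcing_ratio
        step_input = (truth if use_truth else prediction).unsqueeze(0)
    return loss

# Small stand-ins for the decoder layers and the encoder's final hidden state
lstm, linear = nn.LSTM(1, 16), nn.Linear(16, 1)
hidden = (torch.zeros(1, 2, 16), torch.zeros(1, 2, 16))
targets = torch.tensor([[0.1, 0.2, 0.3], [0.3, 0.2, 0.1]])
loss = decode_with_teacher_forcing(lstm, linear, hidden, targets, nn.MSELoss())
loss.backward()
print(loss.item() >= 0)  # True
```

With teacher_forcing_ratio=0.0 this reduces to the plain loop where every input is the previous prediction.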

This figure makes it very clear how the decoder works by taking the last output as the next input, starting with a default token <START> (which is the 0.0 in the code, since we deal with single values).