Multi-Step Time Series LSTM Network

Would you happen to have an example of a Seq2Seq LSTM in PyTorch? I'm having a bit of trouble implementing it, and I don't completely understand how to adapt the example here https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html to something more relevant for me.

Well, I've tried to come up with some kind of minimal example. I took some code of mine (which is itself derived from the Seq2Seq tutorial you've linked) and adjusted it to your use case. While it should run "as is", I don't give any guarantees that it's correct! Without any training data I cannot test whether the loss decreases and the predictions get better over time.

I've actually never used LSTMs for regression, so I'm not sure I've done it properly. I've commented the code a bit, so it might at least give some pointers as to what's going on.

import torch
import torch.nn as nn
import torch.optim as optim

import numpy as np

class Encoder(nn.Module):

    def __init__(self, input_size, hidden_dim, num_layers=1):
        super(Encoder, self).__init__()

        self.input_size = input_size
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        self.lstm = nn.LSTM(self.input_size, self.hidden_dim, num_layers=self.num_layers)
        self.hidden = None

    def init_hidden(self, batch_size):
        return (torch.zeros(self.num_layers, batch_size, self.hidden_dim),
                torch.zeros(self.num_layers, batch_size, self.hidden_dim))

    def forward(self, inputs):
        # Push through RNN layer (the output itself is irrelevant; we only need the final hidden state)
        _, self.hidden = self.lstm(inputs, self.hidden)
        return self.hidden


class Decoder(nn.Module):

    def __init__(self, hidden_dim, num_layers=1):
        super(Decoder, self).__init__()
        # input_size=1 since the outputs are single values
        self.lstm = nn.LSTM(1, hidden_dim, num_layers=num_layers)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, outputs, hidden, criterion):
        batch_size, num_steps = outputs.shape
        # Create initial start value/token
        input = torch.tensor([[0.0]] * batch_size, dtype=torch.float)
        # Convert (batch_size, output_size) to (seq_len, batch_size, output_size)
        input = input.unsqueeze(0)

        loss = 0
        for i in range(num_steps):
            # Push current input through LSTM: (seq_len=1, batch_size, input_size=1)
            output, hidden = self.lstm(input, hidden)
            # Push the output of last step through linear layer; returns (batch_size, 1)
            output = self.out(output[-1])
            # Generate input for next step by adding seq_len dimension (see above)
            input = output.unsqueeze(0)
            # Compute loss between predicted value and true value
            # (unsqueeze the target so both sides have shape (batch_size, 1))
            loss += criterion(output, outputs[:, i].unsqueeze(1))
        return loss


if __name__ == '__main__':

    # 5 is the number of features of your data points
    encoder = Encoder(5, 128)
    decoder = Decoder(128)
    # Create optimizers for encoder and decoder
    encoder_optimizer = optim.Adam(encoder.parameters(), lr=0.001)
    decoder_optimizer = optim.Adam(decoder.parameters(), lr=0.001)
    criterion = nn.MSELoss()

    # Some toy data: 2 sequences of length 10 with 5 features for each data point
    inputs = [
        [
            [0.5, 0.2, 0.3, 0.4, 0.1],
            [0.5, 0.2, 0.3, 0.4, 0.1],
            [0.5, 0.2, 0.3, 0.4, 0.1],
            [0.5, 0.2, 0.3, 0.4, 0.1],
            [0.5, 0.2, 0.3, 0.4, 0.1],
            [0.5, 0.2, 0.3, 0.4, 0.1],
            [0.5, 0.2, 0.3, 0.4, 0.1],
            [0.5, 0.2, 0.3, 0.4, 0.1],
            [0.5, 0.2, 0.3, 0.4, 0.1],
            [0.5, 0.2, 0.3, 0.4, 0.1],
        ],
        [
            [0.5, 0.2, 0.3, 0.4, 0.1],
            [0.5, 0.2, 0.3, 0.4, 0.1],
            [0.5, 0.2, 0.3, 0.4, 0.1],
            [0.5, 0.2, 0.3, 0.4, 0.1],
            [0.5, 0.2, 0.3, 0.4, 0.1],
            [0.5, 0.2, 0.3, 0.4, 0.1],
            [0.5, 0.2, 0.3, 0.4, 0.1],
            [0.5, 0.2, 0.3, 0.4, 0.1],
            [0.5, 0.2, 0.3, 0.4, 0.1],
            [0.5, 0.2, 0.3, 0.4, 0.1],
        ]
    ]

    inputs = torch.tensor(np.array(inputs), dtype=torch.float)
    # Convert (batch_size, seq_len, input_size) to (seq_len, batch_size, input_size)
    inputs = inputs.transpose(1,0)

    # 2 sequences (to match the batch size) of length 6 (for the 6h into the future)
    outputs = [ [0.1, 0.2, 0.3, 0.1, 0.2, 0.3], [0.3, 0.2, 0.1, 0.3, 0.2, 0.1] ]
    outputs = torch.tensor(np.array(outputs), dtype=torch.float)

    #
    # Do one complete forward & backward pass
    #
    # Zero gradients of both optimizers
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()
    # Reset hidden state of encoder for current batch
    encoder.hidden = encoder.init_hidden(inputs.shape[1])
    # Do forward pass through encoder
    hidden = encoder(inputs)
    # Do forward pass through decoder (decoder gets hidden state from encoder)
    loss = decoder(outputs, hidden, criterion)
    # Backpropagation
    loss.backward()
    # Update parameters
    encoder_optimizer.step()
    decoder_optimizer.step()
    print("Loss:", loss.item())

Is this step necessary? Does the network have to read the inputs like this? Can you also give your network an input of shape (batch_size, seq_len, input_size)?

And why do we do this step? Will this not erase any training we've done?

It's possible, but then you have to tell PyTorch so by using batch_first=True here:

    self.lstm = nn.LSTM(..., batch_first=True)

However, depending on where you do it (the encoder, decoder, or both), you need to make other changes to the code. For example, when the decoder LSTM takes batch_first=True, you probably have to change the following:

#input = input.unsqueeze(0)
input = input.unsqueeze(1)

#output = self.out(output[-1])
output = self.out(output[:, -1, :])

…and maybe other things. I haven't tested it, but batch_first=True also changes the shape of the output of the LSTM, which naturally affects how you handle it in subsequent steps.

As for your second question: init_hidden only resets the hidden state, i.e. the activations that are carried from one time step to the next, not the learned weights. No training is erased; it just makes sure each new batch starts from a clean state.

You might want to have a look at these posts: 1, 2, 3

Oh yes, I see now. The hidden state is not part of the weights but a value that is computed while running the network. Thank you very much for all the help!

I have implemented it for my problem and got this training history (the x axis goes through each batch of each epoch, and the y axis plots the loss). Now I will do a grid search over the hyperparameters. Again, thanks for all the help.

[Image: training loss plot]

Happy to help. It’s a good way to learn more myself. Happy coding!

Is this comment correct? Because the way I saw it, the input size is actually the thing we defined as the hidden size, so that the number of input features of the decoder equals the number of features of the decoder's one-step output.

input_size has nothing to do with hidden_size.

hidden_size specifies the dimension of the internal hidden states of the RNN, as defined when doing self.lstm = nn.LSTM(input_size, hidden_size, ...). This can be any value, independent of input_size.

input_size is the number of features for each element in the sequence. In your case, input_size=6 for the encoder and input_size=1 for the decoder. The decoder works by taking its output from the previous step as input for the current step. And since the output of your decoder is just 1-dimensional, its input_size must be as well.

In classic machine translation, where both input and target words are represented as vectors of the same dimension (e.g., word embeddings of the same size), then yes, encoder and decoder have the same input_size. In your case, however, encoder and decoder handle different types of inputs and targets: 6-dim for the encoder, 1-dim for the decoder.

This also means that the current input of the decoder always has the shape (1, batch_size, 1), which is what I was referring to in that comment.

If I understand correctly, then: all the information of the sequence length * input size, so in my case 50 data points or so, is stored in just 6 numbers in the encoder? And the decoder takes only the last number to get information from?

Because I thought at first that what the decoder takes from the encoder would be multidimensional, so that there is less loss of information.

Best check this very useful diagram:

  • Since you have a sequence of length 50, you have x_1, x_2, ..., x_50; each x_i is one time step
  • For your encoder, each x_i is 6-dimensional; for your decoder, each x_i is 1-dimensional
  • The depth reflects num_layers in the code
  • The hidden state of an LSTM is a tuple (h_i, c_i); both have the same shape
  • The decoder gets the complete final hidden state (h_n, c_n) from the encoder (n=50 for you)
  • The shapes of h_i and c_i are defined by hidden_dim

To directly answer your questions:

all the information of the sequence length * input size, so in my case 50 data points or so, is stored in just 6 numbers in the encoder?

No, all the information of the input (seq_len * input_size) is stored in the last hidden state (h_n, c_n), which is independent of input_size. Its shape depends on hidden_dim, num_layers, num_directions, and batch_size.

And the decoder takes only the last number to get information from?

No, the decoder takes (h_n, c_n) as its first hidden state. So yes, the hidden state the decoder gets is multidimensional.

Because I thought at first that what the decoder takes from the encoder would be multidimensional, so that there is less loss of information.

Again, I think you're confusing the hidden state with the input. Your decoder input/output is 1-dimensional (just a number). In the line output, hidden = self.lstm(input, hidden) of the decoder:

  • input.shape = (seq_len=1, batch_size, input_size=1)
  • hidden[0].shape = hidden[1].shape = (num_layers=1, batch_size, hidden_dim)

For the encoder it's the same, only with input_size=6 instead of 1.

Wow, thank you, I think I understand it now. Just to be sure, is this image a correct representation? The opaque parts are what is not given (input) or no longer interesting (output).

And so the number of LSTM layers and the hidden size should be the same in encoder and decoder.

Yup, that's pretty much what the code is doing; just to add:

  • The encoder gets the whole sequence at once, while the decoder generates tokens/values time step by time step.
  • The decoder also gets some inputs x_i, but x_1 is some fixed start value (0.0 in the code, though it can technically be anything), and x_(i+1) = output_i; that is, the input for time step i+1 is the decoder's prediction at time step i. That's the input = output.unsqueeze(0) line. So strictly speaking the input for the decoder shouldn't be opaque, but I get your point.
  • An extension of this model would be to use "teacher forcing" in the decoder, where for some examples in the data the true values are used as the next input instead of the predicted output (a rough sketch follows below). You can read up on this in the PyTorch Seq2Seq tutorial, but it's not important right now.

This figure makes it very clear how the decoder works by taking the last output as the next input, starting with a default token <START> (which is the 0.0 in the code, since we deal with single values).

Ok thanks, I understand. I changed my code around a bit, and now the initial input token is the last value of the known input data, which should be a bit better than zero.

Hi all,
I was using this post as a guide to do the exact same thing (I think). I was just confused about the batches.
I have a large panel dataset of approximately 670 features over more than 1800 days. My idea was to build a model that encodes sequences of these features (of length 45) and decodes another series of length 7.
It is something as follows: I take X.iloc[n:n+45, :] (subsets of all features for 45 periods, for n=0, 1, …) and y[n+45:n+45+7] (a sequence of length 7 that starts just after the features end).
So a training pair for my model should be (X.iloc[n:n+45, :], y[n+45:n+45+7]) for some n, as sketched below.

I understood that in the example shown I should input these as batches (if n=0, 1, …, N, I would have N+1 batches) and train with all the training pairs at every iteration. Am I correct?
Because I have seen examples applied to NLP where there are no batches (as in https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html); instead, every iteration of the training is done with a different training pair.

What is the difference between these two approaches? Did I get it right? Thanks for the help.

Hi, I am also trying to do the same thing. So in the end, did you lag your features by 6 hours?
In Chris's example implementation, the first 10 hours of feature inputs are matched and trained against the next 6 hours of outputs. Thanks for the help!

Hi, I’m using pretty much the exact same code, but I’m interested in the output values as well, so I’ve extracted the evaluation process into the train() method.

import torch
import torch.nn as nn

# 'device' is used below but was not defined in this snippet; assuming the usual setup:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

class Encoder(nn.Module):

    def __init__(self, input_size, hidden_dim, num_layers=1):
        super(Encoder, self).__init__()
        print('Initializing Encoder...')


        self.input_size = input_size
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        # self.lstm = nn.LSTM(self.input_size, self.hidden_dim, num_layers=self.num_layers, batch_first=True)
        self.lstm = nn.LSTM(input_size=input_size, 
                          hidden_size=self.hidden_dim, 
                          num_layers=self.num_layers,
                          batch_first=True)  # Note that "batch_first" is set to "True"
        self.hidden = None

    def init_hidden(self, batch_size):
        return (torch.zeros(self.num_layers, batch_size, self.hidden_dim).to(device),
                torch.zeros(self.num_layers, batch_size, self.hidden_dim).to(device))

    def forward(self, inputs):
        # Push through RNN layer (the output is irrelevant)
        _, self.hidden = self.lstm(inputs, self.hidden)
        return self.hidden

class Decoder(nn.Module):

    def __init__(self, hidden_dim, num_layers=1):
        super(Decoder, self).__init__()
        print('Initializing Decoder...')
        # input_size=1 since the output are single values
        self.lstm = nn.LSTM(1, hidden_dim, num_layers=num_layers, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, decoder_input, outputs, hidden):
        batch_size, num_steps = outputs.shape
        input = decoder_input.view(batch_size, 1)
        # Convert (batch_size, 1) to (batch_size, seq_len=1, input_size=1),
        # since this decoder uses batch_first=True
        input = input.unsqueeze(2)

        # loss = 0
        x = []
        for i in range(num_steps):
            # Push current input through LSTM: (batch_size, seq_len=1, input_size=1)
            output, hidden = self.lstm(input, hidden)
            # Push the output of last step through linear layer; returns (batch_size, 1)
        #     output = self.out(output[-1])
            # In case of batch first
            output = self.out(output[:, -1, :])
            # Generate input for next step: (batch_size, 1) -> (batch_size, seq_len=1, input_size=1)
            input = output.unsqueeze(2)
            # Compute loss between predicted value and true value
        #     loss += criterion(output.squeeze(0), outputs[:, i])
            x.append(output)
        return torch.cat(x, dim=1)

I also have some previous values, so I don't initialize the decoder with 0 but instead with previous values of the quantity I want to predict.

Now I have a problem and a question.

First, my problem: at some point my output is only NaNs, as are the weights of my output layer in the decoder. What would be possible reasons for that?

My first instinct would be to look at the activation function, but I'm not sure how I should add an activation function in this case.

Hi Chris,

Good day.
I am also trying multi-step forecasting.

I tried your posted example here, but I got the following warning:

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/loss.py:528: UserWarning: Using a target size (torch.Size([2])) that is different to the input size (torch.Size([2, 1])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
return F.mse_loss(input, target, reduction=self.reduction)

May I know how to solve this problem?
Thanks