LSTM time-series prediction

Ok now I’m confused :blush:

Let me backtrack a bit to make sure I understand what you're saying. You're right that I'm reshaping my time series into a bunch of sequences of length q and using each one to predict the "next" observation. My input (and ground-truth) data is organized sequentially (timesteps 1-20 to predict 21, 2-21 to predict 22, etc.). I then run an nn.LSTM model with input of size (20, batch_size, 1) to predict that "next" value. Now, you're saying this is memoryless, and you also mentioned using nn.LSTMCell to create a stateful LSTM in an earlier reply; are these two things related? This is where my confusion lies, because I can't see how my setup is preventing the cell state from being carried from one time step to the next (I guess I just assumed nn.LSTM would automagically do this because… it's an LSTM :smiley: ).

It's not completely memoryless, but the memory only covers the current window: for, say, steps 3-22 it has no recollection of what happened at 1 and 2 to help predict 23, and can only draw on information from 3-22. Have you seen this repo?:

It should help you

I suggest nn.LSTMCell because it's more intuitive: it's unrolled, so the output is ready to be fed in as the next input, and you can explicitly pass the cell state on to the next sequence, whereas with the rolled-up nn.LSTM you only get the final state of that whole rolled-up sequence.

And yes, you would think it automatically works this way, but for lots of other sequence tasks that is not advantageous, e.g. when translating a sentence, the order of words in the previous sentence should have no bearing on how you would translate the current sentence.
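To make the stateless/stateful distinction concrete, here is a minimal sketch (not from the thread; the sizes and tensors are made up) of the difference between letting nn.LSTM start from a fresh zero state on every call versus carrying the returned (h, c) into the next call:

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=1, hidden_size=32)

# Stateless: no initial state is passed, so every window starts from
# zeros -- window 3-22 gets no help from what was seen in window 1-20.
window_a = torch.randn(20, 5, 1)          # (seq_len, batch, features)
out_a, _ = lstm(window_a)

# Stateful: keep the returned (h_n, c_n) and feed it into the next call,
# so information from the previous window carries over.
out_a, state = lstm(window_a)
window_b = torch.randn(20, 5, 1)
out_b, state = lstm(window_b, state)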


I did see the example a few weeks ago, but it's more useful now that I've been learning a bit more. If I replicate this method with the time series I posted in the OP, I get the same results:

The prediction in the sine wave example is done like this:

for i, input_t in enumerate(input.chunk(input.size(1), dim=1)):
    h_t, c_t = self.lstm1(input_t, (h_t, c_t))
    h_t2, c_t2 = self.lstm2(c_t, (h_t2, c_t2))
    outputs += [c_t2]

for i in range(future): # if we should predict the future
    h_t, c_t = self.lstm1(c_t2, (h_t, c_t))
    h_t2, c_t2 = self.lstm2(c_t, (h_t2, c_t2))
    outputs += [c_t2]

I can see the final state c_t2 in the first loop is used as input for the first predicted pairs h_t, c_t in the second loop. The overall logic seems to be: take your last cell state and use it as input along with the second to last cell/hidden states, calculate new cell/hidden states, use these as inputs, repeat.

Why use h_t, c_t in the first cell instead of h_t2, c_t2 if these are the last cell/hidden states? I’m also confused about why there are two time steps involved. I’d appreciate if someone can break down the logic a bit more.

Ok so in that example you generate 100 sine waves, each consisting of 1000 time steps.

To train, he takes 97 of these sine waves and uses the first 999 steps as input, training on the loss between output and target. (The key here is that he predicts 999 outputs, t2 through t1000, not just the final t1000 output.)

As he finishes each epoch, he tests on the 3 sine waves left over, again predicting 999 points, but he then also uses the last output c_t2 as input to the future loop to make the next prediction. Because the first loop also produced his latest (h_t, c_t) and (h_t2, c_t2), he has everything he needs to propagate to the next step, and he repeats this for the next 1000 steps.

“Why use h_t, c_t in the first cell instead of h_t2, c_t2 if these are the last cell/hidden states?”

He is using those for the future predictions; remember that in the final test he first runs the 999-step loop, so those values are still there to use when he then does the 1000 future predictions:
h_t, c_t for lstm1
h_t2, c_t2 for lstm2

Also, I think what can get confusing here is that all of this happens during the forward pass; the whole network is based on manipulating this cell state in temporal order, and it gets more accurate via training with Backpropagation Through Time.


I do have to note (sorry, I never noticed this before) that, as a PyTorch example model, it is quite lacking in the helpful-comments-in-the-code category lol


Hi Alex, I am new to PyTorch and also interested in time-series prediction.
I would like to ask some questions and would appreciate it if you could share your ideas.
In your training, did you reshape your input?
Did you do loss = criterion(outputs, y), and how did you match the dimensions of outputs and y?

Thanks for breaking this down a bit, very helpful!

This was indeed a main source of confusion in my naive approach (I was predicting a single point t+1 from a size w window). Now I understand why you said earlier that the way I was setting up the problem was equivalent to a MLP (because there is no sequence for the LSTM to work with in the output).

Agreed. It’s fine since it’s an example and not a tutorial, but by god does it make things so much more confusing and difficult for beginners. On the upside it forces you to bleed through every single line of code, which is ok I guess if that’s your kink :slight_smile:


Yes I did reshape my input. Check this post where there’s an ongoing discussion that might be helpful to you.

The way you reshape your input is key. I started with a target of only a single timestep, which I believe doesn't work well for an LSTM, as the target itself needs to be a sequence in order for the recurrence to have something to work with.
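For what it's worth, a minimal sketch of what "the target is a sequence" looks like in practice (this mirrors the shifted-by-one indexing used in the sine example; series here is a made-up (batch, seq_len) tensor):

input = series[:, :-1]   # steps t1 .. t(n-1)
target = series[:, 1:]   # steps t2 .. tn -- a whole sequence, shifted by one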


Alex, thanks for the reply. I read the post you provided and I am confused about the shape.

“So if you divide a time series of length 10000 into chunks of length 50, your input tensor would be 50 (timesteps) by 200 (batch size) by 1 (features).”
"There are 200 batches in the dataset; each batch is 50x200x1."
To check my understanding: do we then train with only one batch? Because 50 * 200 = 10000, and 10000 is the total number of data points.
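For concreteness, here is a rough sketch of the reshape described in the first quoted sentence (the variable names are made up; whether you then treat this as one big batch or split it further is exactly the point under discussion):

import torch

series = torch.randn(10000)                      # the full univariate time series
seq_len, batch, features = 50, 200, 1

chunks = series.view(batch, seq_len, features)   # 200 chunks of 50 consecutive steps
inp = chunks.permute(1, 0, 2)                    # -> (50, 200, 1): (timesteps, batch, features)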

So to sum up: if you want to train an LSTM statefully,

  • don't shuffle your batches
  • use the cell state and hidden state from one forward pass in the next one to learn long-term dependencies (see the sketch below)
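A minimal sketch of what that looks like with nn.LSTM (chunks_in_order is a placeholder for your consecutive input/target chunks, not code from this thread):

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=1, hidden_size=32)
criterion = nn.MSELoss()
state = None                                 # first chunk starts from a zero state

for chunk, target in chunks_in_order:        # consecutive chunks of the series, NOT shuffled
    out, state = lstm(chunk, state)
    # keep the state's values but cut the graph, otherwise backprop would
    # reach all the way back through every previous chunk
    state = tuple(s.detach() for s in state)
    loss = criterion(out, target)
    loss.backward()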

@chilango and @osm3000 did you try it out? if yes, what did you experience?

I didn’t try this (what you propose is called ‘stateful’ training if I remember correctly), although I expect it will work.
From my side, the conclusions (on generating sine wave) were:

  • Sequence length is a super important factor.
  • Just minimizing the prediction error (during training) doesn't reflect the expected generation quality.

Since I first wrote about these experiments, we have done more statistics on them recently (sorry, I don't have the graphs on this machine right now; I will upload them later). The results were super interesting (for me at least). I repeated this experiment for sequence lengths of 2, 5, 10, 20, 50, and 100 timesteps (the cycle of the sine wave is 120 timesteps). For each sequence length, I trained 20 models, and I used each model to generate a sine wave. I then evaluated each sine wave (by hand) using a score from 0-3: 1 point for generating a good amplitude, 1 for the frequency, and 1 for being centred around 0.

  1. The MSE for 2 & 5 is significantly worse than the rest.
  2. The MSE score for sequence lengths from 10 to 100 is relatively the same.
  3. I computed the linear correlation between sequence length/MSE and the generation quality:
    ~0.33 correlation factor between MSE and the generation quality
    ~0.95 correlation factor between sequence length and the generation quality

In short, I think there is more to this. We still have several hypotheses and experiments to do regarding this issue.


I went with @osm3000's suggestion of discretizing my time series and have been playing around with a ton of models (although I think the discretization is a hack that I'd like to overcome at some point). IIRC I got similar results, in which beyond about 50 timesteps I noticed no difference in accuracy. I suspect, however, that the choice of timesteps (seq_len) should be more substantively driven; e.g. if you have weekly or daily seasonality, you might want to tailor your sequence length to (multiples of) your period.

Another important factor in my experience is the resolution of the grid you use to discretize (if you decide to go with such an intermediary solution).

Haven’t worked on this project for a month or two now, but I remember that playing around with the depth of the network didn’t really have much of an impact beyond 2-3 layers. I’ve read elsewhere that such LSTMs can probably model most time series, but Graves (2014) uses an 8 layer network to model character RNNs.

Because I have very unbalanced classes, I am using weighted learning, which is easily implemented in PyTorch.
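For readers looking for the weighted-loss bit, a minimal sketch using the weight argument of nn.CrossEntropyLoss (the class counts, logits, and labels are placeholders, not values from this thread):

import torch
import torch.nn as nn

counts = torch.tensor([900.0, 80.0, 20.0])   # hypothetical per-class frequencies
weights = counts.sum() / counts              # rarer classes get larger weights

criterion = nn.CrossEntropyLoss(weight=weights)
# logits: (N, num_classes), labels: (N,) with class indices
loss = criterion(logits, labels)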

HTH.

I am a bit surprised that discretization didn't give you better results.
Is it possible for you to share the code?

Oh no, by all means it did :slight_smile: Reframing it as a classification problem makes a lot of sense because that's where the mature models are. What I meant is that eventually I'd like to understand MDNs (GMMs) well enough to implement one in PyTorch myself, so I can move my predictions from discrete to continuous, so any pointers are appreciated.

It may sound strange, but recent research at Google points the same way (take a look at the WaveNet paper, https://arxiv.org/pdf/1609.03499.pdf, section 2.2; also check the PixelRNN paper). I quote here:

However, van den Oord et al. (2016a) showed that a softmax distribution tends to work better, even when the data is implicitly continuous (as is the case for image pixel intensities or audio sample values). One of the reasons is that a categorical distribution is more flexible and can more easily model arbitrary distributions because it makes no assumptions about their shape.

Although these kinds of statements should not be taken for granted (in my humble opinion), the very least I can say, from my personal experience, is that life is much nicer with a softmax than with a GMM. When I implemented Alex Graves' model with a GMM, it was quite unstable during training (it diverged to inf, -inf, or NaN very easily). I tried several methods to control it, without success.


Since you already had success with generating in the discrete domain, I’ve a question for you: How do you evaluate the quality of the generated signals?

That's a good question and one I don't have a final answer for yet. Right now I remap the predicted class to the midpoint of its (continuous) bin boundaries and use that as the prediction, and then simply measure the error against the test set. I think using the actual average might be more accurate, but I haven't tried it yet. I'm curious to see how you do it too.
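In case it helps, here is a rough sketch of that evaluation (the bin edges, ranges, and the y_cont / y_pred_class / y_test arrays are all made up for illustration): np.digitize does the continuous-to-class mapping, and the bin centers map predicted classes back to values.

import numpy as np

edges = np.linspace(-1.0, 1.0, num=65)            # 64 bins over the signal's range
centers = (edges[:-1] + edges[1:]) / 2

# continuous value -> class index (used as classification targets)
y_class = np.clip(np.digitize(y_cont, edges) - 1, 0, len(centers) - 1)

# predicted class -> midpoint of its bin, then compare against the test set
y_hat = centers[y_pred_class]
rmse = np.sqrt(np.mean((y_hat - y_test) ** 2))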

In “A Clockwork RNN”

they use LSTMs for memorizing a time series. They do not discretize the output, they only scale it. But they mention something about "[…] initialize the bias of the forget gates to a high value (5 in this case) to encourage the long-term memory".

Just as a pointer; I'm not doing anything with time series myself, but maybe it helps.
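If you want to try that bias trick with PyTorch's built-in nn.LSTM, here is a minimal sketch (the layer sizes are arbitrary): the gate biases are packed as [input, forget, cell, output] chunks of length hidden_size, so the forget-gate slice is [h:2h].

import torch.nn as nn

lstm = nn.LSTM(input_size=1, hidden_size=51, num_layers=2)

h = lstm.hidden_size
for name, param in lstm.named_parameters():
    if "bias" in name:                  # bias_ih_l* and bias_hh_l*
        param.data[h:2 * h].fill_(5.0)  # high forget-gate bias, as in the quote above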


Thanks for your great post @osm3000.
I’m running into a similar issue where I need to learn multiple random variables which are not independent.

Maybe you could try to first learn P(R1 | h) then sample R1 and learn P(R2 | R1, h)?
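A rough sketch of that factorization (all names and sizes here are made up), where the second head conditions on a sample of R1 as well as the shared hidden state h:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoHeads(nn.Module):
    """Model p(R1, R2 | h) as p(R1 | h) * p(R2 | R1, h)."""
    def __init__(self, hidden=32, k1=10, k2=10):
        super().__init__()
        self.head1 = nn.Linear(hidden, k1)        # logits for R1 given h
        self.head2 = nn.Linear(hidden + k1, k2)   # logits for R2 given (R1, h)

    def forward(self, h):
        logits1 = self.head1(h)
        r1 = torch.distributions.Categorical(logits=logits1).sample()
        r1_onehot = F.one_hot(r1, num_classes=logits1.size(-1)).float()
        logits2 = self.head2(torch.cat([h, r1_onehot], dim=-1))
        return logits1, r1, logits2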

  1. Why are the input and target built on opposite indexes? I mean the :-1 and 1: in the following lines:
    input = Variable(torch.from_numpy(data[3:, :-1]), requires_grad=False)
    target = Variable(torch.from_numpy(data[3:, 1:]), requires_grad=False)

  2. I'm trying to modify the example so the result would be a prediction not of the future sine values, but rather of the angle in radians that produced the specific sine value. I'm doing this as an exercise; the point is that sine values are ambiguous, since the same value can come from different angles in different quadrants, hence the need to know what the sine value before the current one was in order to predict the future values.
    I thought it could be a good time series classification example. I save the original angle values used in the generator script:

import numpy as np
import torch

np.random.seed(2)

T = 20
L = 1000
N = 100

x = np.empty((N, L), 'int64')
x[:] = np.array(range(L)) + np.random.randint(-4 * T, 4 * T, N).reshape(N, 1)
vals = x / 1.0 / T
data = np.sin(vals).astype('float64')

# save the labels to be used as target values later on
torch.save(vals, open('labels.pt', 'wb'))
torch.save(data, open('traindata.pt', 'wb'))

…and then I use them as targets in the training script, after normalizing them to values between 0 and 2*pi:

def deg2norm(self, deg):
    """Input value between 0 - 360 deg."""
    # return ((deg % (2 * math.pi)) - math.pi) / math.pi
    return deg % (2 * math.pi)

def createTarget(self, data):
    """
    Create a vector of expected results whose values represent 0-360 degrees.
    Due to ML constraints the values need to be in the range of -1..1.
    """
    # Push all the values into the positive spectrum of 0 - 360
    # (from an angle perspective)
    delta = (round(abs(np.min(data[:]))) + 1) * (2 * math.pi)
    val = data + delta
    return self.deg2norm(val)

labels = torch.load('labels.pt')
labelsTarget = seq.createTarget(labels)

target = Variable(torch.from_numpy(labelsTarget[3:]), requires_grad=False)

Alas, the net's predictions fail, although the loss values look OK.

Any ideas?