Incorporating "explicitly" the notion of "exact" time for RNN

Hi,
I’m creating a simple RNN which takes as input an hourly time series and outputs another hourly time series. We can think of the process as “denoising”.

I feed the data one day at a time, so basically, I feed sequences of 1-dim and length 24.

However, this vanilla RNN does not explicitly know that the first reading corresponds to 12 AM, the next to 1 AM, and so on.

How can we incorporate this “time-of-day” feature into RNN?

One way could be to one-hot encode the hour and then feed in, at each time step, the 24-bit one-hot encoding concatenated with the value at that time. This would make the input 25-dim and length 24, while the output is still 1-dim and length 24.
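For concreteness, a rough sketch of that construction (the array names and shapes here are hypothetical):

import numpy as np

# hypothetical hourly readings: one row per day, 24 columns (hours 0-23)
values = np.random.rand(100, 24) * 1000

eye = np.eye(24)                                                # row h is the one-hot code for hour h
one_hot = np.broadcast_to(eye, (values.shape[0], 24, 24))       # repeat the codes for every day
inputs = np.concatenate([values[..., None], one_hot], axis=-1)  # shape (100, 24, 25)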

I feel this approach is odd, as it forces two different types of inputs (real-valued vs. one-hot encoded), which also live in very different ranges (say 0-1000 vs. 0/1), to be part of the same input sequence.

How would you suggest incorporating time into an RNN?

How about adding two features like this (EDIT: corrected)

sin(2 * pi * hour / 24)
cos(2 * pi * hour / 24)

The combination should allow the model to assimilate the notion of hour in a continuous, real-valued, yet circular way.
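A minimal sketch of those two features (hours assumed to run 0-23):

import numpy as np

hours = np.arange(24)
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)
# (sin, cos) places each hour on the unit circle, so 11 PM and midnight
# end up close together rather than 23 "hours" apart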

Thanks for your reply. Given that we have only a single reading per hour, our sine and cosine features may end up looking something like the following:

[plot of the sine and cosine hour features over the 24 hourly readings]

Are you suggesting to use something like the following:

  1. Input (batch size X 24 X 3) where we have dimensions corresponding to: sine, cosine, and the actual input signal
  2. Output (batch size X 24 X 1) where we have actual output
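Concretely, something like this? (A rough sketch; the signal array and batch size are hypothetical.)

import numpy as np

hours = np.arange(24)
time_feats = np.stack([np.sin(2 * np.pi * hours / 24),
                       np.cos(2 * np.pi * hours / 24)], axis=-1)    # (24, 2)

signal = np.random.rand(32, 24)                                     # hypothetical input signal, (batch, 24)
time_feats = np.broadcast_to(time_feats, (signal.shape[0], 24, 2))  # (batch, 24, 2)
inp = np.concatenate([time_feats, signal[..., None]], axis=-1)      # (batch, 24, 3)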

It is a good thing you thought to plot those features. I made a mistake. I should have put…

sin(2 * pi * hour / 24)
cos(2 * pi * hour / 24)

To answer your last question: Yes. I am suggesting using dimensions like those.


Thanks for your reply.

In practice I found the following:

  1. The train error is significantly higher when I use the additional sine and cosine as inputs, compared to just using the 1-dim signal.
  2. I hypothesized that this could be because my 1-dim input signal is actually on a scale of 0-3000 or so, whereas the sine and the cosine representing the hour of the day are only on a scale of 0 to 1.
  3. To address #2, I normalized the 1-dim signals to a range of 0 to 1. However, I now observe that my predictions during the training phase have a lot of negative values. This is odd since my actual 1-dim input signal and the output are both non-negative.

Responding point by point:

  1. Odd. I would have thought that if the new inputs were not useful the model could simply ignore them.
  2. Unless you have rescaled the sine and cosine inputs, they are in the range -1 to 1, not 0 to 1.
  3. Normalising the data is often a good idea. A common method is to subtract the mean and divide by the standard deviation. That gives data that is mostly in the range -1 to 1, which apparently helps the gradients flow properly.
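For example, a minimal standardisation sketch (the array name is hypothetical; the statistics should be fitted on the training data only):

import numpy as np

x_train = np.random.rand(5000, 24, 1) * 3000   # hypothetical raw hourly inputs

mean, std = x_train.mean(), x_train.std()
x_train_std = (x_train - mean) / std           # roughly centred, unit variance
# reuse the same mean/std for validation/test data rather than refitting them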

A few negative values would not overly surprise me, but it is odd that they appeared only after rescaling the inputs.

Another potential issue is the shortness of the batches. Each batch goes from midnight to midnight which probably means that backpropagation through time is cut off at midnight every night. This means that for the 1am reading the backpropagation is very limited, and for the 11pm reading the backpropagation is longer, but still not long enough to take into account any effects that last longer than 24 hours.

I would suggest using longer batches at least in the initial training.

Thanks. That all sounds like very reasonable advice.

I'm currently training on around 5k samples of length 24 each. I guess that's not sufficient?

I am intrigued by your discussion of the shortness of the batches. I'd written an educational post on using RNNs for signal denoising in PyTorch, and I'd noticed the effect you're mentioning: the predicted time series always had higher errors for the first few points.

I guess one solution could be to use rolling 24-hour windows? Not only would this increase the effective sample size, it would also mean that backpropagation isn't cut off so early.

One thing you could do is to duplicate your series 24 times. The first copy would skip one hour at the beginning, the second copy would skip 2 hours, etc. Then you could combine these series in parallel to make a tensor of shape (seq_len, 24, n_features).

That gives you two benefits.

  1. batching allows more efficient computation
  2. you get to train it on all possible 24 hour windows at once.

I would still suggest increasing your window length.
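A rough sketch of that duplication (assuming a single long 1-D array of hourly readings; the names are hypothetical):

import numpy as np

series = np.random.rand(24 * 365)              # hypothetical long hourly series, one value per hour

seq_len = len(series) - 24                     # common length once each copy skips 0..23 hours
copies = [series[offset:offset + seq_len] for offset in range(24)]
batch = np.stack(copies, axis=1)[..., None]    # (seq_len, 24, 1): sequence, 24 offset copies, 1 feature
# extra features (e.g. the sine/cosine of the hour) would be concatenated along the last axis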

If you don’t mind I’d be interested in seeing your code.

Thanks! Here’s my code

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torch.autograd import Variable

torch.manual_seed(0)
np.random.seed(0)

# Custom RNN

class CustomRNN(nn.Module):
    def __init__(self, cell_type, hidden_size, num_layers, bidirectional):
        super(CustomRNN, self).__init__()
        torch.manual_seed(0)

        if bidirectional:
            self.num_directions = 2
        else:
            self.num_directions = 1
        if cell_type=="RNN":
            self.rnn = nn.RNN(input_size=3, hidden_size=hidden_size,
                   num_layers=num_layers, batch_first=True,
                   bidirectional=bidirectional)
        elif cell_type=="GRU":
            self.rnn = nn.GRU(input_size=3, hidden_size=hidden_size,
                              num_layers=num_layers, batch_first=True,
                              bidirectional=bidirectional)
        else:
            self.rnn = nn.LSTM(input_size=3, hidden_size=hidden_size,
                              num_layers=num_layers, batch_first=True,
                              bidirectional=bidirectional)

        self.linear = nn.Linear(hidden_size*self.num_directions, 1 )
        self.act = nn.ReLU()

    def forward(self, x):
        pred, hidden = self.rnn(x, None)
        pred = self.linear(pred)
        
        #pred = torch.clamp(pred, min=0.)
        #pred = self.act(pred)
        #pred = torch.min(pred, x)
        return pred


num_folds = 5

if torch.cuda.is_available():
    cuda_av = True
else:
    cuda_av=False

# Specifying the params

fold_num = 0
num_folds = 5
cell_type="GRU"
hidden_size = 100
lr = 1
bidirectional = False

hours = np.arange(1, 25, 1)

# sine and co-sine for incorporating the hour of the day
d=pd.DataFrame([np.sin(2 * np.pi * hours/24), np.cos(2 * np.pi * hours/24)]).T

train, test = get_train_test(num_folds=num_folds, fold_num=fold_num)  # user-defined helper (not shown) that splits the data into train/test folds

train_inp = train[:, 0, :, :].reshape(-1, 24, 1) # continuous-valued input of length 24, dimension 1
train_out = train[:, 1, :, :].reshape(-1, 24, 1) # continuous-valued variable to be estimated of length 24, dimension 1

# Making train_inp_time of #samples, 24, 3
train_inp_time = np.zeros((train_inp.shape[0], 24, 3))
for sample in range(train_inp.shape[0]):
    temp = d.copy()
    temp['val'] = train_inp[sample, :, :]
    train_inp_time[sample :, :] = temp.values

loss_func = nn.L1Loss()
r = CustomRNN(cell_type, hidden_size, 1, bidirectional)

if cuda_av:
    r = r.cuda()
    loss_func = loss_func.cuda()

optimizer = torch.optim.Adam(r.parameters(), lr=lr)

num_iterations=100
for t in range(num_iterations):

    inp = Variable(torch.Tensor(train_inp_time), requires_grad=True)
    train_y = Variable(torch.Tensor(train_out))
    if cuda_av:
        inp = inp.cuda()
        train_y = train_y.cuda()
    pred = r(inp)
    print(pred.std().data[0], pred.mean().data[0])
    optimizer.zero_grad()
    loss = loss_func(pred, train_y)
    if t % 1 == 0:
        print(t, loss.data[0])
    loss.backward()
    optimizer.step()
    

The one major issue I see is in the line

pred, hidden = self.rnn(x, None)

You give hidden=None to the rnn which means that the rnn starts each batch with a new blank hidden state full of zeros. This means that when it sees the first reading of the day, the model has no memory of what happened yesterday. This will seriously limit its predictive ability.
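One common workaround, sketched below rather than taken from your code, is to carry the hidden state from one batch to the next and detach it so that backpropagation is still truncated at the batch boundary:

# sketch, assuming the GRU cell (its hidden state is a single tensor; for an LSTM it is a tuple)
# and a forward() modified to accept and return the hidden state, e.g.
#
#     def forward(self, x, hidden=None):
#         out, hidden = self.rnn(x, hidden)
#         return self.linear(out), hidden
#
hidden = None
for x_batch, y_batch in batches:           # hypothetical batches, consecutive in time
    pred, hidden = r(x_batch, hidden)      # start from the previous batch's final state
    hidden = hidden.detach()               # cut the gradient at the batch boundary
    optimizer.zero_grad()
    loss = loss_func(pred, y_batch)
    loss.backward()
    optimizer.step()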

I do see your point.

Actually, my input dimensions were: (num_homes, num_days, num_hours)

I’d gotten rid of the num_homes dimension by treating each num_hours sample as independent, effectively creating num_homes*num_days samples. I felt there is a trade-off: if we train, say, on a per-home basis, we have fewer (but richer) samples overall, versus treating each num_hours sample as independent.

What do you think?

I guess your data is electricity consumption per home per hour or something like that.
Personally, I would keep the per home aspect because one home will behave similarly from day to day and from week to week. But then again it depends on which approach you prefer…

A model that can read in several days/weeks worth of data for one home and base its predictions on that might turn out to be very effective.

Another approach that might make more sense would be to average the data per hour over all the homes.

I see your point. However, my only concern is: this way, don’t we have a much smaller number of samples? Though effectively the data we use to train the model remains the same!

I don’t quite understand this. Is this a baseline you’re suggesting?

Also, did you find anything in the code that is indicative of the failure when I incorporate the sine and cosine?

I consider every timestep to be a data sample for the RNN, because for each timestep the rnn takes an input, combines it with its memory of the previous inputs, and produces an output. How we organise the samples into short or long sequences is largely irrelevant from that point of view. That said, if seasonality or trends in the input history are at all useful in predicting the output, then longer sequences will be useful.

No, I didn’t think it through; averaging the data wouldn’t make much sense except as a baseline. A useful baseline for the model loss could be produced by calculating the loss you would get if you used averaged output values as predictions.
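For instance, a sketch of that baseline using the train_out array from the code above:

# mean output for each hour of the day, used as a constant prediction
hourly_mean = train_out.mean(axis=0, keepdims=True)          # (1, 24, 1)
baseline_pred = np.broadcast_to(hourly_mean, train_out.shape)
baseline_l1 = np.abs(baseline_pred - train_out).mean()       # compare the model's L1 training loss to this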

Concerning the incorporation of the sine and cosine, you might have forgotten a comma in this line

train_inp_time[sample :, :] = temp.values

but I don’t think it changes the way the code works.

I can’t see any other issues related to the sine and cosine features.

Day of week might be another useful feature to consider.
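It could be encoded the same way as the hour, for example:

day = np.arange(7)                         # day-of-week index, 0-6
dow_sin = np.sin(2 * np.pi * day / 7)
dow_cos = np.cos(2 * np.pi * day / 7)
# these two columns would be concatenated to the per-timestep features, like the hour encoding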

I am afraid I don’t fully follow. I’m not sure I have the same definition of “series” in mind as you do. It would be great if you could elaborate a bit.

BTW, thanks a ton for your efforts. I really found the suggestions to be very useful. I’ll surely acknowledge your inputs!

I made this mistake while copying the code!

I see your point and it seems very valid! Of course, with a very long sequence a vanilla RNN would suffer, and an LSTM or GRU might do much better.

That being said, I guess you’d be suggesting using something like a single sequence per home of length num_days*24?

I see your point. I agree. If only I could get these additional features to work :frowning:

Exactly.

Maybe my ideas were badly formed too. I’ll try again. From your input data you produce sequences of length 24 that start at midnight; this is your training data. You could run through the process again, producing sequences of length 24 that start at 1am, and add these to the training data. And again producing sequences that start at 2am, and so on. That way you would have windows that start at midnight, windows that start at 1am, windows that start at 2am, etc. Adding the hour features would be tricky though.

Are the target values normalized?
What about adding more training iterations? The model might need more time to adapt to the inputs when there are more of them.


Not yet. Actually my input and output are somewhat related, in the sense that output <= input. So I thought we should normalize them using the same function. What do you think?

Hmm. This is exactly what I had in mind when I mentioned rolling windows. This sounds interesting and hopefully beneficial too!

If the input and output have roughly the same range that should be reasonable.
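For instance, a sketch reusing the arrays from the code above, with a single factor fitted on the training input:

scale = train_inp.max()                    # one shared scale factor
train_inp_scaled = train_inp / scale
train_out_scaled = train_out / scale       # same factor, so output <= input and non-negativity are preserved
# predictions can be mapped back to the original units with pred * scale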