Incorporating "explicitly" the notion of "exact" time for RNN

Hi,
I’m creating a simple RNN which takes as input an hourly time series and outputs another hourly time series. We can think of the process as “denoising”.

I feed the data one day at a time, so basically, I feed sequences of 1-dim and length 24.

However, this vanilla RNN does not explicitly know that the first reading corresponds to 12 AM, the next to 1 AM, and so on.

How can we incorporate this “time-of-day” feature into RNN?

One way could be to one-hot encode the hour and then feed in, at each time step, the 24-bit one-hot encoding concatenated with the value at that time. This would make the input 25-dim and length 24, while the output is still 1-dim and length 24.
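For concreteness, a rough sketch of that construction (the array names and shapes here are hypothetical):

import numpy as np

# hypothetical hourly readings: one row per day, 24 columns (hours 0-23)
values = np.random.rand(100, 24) * 1000

eye = np.eye(24)                                                # row h is the one-hot code for hour h
one_hot = np.broadcast_to(eye, (values.shape[0], 24, 24))       # repeat the codes for every day
inputs = np.concatenate([values[..., None], one_hot], axis=-1)  # shape (100, 24, 25)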

I feel this approach is odd, as it forces two different types of inputs (real-valued vs. one-hot encoded), which also live in very different ranges (say 0-1000 vs. 0/1), to be part of the same input sequence.

How would you suggest incorporating time into an RNN?

How about adding two features like this (EDIT: corrected)

sin(2 * pi * hour / 24)
cos(2 * pi * hour / 24)

The combination should allow the model to assimilate the notion of hour in a continuous, real-valued, yet circular way.
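A minimal sketch of those two features (hours assumed to run 0-23):

import numpy as np

hours = np.arange(24)
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)
# (sin, cos) places each hour on the unit circle, so 11 PM and midnight
# end up close together rather than 23 "hours" apart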

Thanks for your reply. Given that we have only a single reading per hour, our sine and cosine features may end up looking something like the following:

[plot of the sine and cosine hour features over the 24 hourly readings]

Are you suggesting to use something like the following:

  1. Input (batch size X 24 X 3) where we have dimensions corresponding to: sine, cosine, and the actual input signal
  2. Output (batch size X 24 X 1) where we have actual output
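Concretely, something like this? (A rough sketch; the signal array and batch size are hypothetical.)

import numpy as np

hours = np.arange(24)
time_feats = np.stack([np.sin(2 * np.pi * hours / 24),
                       np.cos(2 * np.pi * hours / 24)], axis=-1)    # (24, 2)

signal = np.random.rand(32, 24)                                     # hypothetical input signal, (batch, 24)
time_feats = np.broadcast_to(time_feats, (signal.shape[0], 24, 2))  # (batch, 24, 2)
inp = np.concatenate([time_feats, signal[..., None]], axis=-1)      # (batch, 24, 3)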

It is a good thing you thought to plot those features. I made a mistake. I should have put…

sin(2 * pi * hour / 24)
cos(2 * pi * hour / 24)

To answer your last question: Yes. I am suggesting using dimensions like those.


Thanks for your reply.

In practice I found the following:

  1. The train error is significantly higher when I use the additional sine and cosine as inputs, compared to just using the 1-dim signal.
  2. I hypothesized that this could be because my 1-dim input signal is actually on a scale of 0-3000 or so, whereas the sine and the cosine representing the hour of the day are only on a scale of 0 to 1.
  3. To address #2, I normalized the 1-dim signals to a range of 0 to 1. However, I now observe that my predictions during the training phase have a lot of negative values. This is odd since my actual 1-dim input signal and the output are both non-negative.

Responding point by point:

  1. Odd. I would have thought that if the new inputs were not useful the model could simply ignore them.
  2. Unless you have rescaled the sine and cosine inputs, they are in the range -1 to 1, not 0 to 1.
  3. Normalising the data is often a good idea. A common method is to subtract the mean and divide by the standard deviation. That gives data that is mostly in the range -1 to 1, which apparently helps the gradients flow properly.
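For example, a minimal standardisation sketch (the array name is hypothetical; the statistics should be fitted on the training data only):

import numpy as np

x_train = np.random.rand(5000, 24, 1) * 3000   # hypothetical raw hourly inputs

mean, std = x_train.mean(), x_train.std()
x_train_std = (x_train - mean) / std           # roughly centred, unit variance
# reuse the same mean/std for validation/test data rather than refitting them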

A few negative values would not overly surprise me, but it is odd that they appeared only after rescaling the inputs.

Another potential issue is the shortness of the batches. Each batch goes from midnight to midnight which probably means that backpropagation through time is cut off at midnight every night. This means that for the 1am reading the backpropagation is very limited, and for the 11pm reading the backpropagation is longer, but still not long enough to take into account any effects that last longer than 24 hours.

I would suggest using longer batches at least in the initial training.

Thanks. That all sounds like very reasonable advice.

I'm currently training on around 5k samples of length 24 each. I guess that's not sufficient?

I am intrigued by your discussion of the shortness of the batches. I'd written an educational post on using RNNs for signal denoising in PyTorch, and I'd noticed the effect you're mentioning: the predicted time series always had higher errors for the first few points.

I guess one solution could be to use rolling 24-hour windows? Not only would this increase the effective sample size, it would also mean that backpropagation isn't cut off so early.

One thing you could do is to duplicate your series 24 times. The first copy would skip one hour at the beginning, the second copy would skip 2 hours, etc. Then you could combine these series in parallel to make a tensor of shape (seq_len, 24, n_features).

That gives you two benefits.

  1. batching allows more efficient computation
  2. you get to train it on all possible 24 hour windows at once.

I would still suggest increasing your window length.
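A rough sketch of that duplication (assuming a single long 1-D array of hourly readings; the names are hypothetical):

import numpy as np

series = np.random.rand(24 * 365)              # hypothetical long hourly series, one value per hour

seq_len = len(series) - 24                     # common length once each copy skips 0..23 hours
copies = [series[offset:offset + seq_len] for offset in range(24)]
batch = np.stack(copies, axis=1)[..., None]    # (seq_len, 24, 1): sequence, 24 offset copies, 1 feature
# extra features (e.g. the sine/cosine of the hour) would be concatenated along the last axis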

If you don’t mind I’d be interested in seeing your code.

Thanks! Here’s my code

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torch.autograd import Variable

torch.manual_seed(0)
np.random.seed(0)

# Custom RNN

class CustomRNN(nn.Module):
    def __init__(self, cell_type, hidden_size, num_layers, bidirectional):
        super(CustomRNN, self).__init__()
        torch.manual_seed(0)

        if bidirectional:
            self.num_directions = 2
        else:
            self.num_directions = 1
        if cell_type=="RNN":
            self.rnn = nn.RNN(input_size=3, hidden_size=hidden_size,
                   num_layers=num_layers, batch_first=True,
                   bidirectional=bidirectional)
        elif cell_type=="GRU":
            self.rnn = nn.GRU(input_size=3, hidden_size=hidden_size,
                              num_layers=num_layers, batch_first=True,
                              bidirectional=bidirectional)
        else:
            self.rnn = nn.LSTM(input_size=3, hidden_size=hidden_size,
                              num_layers=num_layers, batch_first=True,
                              bidirectional=bidirectional)

        self.linear = nn.Linear(hidden_size*self.num_directions, 1 )
        self.act = nn.ReLU()

    def forward(self, x):
        pred, hidden = self.rnn(x, None)
        pred = self.linear(pred)
        
        #pred = torch.clamp(pred, min=0.)
        #pred = self.act(pred)
        #pred = torch.min(pred, x)
        return pred


num_folds = 5

if torch.cuda.is_available():
    cuda_av = True
else:
    cuda_av=False

# Specifying the params

fold_num = 0
num_folds = 5
cell_type="GRU"
hidden_size = 100
lr = 1
bidirectional = False

hours = np.arange(1, 25, 1)

# sine and co-sine for incorporating the hour of the day
d=pd.DataFrame([np.sin(2 * np.pi * hours/24), np.cos(2 * np.pi * hours/24)]).T

train, test = get_train_test(num_folds=num_folds, fold_num=fold_num)  # user-defined helper (not shown) that splits the data into train/test folds

train_inp = train[:, 0, :, :].reshape(-1, 24, 1) # continuous-valued input of length 24, dimension 1
train_out = train[:, 1, :, :].reshape(-1, 24, 1) # continuous-valued variable to be estimated of length 24, dimension 1

# Making train_inp_time of #samples, 24, 3
train_inp_time = np.zeros((train_inp.shape[0], 24, 3))
for sample in range(train_inp.shape[0]):
    temp = d.copy()
    temp['val'] = train_inp[sample, :, :]
    train_inp_time[sample :, :] = temp.values

loss_func = nn.L1Loss()
r = CustomRNN(cell_type, hidden_size, 1, bidirectional)

if cuda_av:
    r = r.cuda()
    loss_func = loss_func.cuda()

optimizer = torch.optim.Adam(r.parameters(), lr=lr)

num_iterations=100
for t in range(num_iterations):

    inp = Variable(torch.Tensor(train_inp_time), requires_grad=True)
    train_y = Variable(torch.Tensor(train_out))
    if cuda_av:
        inp = inp.cuda()
        train_y = train_y.cuda()
    pred = r(inp)
    print(pred.std().data[0], pred.mean().data[0])
    optimizer.zero_grad()
    loss = loss_func(pred, train_y)
    if t % 1 == 0:
        print(t, loss.data[0])
    loss.backward()
    optimizer.step()
    

The one major issue I see is in the line

pred, hidden = self.rnn(x, None)

You give hidden=None to the rnn which means that the rnn starts each batch with a new blank hidden state full of zeros. This means that when it sees the first reading of the day, the model has no memory of what happened yesterday. This will seriously limit its predictive ability.
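One common workaround, sketched below rather than taken from your code, is to carry the hidden state from one batch to the next and detach it so that backpropagation is still truncated at the batch boundary:

# sketch, assuming the GRU cell (its hidden state is a single tensor; for an LSTM it is a tuple)
# and a forward() modified to accept and return the hidden state, e.g.
#
#     def forward(self, x, hidden=None):
#         out, hidden = self.rnn(x, hidden)
#         return self.linear(out), hidden
#
hidden = None
for x_batch, y_batch in batches:           # hypothetical batches, consecutive in time
    pred, hidden = r(x_batch, hidden)      # start from the previous batch's final state
    hidden = hidden.detach()               # cut the gradient at the batch boundary
    optimizer.zero_grad()
    loss = loss_func(pred, y_batch)
    loss.backward()
    optimizer.step()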

I do see your point.

Actually, my input dimensions were: (num_homes, num_days, num_hours)

I’d gotten rid of the num_homes dimension by treating each num_hours sample as independent, effectively creating num_homes*num_days samples. I felt there is a trade-off: if we train, say, on a per-home basis, we have fewer (but richer) samples overall, versus treating each num_hours sample as independent.

What do you think?

I guess your data is electricity consumption per home per hour or something like that.
Personally, I would keep the per home aspect because one home will behave similarly from day to day and from week to week. But then again it depends on which approach you prefer…

A model that can read in several days/weeks worth of data for one home and base its predictions on that might turn out to be very effective.

Another approach that might make more sense would be to average the data per hour over all the homes.

I see your point. However, my only concern is: this way, don’t we have a much smaller number of samples? Though effectively the data we use to train the model remains the same!

I don’t quite understand this. Is this a baseline you’re suggesting?

Also, did you find anything in the code that is indicative of the failure when I incorporate the sine and cosine?

I consider every timestep to be a data sample for the RNN, because for each timestep the rnn takes an input, combines it with its memory of the previous inputs, and produces an output. How we organise the samples into short or long sequences is largely irrelevant from that point of view. That said, if seasonality or trends in the input history are at all useful in predicting the output, then longer sequences will be useful.

No, I didn’t think it through; averaging the data wouldn’t make much sense except as a baseline. A useful baseline for the model loss could be produced by calculating the loss you would get if you used averaged output values as predictions.
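For instance, a sketch of that baseline using the train_out array from the code above:

# mean output for each hour of the day, used as a constant prediction
hourly_mean = train_out.mean(axis=0, keepdims=True)          # (1, 24, 1)
baseline_pred = np.broadcast_to(hourly_mean, train_out.shape)
baseline_l1 = np.abs(baseline_pred - train_out).mean()       # compare the model's L1 training loss to this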

Concerning the incorporation of the sine and cosine, you might have forgotten a comma in this line

train_inp_time[sample :, :] = temp.values

but I don’t think it changes the way the code works.

I can’t see any other issues related to the sine and cosine features.

Day of week might be another useful feature to consider.
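It could be encoded the same way as the hour, for example:

day = np.arange(7)                         # day-of-week index, 0-6
dow_sin = np.sin(2 * np.pi * day / 7)
dow_cos = np.cos(2 * np.pi * day / 7)
# these two columns would be concatenated to the per-timestep features, like the hour encoding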

I am afraid I don’t fully follow. I’m not sure I have the same definition of “series” in mind as you do. It would be great if you could elaborate a bit.

BTW, thanks a ton for your efforts. I really found the suggestions to be very useful. I’ll surely acknowledge your inputs!

I made this mistake while copying the code!

I see your point and it seems very valid! Of course, with a very long sequence a vanilla RNN would suffer, and an LSTM or GRU might do much better.

That being said, I guess you’d be suggesting using something like a single sequence per home of length num_days*24?

I see your point. I agree. If only I could get these additional features to work :frowning:

Exactly.

Maybe my ideas were badly formed too. I’ll try again. From your input data you produce sequences of length 24 that start at midnight; this is your training data. You could run through the process again, producing sequences of length 24 that start at 1am, and add these to the training data. And again producing sequences that start at 2am, and so on. That way you would have windows that start at midnight, windows that start at 1am, windows that start at 2am, etc. Adding the hour features would be tricky though.

Are the target values normalized?
What about adding more training iterations? The model might need more time to adapt to the inputs when there are more of them.


Not yet. Actually my input and output are somewhat related, in the sense that output <= input. So I thought we should normalize them using the same function. What do you think?

Hmm. This is exactly what I had in mind when I mentioned rolling windows. This sounds interesting and hopefully beneficial too!

If the input and output have roughly the same range that should be reasonable.
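For instance, a sketch reusing the arrays from the code above, with a single factor fitted on the training input:

scale = train_inp.max()                    # one shared scale factor
train_inp_scaled = train_inp / scale
train_out_scaled = train_out / scale       # same factor, so output <= input and non-negativity are preserved
# predictions can be mapped back to the original units with pred * scale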