Conceptual question - is it correct to use categorical variables such as day, month, year as a fixed sequence input in LSTM?

I am working on a problem where I have to predict a continuous dependent variable every hour based on hourly temperature (the only continuous predictor), along with 4 categorical variables, i.e., hour of the day, day of the week, month, and year (as shown below). I am planning to encode the categorical variables as embeddings.

Dependent variable  Temperature  hour  day  month  year
15.6                30           0     20   March  1994
23.7                11           1     6    April  1992

and so on

My question is this: in an LSTM model, creating fixed-length sequences of temperature (my continuous variable) makes sense, but I am not convinced that it makes sense to create fixed-length sequences of days, months, or years to feed into the LSTM cell. After all, when I create embeddings to represent the categorical variables as constant-sized vectors, then within a fixed-length sequence of 4 time steps the day, month, and year features, and hence their embeddings, will be the same at every step. So how does that contribute to model learning? Is it correct to create sequences of date- and time-based categorical data?
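To make the concern above concrete, here is a minimal sketch that lists which calendar features actually change inside a window of consecutive hourly steps. The function names `calendar_features` and `varying_features` are hypothetical helpers written for this illustration, and the timestamps are made up; only the four feature names come from the question.

```python
from datetime import datetime, timedelta


def calendar_features(ts):
    """Return the four categorical features from the question for one timestamp."""
    return {
        "hour": ts.hour,
        "day_of_week": ts.weekday(),
        "month": ts.month,
        "year": ts.year,
    }


def varying_features(start, length):
    """Names of the features that take more than one value inside the window."""
    steps = [calendar_features(start + timedelta(hours=i)) for i in range(length)]
    return sorted(
        name for name in steps[0] if len({step[name] for step in steps}) > 1
    )


# A window of 4 hourly steps inside a single day.
print(varying_features(datetime(1994, 3, 20, 0), 4))

# A window of 24 hourly steps that crosses midnight.
print(varying_features(datetime(1994, 3, 20, 22), 24))
```

Within a short window that stays inside one day, only the hour varies, so the other three features carry the same value at every step. They still tell the model which day, month, and year the window belongs to, but they add no information about ordering within the window.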

Hi, @deeplearner20

That’s an interesting question.
RNNs are commonly used to predict the next value in a sequence; they are trained on ground-truth sequences to extract patterns from the context. It is definitely OK to try to predict the N-th day's temperature from the sequence of the previous N-1 values with an LSTM.

If you are trying to fit an LSTM network on months that are not in order (e.g., March, May, February, May), it won’t make sense to try to extract a pattern from that data. If you are trying to train it on ordered months, that would be overkill, because you can predict the following month without a model.

To me, your task looks like a regression task, which might be better served by other methods such as decision trees or linear regression. You can pass your categorical variables to such a model and add an embedding of the temperatures as a separate feature. The embedding could be built as follows: train an LSTM on time-ordered sequences of 4 temperatures to predict the 5th, then add the embedding of those 4 temperatures to your regression model. And don’t forget to avoid data leakage: the embedding shouldn’t be based on the temperature of the record you are trying to predict.

Hope my thoughts will be helpful

Hello @zetyquickly, thank you so much for replying. My dependent variable is not temperature. It can be (for example) hourly electricity consumption. The reason I encode the date-time variables (hour of the day, day of the week, month, and year) as categorical variables is that these are ‘fixed effects’. The hour of the day accounts for the fact that electricity consumption at any given hour stays roughly the same from day to day because of people’s habits. For example, at 8 am we can expect someone to use more electricity every day as they wake up and get ready for work than at 1 am.

Yes, this is a multivariate regression problem. Currently, my hyperparameter tuning (training the model with electricity consumption as the dependent variable and temperature as the sole regressor, without any categorical variables) indicates that I should use a fixed sequence length of 24; that is, modeling the nth hourly electricity demand as dependent on the preceding 24 values of hourly temperature gives the best accuracy (measured by RMSE and MAPE).
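As a rough sketch of the windowing described above, the snippet below pairs each hourly demand value with a fixed-length window of temperatures. It assumes the window covers the 24 hours before hour n and excludes hour n itself; if the window should end at hour n, the slice shifts by one index. The helper `make_windows` and the toy data are hypothetical and only illustrate the shapes involved.

```python
def make_windows(temperature, demand, length=24):
    """Pair demand[n] with the `length` temperature values that precede hour n."""
    if len(temperature) != len(demand):
        raise ValueError("temperature and demand must have the same length")
    inputs, targets = [], []
    for n in range(length, len(demand)):
        inputs.append(temperature[n - length:n])  # hours n-length .. n-1
        targets.append(demand[n])                 # value to predict at hour n
    return inputs, targets


# Toy hourly series, made up for illustration only.
temperature = [float(t) for t in range(30)]
demand = [100.0 + t for t in range(30)]

X, y = make_windows(temperature, demand, length=24)
print(len(X), len(X[0]), y[0])
```

Each row of `X` is one input sequence for the LSTM and the matching entry of `y` is its target, so the first 24 hours of the series produce no training example.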

I have already tested my model with categorical regressors using OLS and Categorical Boosting, but the LSTM model performs much better. I just want to confirm if I am thinking in the right direction or not. Very much appreciate any guidance you can provide :slight_smile:

So, are you saying that fitting the LSTM with temperature as the sole input gives you the best prediction of electricity consumption?

You can pass the other categorical variables to an LSTM as vectors of values [temperature, day_of_month, etc...], but that may lead to worse performance, because the model can become too complex for the amount of available data. That is not the case when you use only one input feature.
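To get a feel for how much the model grows, here is a back-of-the-envelope sketch that counts trainable parameters for a single textbook LSTM layer (4 gates, each with input weights, recurrent weights, and one bias vector). The hidden size, embedding dimensions, and number of distinct years are all made-up assumptions, and some frameworks keep two bias vectors per gate, which would add another 4 * hidden_size to each count.

```python
def lstm_param_count(input_size, hidden_size):
    """Parameters of one LSTM layer: 4 gates x (input weights + recurrent weights + bias)."""
    return 4 * (hidden_size * input_size + hidden_size * hidden_size + hidden_size)


def embedding_param_count(cardinalities, dims):
    """Parameters of the embedding tables: one row of size dims[name] per category."""
    return sum(cardinalities[name] * dims[name] for name in cardinalities)


hidden_size = 64  # assumed for illustration

# Assumed category counts and embedding sizes (the year count is made up).
cardinalities = {"hour": 24, "day_of_week": 7, "month": 12, "year": 5}
dims = {"hour": 8, "day_of_week": 4, "month": 6, "year": 3}

# Temperature only: one input feature per time step.
temp_only = lstm_param_count(1, hidden_size)

# Temperature concatenated with the four embeddings at each time step.
input_size = 1 + sum(dims.values())
with_categoricals = lstm_param_count(input_size, hidden_size) + embedding_param_count(
    cardinalities, dims
)

print(temp_only, with_categoricals)
```

Under these assumptions the recurrent weights dominate both counts, so the relative increase depends mostly on how large the embeddings are compared with the hidden size.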

Did I understand your question correctly?