Simple LSTM network fails to learn

I have a problem where 7 features are collected from a set of sensors every day. The sensors are divided into X and y, i.e., based on the values of the sensors in X we need to predict a value for each sensor in y. I assume each observation depends on the 36 previous observations. My input is now [batchsize, 36, sensor_numbersx, features_dim] and the target is [batchsize, sensor_numbersy, 1]. I have a very simple LSTM network as follows:

class LSTMMLP(nn.Module):
    def __init__(self,
        super(LSTMMLP, self).__init__()

        self.lstm_layers = nn.LSTM(input_dim*insties,
        self.linear_layers = Sequential(Linear(32, 32),
                                        Linear(32, output_dim),

    def forward(self, x):
        _, (hn, _) = self.lstm_layers(x)
        out = self.linear_layers(hn[-1])
        return out

loss_func = nn.L1Loss(reduction='mean')
optimizer = torch.optim.SGD(model.parameters(), lr=lr)

I have NaNs in the targets, so I compute the loss only over the non-NaN entries, as follows:

for i, data in enumerate(train_loader):
    inputs, labels = data
    inputs = inputs.reshape(inputs.shape[0], inputs.shape[1], -1)

    outputs = model(inputs)
    outputs = outputs.reshape(bsize, -1, 1)
    train_loss = loss_func(outputs[~torch.isnan(labels)],
    tloss += train_loss.item()
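A minimal sketch of this kind of masked loss, assuming the same NaN mask is applied to both the predictions and the targets. The helper name `masked_l1_loss` and the toy tensors are hypothetical and only illustrate the idea; they are not the original training code.

```python
import torch
import torch.nn as nn


def masked_l1_loss(outputs: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Mean absolute error over the entries whose target is not NaN."""
    mask = ~torch.isnan(labels)
    if not mask.any():
        # No valid targets in this batch: return a zero that is still
        # connected to the graph so backward() does not fail.
        return outputs.sum() * 0.0
    # Index predictions and targets with the same mask.
    return nn.functional.l1_loss(outputs[mask], labels[mask], reduction="mean")


# Toy tensors shaped [batch, sensors_y, 1] (hypothetical values)
outputs = torch.tensor([[[1.0], [2.0]], [[3.0], [4.0]]], requires_grad=True)
labels = torch.tensor([[[1.5], [float("nan")]], [[float("nan")], [2.0]]])

loss = masked_l1_loss(outputs, labels)
loss.backward()
```

Because the NaN entries are dropped before the loss is computed, they contribute neither to the loss value nor to the gradients.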

I have checked the dimensions, the model parameters, and the gradients; the parameters are all updating. The problem is that I always get the same train and validation loss, so it seems the model is not learning. I cannot find the error. It would be great if anyone could help. Thank you all.

Here are my loss curves:

No matter what architecture I use, I get a similar plot.

Since I see a lot of reshape operations in your code, I just leave this here.

I’m not saying that your code is wrong, but you might want to check that it does indeed do what you expect.
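As a small illustration of the kind of check meant here (a sketch on a toy tensor, with made-up sizes): merging trailing dimensions with reshape keeps each time step's readings together, whereas using reshape to "swap" two axes silently mixes samples, and permute is needed for that.

```python
import torch

# Toy tensor shaped [batch, time, sensors, features] = [2, 3, 4, 5]
x = torch.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)

# Merging the two trailing dims keeps each time step's readings together.
flat = x.reshape(x.shape[0], x.shape[1], -1)  # [2, 3, 20]
same_step = torch.equal(flat[0, 1], x[0, 1].flatten())

# Using reshape to "swap" batch and time reinterprets memory and mixes samples ...
wrong = x.reshape(3, 2, 4, 5)
# ... whereas permute moves the axes without mixing anything.
right = x.permute(1, 0, 2, 3)
mixed = not torch.equal(wrong, right)

print(same_step, mixed)
```

Comparing a few slices like this against the original tensor is a quick way to confirm that a reshape does what is intended.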


Thank you so much @vdw, I will definitely have a look and report back here soon. Your discussion of the reshape issue is very helpful! I had a feeling that I needed to check my data transformations.

What is the difference between “sensorsnumbersx” and “features_dim”?

Hi @J_Johnson, yes, there is a difference: sensorsnumbers is the number of sensors, and each sensor provides readings for 7 variables (that is the feature dim). My data is a 3D tensor: the first dim is the timestamp, the second dim is the sensors, and the third dim is the features. For each timestamp, I have readings from sensorsnumbers sensors. Hope that makes it clearer. Thanks.

I am still not clear on the nature of these sensors. Do the sensors contain spatial information, such as sensor1, sensor2, sensor3, etc. are all in a row, but with the same measurements? Or are they separated by distance and/or take different measurements?

If there is spatial information, you may want to run a small 1d or 2d convolution network (depending on how the sensors are spatially related) on them first, before running them through the LSTM. The convolution network should be arranged to encode the spatial information into a vector.

And if they do not contain spatial information, then you would do well to just flatten the last 2 dims before sending it into the LSTM, as you currently have it.
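A rough sketch of the convolution-then-LSTM idea, assuming the sensors can be ordered along one spatial axis so that a 1d convolution makes sense. The class name `ConvLSTM`, the layer sizes, and the number of sensors are all hypothetical choices for illustration, not a recommended configuration.

```python
import torch
import torch.nn as nn


class ConvLSTM(nn.Module):
    """Encode the sensor axis per time step with a 1d conv, then run an LSTM over time."""

    def __init__(self, n_features=7, conv_channels=16, hidden_dim=32, output_dim=5):
        super().__init__()
        # Features act as channels; the convolution slides along the sensor axis.
        self.encoder = nn.Sequential(
            nn.Conv1d(n_features, conv_channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # collapse the sensor axis into one vector
        )
        # batch_first=True because the input is laid out as [batch, time, ...].
        self.lstm = nn.LSTM(conv_channels, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # x: [batch, time, sensors, features]
        b, t, s, f = x.shape
        x = x.reshape(b * t, s, f).permute(0, 2, 1)  # [b*t, features, sensors]
        z = self.encoder(x).squeeze(-1)              # [b*t, conv_channels]
        z = z.reshape(b, t, -1)                      # [batch, time, conv_channels]
        _, (hn, _) = self.lstm(z)
        return self.head(hn[-1])                     # [batch, output_dim]


model = ConvLSTM()
out = model(torch.randn(8, 36, 10, 7))  # 8 samples, 36 steps, 10 sensors, 7 features
print(out.shape)
```

Note that `nn.LSTM` defaults to `batch_first=False`, so with batch-leading input the flag has to be set explicitly, otherwise the batch and time axes are interpreted the wrong way round.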


Thank you for the suggestion. There is spatial information that I am not considering in the model. Let me try again: for each timestamp we get 7 variables, e.g., temperature, humidity, pressure, etc. If we treat the measurements per timestamp as a 2D matrix of shape nsensors*7, then each column is a measured variable and each row is a sensor.

I will definitely try CONV + LSTM architecture. Thank you for your suggestion.

@vdw @J_Johnson, following your leads I made some changes to the model. Now my training loss is decreasing, but the validation loss starts increasing after the first 3 to 6 epochs. I am using SGD with MSE loss and have tried learning rates from high (0.01) to low (0.000001). I still get these error curves. Any suggestions? Thank you so much for helping.

That may depend on your dataset size and representation.

When train loss and val loss are dropping, your model is generalizing to the data.

But when the val loss begins increasing, it means your model is overfitting to the train data, so you should use the weights from when the val loss was at its lowest.
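One way to keep the weights from the best validation epoch is sketched below. The tiny linear model, the random stand-in data, and the `patience` value are assumptions made only so that the example runs on its own.

```python
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)  # stand-in for the real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_func = nn.MSELoss()

# Random stand-in data (hypothetical)
x_train, y_train = torch.randn(64, 4), torch.randn(64, 1)
x_val, y_val = torch.randn(32, 4), torch.randn(32, 1)

best_val, best_state = float("inf"), None
patience, bad_epochs = 5, 0

for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    loss_func(model(x_train), y_train).backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = loss_func(model(x_val), y_val).item()

    if val_loss < best_val:
        # New best: snapshot the weights (deepcopy, since state_dict holds references).
        best_val, bad_epochs = val_loss, 0
        best_state = copy.deepcopy(model.state_dict())
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # val loss has not improved for `patience` epochs

# Restore the weights from the epoch with the lowest validation loss.
model.load_state_dict(best_state)
```

Stopping after a few epochs without improvement also saves training time, but the snapshot-and-restore step is the part that implements "use the weights from the lowest val loss".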
