LSTM on Time series with CrossEntropyLoss is unstable

I am training an LSTM model with batches using CrossEntropyLoss and class weights because I have an unbalanced time series dataset (this is not the main problem).
Since I changed the code to use CrossEntropyLoss instead of MSELoss, the model takes a lot of epochs and doesn’t converge. I am sure it has something to do with the change, but I can’t find the issue.

See the line with the comment below.

def train_model(data_loader, model, loss_function, optimizer):
    num_batches = len(data_loader)
    total_loss = 0
    model.train()

    for X, y in data_loader:
        X = X.to(device=device)
        y = y.flatten().type(torch.LongTensor)  # added for Cross Entropy
        y = y.to(device=device)
        output = model(X)

        loss = loss_function(output, y) 
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    avg_loss = total_loss / num_batches
    
    return avg_loss

I am using batches to train the model.
The output variable looks like
tensor([[ 0.2105, 0.1974, -0.3235],
[ 0.2024, 0.2054, -0.3253], …
while the y is like:
tensor([1, 1, …

whereas before the change y was like
tensor([[1], [1], …
which doesn’t work with CrossEntropyLoss and raises a “0D or 1D target tensor expected, multi-target not supported” error.
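
For reference, a minimal sketch of that shape requirement (hypothetical sizes):

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
logits = torch.randn(4, 3)                 # (N, C): 4 samples, 3 classes

y_2d = torch.ones(4, 1, dtype=torch.long)  # shape (4, 1)
# criterion(logits, y_2d)                  # raises the error above

y_1d = y_2d.flatten()                      # shape (4,): what flatten() fixes
loss = criterion(logits, y_1d)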

I’ve got 3 classes; I’ve tried changing the hidden layers and the learning rate, but the problem persists.

Any help will be highly appreciated
Thanks

You have a classification task, so MSELoss, which is meant for regression tasks, is not suitable.

Could you also post the code of your model?

A couple of troubleshooting questions:

  1. Can you share the code used to calculate the weights?
  2. What is your lr set to?
  3. Can you describe the input dims? For example, (sequence length, batch size, features). And is batch_first set to True or False?

Model
Very simple LSTM implementation (see below). I’ve tried different learning rates and batch sizes (e.g. lr = 0.01, 0.001; batch sizes 48, 200, 500).

Input dataset:

  • category 1: 50,000 samples
  • category 2: 20,000 samples
  • category 3: 20,000 samples

Total of 90,000.

I’ve calculated the weights with the usual formula: class_1_weight = total_samples / (num_classes * class1_samples),
which gives [0.6, 1.5, 1.5].
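
For reference, a quick sketch of that calculation in PyTorch (same counts as above):

import torch

counts = torch.tensor([50_000., 20_000., 20_000.])  # samples per class
weights = counts.sum() / (len(counts) * counts)     # total / (num_classes * count)
print(weights)                                      # tensor([0.6000, 1.5000, 1.5000])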

An example of an input is something like [0.45297918, 0.60572616]. I am trying with 2 to 10 features.
I have a MinMaxScaler (0 to 1); I’ve also tried without it.

batch_first is True.
So, for batch size 64, the tensor is torch.Size([64, 1, 2]), with data something like
tensor([[[0.6057, 0.4530]], [[0.7728, 0.4666]], …

The result is still output imbalanced towards the main class.

class MyLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_classes):
        super(MyLSTM, self).__init__()
        
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)
        
    def forward(self, x):
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(device=device)
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(device=device)
        
        out, _ = self.lstm(x, (h0, c0)) 
        out = self.fc(out[:, -1, :])      
        
        return out

I noticed your sequence length is 1. Additionally, you are initializing the hidden state and cell state on every forward pass. The LSTM and other RNNs do best on sequence data, connecting relationships between temporal or spatial steps via the hidden state and cell state. As you currently have this written, a simple linear layer may do the same thing.

By the way, the PyTorch LSTM allows you to pass in the sequence, an initialized hidden and cell state, and then it runs the loop under the C++ hood.
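
For example, a minimal sketch with assumed sizes (batch 64, sequence length 10, 2 features):

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=2, hidden_size=8, num_layers=1, batch_first=True)

x = torch.randn(64, 10, 2)          # (batch, seq_len, features)
h0 = torch.zeros(1, 64, 8)          # (num_layers, batch, hidden_size)
c0 = torch.zeros(1, 64, 8)

out, (hn, cn) = lstm(x, (h0, c0))   # the loop over the 10 steps runs internally
print(out.shape)                    # torch.Size([64, 10, 8]): one output per time step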

Alternatively, if you want to run 1 sequential step at a time, you may want to move the h0 and c0 initialization outside of the forward pass and pass those as inputs to your forward method, OR establish some mechanism so they are only initialized on the first step. For example:

    def forward(self, i, x):  # where i is a simple counting mechanism, i.e. 0, 1, 2, 3 ...
        if i == 0:  # only initialize on the first time step
            # store these on the module until reinitialized via self
            self.h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(device=device)
            self.c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(device=device)
        
        out, _ = self.lstm(x, (self.h0, self.c0)) 
        out = self.fc(out[:, -1, :])      
        
        return out

Thanks for that.
So, what you are saying is to initialise at the beginning of every epoch instead of at every batch?

If so, I’ve just tried it. I had to handle the fact that the last batch has a different size.
Anyway, the loss curve is definitely better, while the result is the same.
For instance, checking the recall over the three categories for every configuration I try, I get roughly 0.9 for the most prominent one and ~0.2 for the others.

Any additional thoughts?

Let’s assume at each time step you have some corresponding label, i.e. at time step 0 with input x{0}, the output should be y{0}.

So we put into the LSTM x{0-n}, with a size of (batch_size, n, in_features). And we have corresponding labels of size (batch_size, n, out_features). When we get to the Linear layer, the only thing that needs to match its in_features size is the final dim of our LSTM output; the fact that it has 3 dims doesn’t matter. So we get out of the linear layer a size of (batch_size, n, out_features), which matches the labels’ size, so we can use that in our loss function.
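
For instance, a sketch of those shapes with made-up sizes:

import torch
import torch.nn as nn

batch_size, n, hidden_size, num_classes = 4, 10, 8, 3

lstm_out = torch.randn(batch_size, n, hidden_size)  # (batch_size, n, hidden)
fc = nn.Linear(hidden_size, num_classes)

logits = fc(lstm_out)               # Linear only acts on the last dim
print(logits.shape)                 # torch.Size([4, 10, 3])

# CrossEntropyLoss wants (N, C) logits and (N,) class indices,
# so flatten the batch and time dims together:
y = torch.randint(0, num_classes, (batch_size, n))
loss = nn.CrossEntropyLoss()(logits.reshape(-1, num_classes), y.reshape(-1))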

Thanks for your responses.

That’s where I get confused. If each time step effectively is training over a single mini-batch, that’s what my code already does, or am I getting that wrong?

Looking at the for loop (extract below), “forward” gets called once per batch.
Unless there is something under the hood that does the forward for each element of the mini-batch. So, if I had an input of (batch_size, 3, in_features), would forward be called 3 times for each batch? Is that correct?

 for X, y in data_loader:
        X = X.to(device=device)
        y = y.flatten().type(torch.LongTensor)  # added for Cross Entropy
        y = y.to(device=device)
        output = model(X) # here there is a forward call

Anyway, following your previous point, I’ve changed the model and the loop as below, but I still have the same problem. I’m starting to think it’s about the features I use.

class SimpleLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_classes):
        super(SimpleLSTM, self).__init__()
        
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)
        
    def forward(self, x, init):
        if init:
            self.h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(device=device)
            self.c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(device=device)
        out, _ = self.lstm(x, (self.h0, self.c0)) 
        out = self.fc(out[:, -1, :])      
        
        return out

def train_model(data_loader, model, loss_function, optimizer):
    num_batches = len(data_loader)
    total_loss = 0
    model.train()
    init = True
    for X, y in data_loader:
        X = X.to(device=device)
        y = y.flatten().type(torch.LongTensor) # only for Cross Entropy
        y = y.to(device=device)
        output = model(X, init)
        init = False
        loss = loss_function(output, y) # only for Cross Entropy
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    avg_loss = total_loss / num_batches
    
    return avg_loss

....

Anyway, I’ve tried everything: changing features, passing more series, but I can’t get a nice result. The output is still highly unbalanced.

There are two ways you can run an LSTM. One is you handle the additional loop inside your train loop. The other is you let PyTorch handle it in C++ under the hood. The latter is faster and takes less code.

Your current code is taking the outputs of the dataloader, which are likely shuffled, and sending them into the model, which means your sequential information is likely being lost at the dataloader.

Either way, you need the dataloader to return actual cross sections of your sequences. You want the dataloader to give out something of size (batch_size, sequence_length, features), where sequence_length can be any value you choose (ideally a larger value if you are using an LSTM, to take advantage of its abilities). Then you would need to send into the model (batch_size, 1, features) with your current setup, where dim = 1 sequentially proceeds through each step. This means you would need a second loop inside the train loop. Alternatively, you could do something like this:

import torch
import torch.nn as nn
import math

device = torch.device("cpu")

class SimpleLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, num_classes):
        super(SimpleLSTM, self).__init__()

        self.hidden_size = hidden_size
        self.num_layers = num_layers

        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        self.h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(device=device)
        self.c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(device=device)
        out, _ = self.lstm(x, (self.h0, self.c0))
        out = self.fc(out)
        return out

def get_class_targets(X):
    #get sine values
    raw_targets = torch.sin(X)

    #assign 3 classes with a 0.6, 0.2, 0.2 distribution
    targets = torch.zeros_like(raw_targets)
    mask1 = raw_targets>math.sin(0.45*math.pi)
    mask2 = raw_targets<math.sin(-0.45*math.pi)
    targets[mask1] = 1.
    targets[mask2] = 2.
    return targets

def get_class_accuracy(output, y, a_class):
    batch_size = output.size(0)
    seq_length = output.size(1)
    output = output.reshape(batch_size * seq_length, 3).detach().argmax(dim=1)
    y = y.reshape(-1)
    output = output[y==a_class]
    y_filtered = y[y==a_class]
    matching = output == y_filtered
    return torch.sum(matching.to(dtype=torch.long)) / (y_filtered.size(0))

def train_model(model, loss_function, optimizer):
    num_batches = 50000
    total_loss = 0
    model.train()

    batch_size = 20
    seq_length = 10
    for i in range(num_batches): #with a dataloader, you could just create an i = 0 outside the loop and then use i+=1 in the loop
        # this just makes a batch of integers with random starting points from 0 to 8, that count 10
        X = torch.randint(0, 8, (batch_size, 1)).repeat(1, seq_length) + torch.arange(0, seq_length, dtype=torch.float32).unsqueeze(0).repeat(batch_size, 1)

        y = get_class_targets(X).to(dtype=torch.long)

        output = model(X.unsqueeze(2))

        loss = loss_function(output.reshape(batch_size*seq_length, 3), y.reshape(-1))  # only for Cross Entropy

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        accuracy0 = get_class_accuracy(output, y, 0).item()
        accuracy1 = get_class_accuracy(output, y, 1).item()
        accuracy2 = get_class_accuracy(output, y, 2).item()

        print("Loss", loss.item(), "Class0 Accuracy", accuracy0, "Class1 Accuracy", accuracy1, "Class2 Accuracy", accuracy2)
    avg_loss = total_loss / num_batches

    return avg_loss

model = SimpleLSTM(input_size=1, hidden_size=256, num_layers=1, num_classes=3)

optimizer = torch.optim.SGD(model.parameters(), lr = 0.01)
weights = 1/(3*torch.tensor((0.9,0.05,0.05)))
print(weights)
criterion = nn.CrossEntropyLoss(weight = weights)

avg_loss = train_model(model, criterion, optimizer)

In the above, I train your model on integers, and it must predict the class of the sine value at each step in the sequence. Note that I send into the model (batch_size, sequence_length, features), and the model returns (batch_size, sequence_length, classes). So this lets the LSTM handle the loop internally.

The above problem also has an unbalanced class distribution of 0.9, 0.05, 0.05. I included an accuracy metric broken down by class so you can see that it is learning.

Thanks for taking the time to explain in detail.

I have run your code and compared it with mine. Effectively, I did the same thing; the only difference is my sequence length (dim = 1). So, what I am getting from your explanation is that forward gets called under the hood for every sequence. For instance, if we pass (batch_size, sequence_length, features) = 100, 10, 3, forward is called 100 times.

I’ve increased the seq_length (I am using torch.utils.data.DataLoader), and unfortunately I can’t get nice results, which I believe implies my input data is not representative. So, I will go back to the drawing board.

Many thanks

The batch_size dim gets treated in parallel. But the sequence_length dim is what goes through a loop under the hood: the LSTM takes each time step, i.e. x[:, 0:1, :], and passes it through with the current h0 and c0, which update to h1 and c1. Then those get passed in with x[:, 1:2, :], and so on.

That yields an output of size (batch_size, sequence_length, hidden_size), where each of those time-step outputs is ready for backprop.
Finally, that output will go through any additional layers you may have (e.g. Linear).
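
In other words, the loop under the hood is roughly equivalent to this manual version (a sketch, not the actual C++ implementation):

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=1, hidden_size=8, num_layers=1, batch_first=True)

x = torch.randn(4, 10, 1)                 # (batch, seq_len, features)
h = torch.zeros(1, 4, 8)                  # (num_layers, batch, hidden_size)
c = torch.zeros(1, 4, 8)

# step through the sequence one time step at a time,
# carrying h and c forward between steps
steps = []
for t in range(x.size(1)):
    out_t, (h, c) = lstm(x[:, t:t+1, :], (h, c))
    steps.append(out_t)
manual = torch.cat(steps, dim=1)          # (4, 10, 8)

full, _ = lstm(x)                         # defaults to zero initial states
print(torch.allclose(manual, full, atol=1e-6))   # True: same computation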