LSTM make sure output sums to 100

I am trying to predict a set of 7 classes over sequences of 4 time steps. The predicted values are numerical. For each time step, the sum of the predicted values of the 7 classes needs to be 100. One class can have a value of 100, but then the other classes are by definition 0. My LSTM outputs a tensor with shape [batch_size, sequence_length, output_size], where batch_size = 64, sequence_length = 4 and output_size = 7. However, my LSTM currently sometimes predicts something like the following:

[[0, 0, 0, 44, 0, 6, 0],
 [100, 0, 0, 0, 0, 0, 0],
 [78, 0, 0, 5, 0, 0, 0],
 [0, 30, 0, 0, 70, 0, 0]]

This depicts a four-step sequence over the classes. As you can see, in time step 1: 44 + 6 != 100, and likewise in time step 3: 78 + 5 != 100. However, I want my LSTM model to predict values that sum to 100 for every time step.

My LSTM model:

import torch
import torch.nn as nn
from torch.nn import init


class LSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, num_layers=1):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers=num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, output_size)
        init.xavier_uniform_(self.linear.weight)
        
    def forward(self, x):
        x, _ = self.lstm(x)
        x = self.linear(x)
        x = torch.clamp(x, min=0, max=100)  
        return x

Here I am using x = torch.clamp(x, min=0, max=100) to constrain the output of each class per time step to be at minimum 0 and at maximum 100. Before, it sometimes predicted negative values. I now wonder if there is a similar way to constrain the output of the LSTM to sum to exactly 100, nothing more or less. So something like: x = torch(sum(x) = 100)

Currently, I simply rescale my output (x / sum * 100) after it has come out of the model, but I wonder if I can do this within the model, so that I force the model to predict something that (closely) adds up to 100. Because right now it sometimes predicts sums lower than 50 (and sometimes even 0), and then rescaling will only hurt accuracy, I think.
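
For reference, what I do now after the forward pass is roughly this (just a sketch of my post-processing, variable names are illustrative):

pred = model(batch_x)                          # (64, 4, 7), clamped to [0, 100]
sums = pred.sum(dim=-1, keepdim=True)          # (64, 4, 1) per-time-step totals
pred = pred / sums.clamp(min=1e-8) * 100       # rescale each time step to sum to 100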

Is this doable within the LSTM model? Or do I simply need to be more clever when rescaling it later? Thanks in advance!

Hi,

you are in a classification setting, therefore I would suggest you use either a sigmoid or a softmax as the output activation function.

Clamping is probably not a good idea, as it gives a flat (zero) gradient wherever the output is clamped, which brings no information when learning.
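
A tiny illustration of that flat gradient (a toy example, not your training code):

import torch

x = torch.tensor([150.0, -20.0, 50.0], requires_grad=True)
y = torch.clamp(x, min=0, max=100).sum()
y.backward()
print(x.grad)   # tensor([0., 0., 1.]) -> zero gradient wherever the clamp saturates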

Hi, thanks for your reply. What do you mean by “in a classification setting”? Should I be doing self.linear.activation = sigmoid, for example?

Could you also elaborate on the clamping part? What should I be doing instead? Also, what should I do to get the sum to be (near) 100? Or do you mean that I shouldn’t do that at all in the model, as that would give flat gradients? Sorry, I’m new to this.

You could do something like this:

class LSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, num_layers=1):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers=num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, output_size)
        init.xavier_uniform_(self.linear.weight)
        
    def forward(self, x): 
        # x is expected to be (B, T, N), since the LSTM was created with batch_first=True
        x, _ = self.lstm(x)  
        # x is now (B, T, hidden_size)
        x = self.linear(x)
        # x is now (B, T, output_size)
        x = x.softmax(dim=-1)
        # now every entry is in [0, 1] and x.sum(dim=-1) == 1 for every batch and time step
        return x

This would return, for every batch and every time step, a distribution over the output_size indices that is guaranteed to have components in [0, 1] and to sum to 1.0. If you want it to sum to 100, just scale it by 100.
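
For instance, using the class above (input_size=10 and hidden_size=32 are just placeholder values here):

import torch

model = LSTMModel(input_size=10, hidden_size=32, output_size=7)
x = torch.randn(64, 4, 10)       # (B, T, input_size)
out = model(x)                   # (64, 4, 7), softmax already applied in forward()
print(out.sum(dim=-1))           # all ones, up to float precision
percentages = out * 100          # each time step now sums to 100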

I said you are in a classification setting because, the way you describe your problem, you have 7 mutually exclusive classes. With a softmax output, we produce a distribution over the 7 classes for every batch and every time instant.

Note however that if you were to train your network using a cross-entropy loss, you might prefer not to apply the softmax at the output of your model, as the PyTorch CrossEntropyLoss embeds the softmax in its definition.
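
For instance, a minimal sketch of that (assuming you drop the softmax from forward() so the model returns raw logits, and that targets already holds per-class probabilities of shape (B, T, 7)):

import torch.nn as nn

criterion = nn.CrossEntropyLoss()              # applies the (log-)softmax internally

logits = model(x)                              # (B, T, 7) raw scores, no softmax in forward()
# CrossEntropyLoss expects the class dimension right after the batch dimension,
# so merge batch and time into one dimension:
loss = criterion(logits.reshape(-1, 7), targets.reshape(-1, 7))
loss.backward()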

Is that clearer?

Thanks very much for your explanation and example, that’s clear. When using softmax or CrossEntropyLoss, should I scale my target data first to a range of 0 to 1? (It is now 0 to 100.) My terminology might not be entirely correct: by “target data” I mean the ‘Y’ (the class values in this case) of the training and test data that the model learns from. I normalize my X data beforehand, but do the Y values also need to be rescaled for this to work? I will play around with it!

Indeed, if you read the documentation on the cross-entropy loss, it states that your target can be either class indices or class probabilities.

Target: If containing class indices, shape (), (N) or (N, d1, d2, …, dK) with K ≥ 1 in the case of K-dimensional loss, where each value should be in [0, C). If containing class probabilities, same shape as the input and each value should be in [0, 1].

If you apply a softmax as an output activation function of your model, the softmax will output components in [0, 1].
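
So in your case the target rescaling could simply be a division by 100 (a sketch, reusing criterion and logits from the snippet above; y is assumed to be your (B, T, 7) target tensor on the 0-100 scale, with each time step summing to 100):

target_prob = y / 100.0            # class probabilities in [0, 1]; each time step now sums to 1
loss = criterion(logits.reshape(-1, 7), target_prob.reshape(-1, 7))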