Loss function for vectors when magnitude of elements and their position are both important

Context:

I am using a transformer for time series prediction.

The target and predicted tensors are both of size (8, 10, 181), which represents (batch_size, number of predictions, number of elements in vector). The last dimension is a 181-element vector representing the energy distributed across angles from 0 to 180 degrees, one element per degree. It is basically a heatmap of energy spread across 180 degrees. Each timestamp in this series has a heatmap vector associated with it.

If the target vector is [0, 0, 1, 0, 0], then from the following two predicted vectors, the first one is more ‘similar’ to the target than the second.
1st prediction–> [0, 1, 0, 0, 0]
2nd prediction–> [1, 0, 0, 0, 0]

The actual vectors look like this:

0.27,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-27.858261076760662,-11.692805944935555,0.0,0.0

While the vectors here are sparse, this may not always be the case depending on how the energy is distributed. Moreover, I have found that the initial predictions by the transformer contain non-zero values in almost all the elements.

The predicted and target vectors need to be similar in terms of magnitude and position for the model to be able to predict the heatmap for future timestamps in the time series.

Dot product, cosine similarity, and MSE won’t work for this use case by themselves, so I decided to combine them.

I am using a combination of MSE loss and cosine similarity as follows in a custom loss function with a goal to minimise the MSE loss and maximise the cosine similarity. The cumulative sum operation is being done to aid the calculation of the cosine similarity.
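To illustrate what the cumulative sum does here, below is a minimal sketch using the five-element example from earlier in the post. The variable names (near, far) are only for illustration and are not part of the loss class. On the raw vectors both predictions have a cosine similarity of 0 with the target, whereas on the cumulative sums the prediction that is one position off scores higher than the one that is two positions off.

```python
import torch
import torch.nn.functional as F

target = torch.tensor([0., 0., 1., 0., 0.])
near = torch.tensor([0., 1., 0., 0., 0.])  # peak is one position off
far = torch.tensor([1., 0., 0., 0., 0.])   # peak is two positions off

# Cosine similarity on the raw vectors: the peaks never overlap,
# so both predictions look equally dissimilar to the target.
raw_cos_near = F.cosine_similarity(near, target, dim=-1)
raw_cos_far = F.cosine_similarity(far, target, dim=-1)

# Cosine similarity on the cumulative sums: the running totals overlap
# more when the peak is closer to the target's peak.
cum_target = torch.cumsum(target, dim=-1)
cum_cos_near = F.cosine_similarity(torch.cumsum(near, dim=-1), cum_target, dim=-1)
cum_cos_far = F.cosine_similarity(torch.cumsum(far, dim=-1), cum_target, dim=-1)

print(raw_cos_near.item(), raw_cos_far.item())
print(cum_cos_near.item(), cum_cos_far.item())
```

This only shows the effect on one-hot vectors; how it behaves on dense or negative-valued heatmaps is something I have not verified.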

import torch
import torch.nn as nn


class Custom_Loss(nn.Module):
    """
    Class containing an implementation of a custom loss function.
    """

    # Define cosine similarity and MSE as class variables.
    cos = nn.CosineSimilarity(dim=-1) # Dim = -1 as our tensor is (batch_size, preds, len_heatmap)
    mse = torch.nn.MSELoss()

    def __init__(self) -> None:
        """
        Constructor for custom_loss class
        """
        super(Custom_Loss, self).__init__()
    
    def forward(self, predicted, target):
        """
        Custom loss function that combines cosine similarity
        and MSE as:
        Loss = (w1 * MSE Loss) / (w2 * Cosine Similarity)

        Where: Mean of MSE and COS_SIM are used over length=no_of_preds.
        """

        # Define weights for functions for Cos and MSE.
        w1 = 1
        w2 = 10
        
        # Apply cumulative sum to both tensors and calculate loss.
        cos_sim = torch.abs(self.cos(torch.cumsum(predicted, dim=-1), torch.cumsum(target, dim=-1))).mean()
        mse_loss = self.mse(torch.cumsum(predicted, dim=-1), torch.cumsum(target, dim=-1))
        loss = (w1 * mse_loss) / (w2 * cos_sim)
        return loss

Problem:

After trying to train the transformer based on this loss function, I found that the training loss was high but kept decreasing, while the validation loss remained high throughout the training run. Varying the batch size and learning rates has also not changed things.

Questions:

  1. Are there any flaws with how this is written?
  2. Is this a reasonable approach to writing a loss function for the goal stated above?
  3. Would autograd be able to work with this when I do loss.backward()?
  4. Finally, for the model/use case I have described, is this a good enough loss function? Any suggestions regarding a better loss function are also welcome.

I’m no expert, but I’ll try to help with a few of your questions:

  1. The implementation looks okay. The line loss = (w1 * mse_loss) / (w2 * cos_sim) explains why you have high training and validation loss. As long as the values are going down, the model is learning. However, if the heatmaps are sparse (which might be the case), it could just be learning to output 0 to reduce the loss.
  2. The approach is reasonable.
  3. Autograd would work with your custom loss function.
  4. I would recommend that you either minimize the MSE loss or maximize the cosine similarity. Either one effectively achieves your end goal of making the predicted and target vectors equal in both magnitude and position. (Try training the transformer without your custom loss function; it might work.)
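For what it's worth, points 1 and 3 can be checked directly. The sketch below rewrites the forward() from the question as a plain function (the name ratio_loss is hypothetical) and runs it on random tensors of the stated shape (8, 10, 181). It checks that backward() yields finite gradients, and it also evaluates one edge case: an all-zero prediction, where the cosine similarity is 0 and the ratio divides by zero.

```python
import torch
import torch.nn.functional as F


def ratio_loss(predicted, target, w1=1.0, w2=10.0):
    # Same formula as forward() in the question, written as a function.
    cum_pred = torch.cumsum(predicted, dim=-1)
    cum_target = torch.cumsum(target, dim=-1)
    cos_sim = torch.abs(F.cosine_similarity(cum_pred, cum_target, dim=-1)).mean()
    mse_loss = F.mse_loss(cum_pred, cum_target)
    return (w1 * mse_loss) / (w2 * cos_sim)


torch.manual_seed(0)
target = torch.rand(8, 10, 181)  # random stand-in data, not real heatmaps
predicted = torch.rand(8, 10, 181, requires_grad=True)

# Autograd check: every op above is differentiable, so backward() should run.
loss = ratio_loss(predicted, target)
loss.backward()
grad_ok = bool(torch.isfinite(predicted.grad).all())

# Edge case: an all-zero prediction makes the cosine similarity 0,
# so the denominator of the ratio is 0.
zero_loss = ratio_loss(torch.zeros(8, 10, 181), target)

print(loss.item(), grad_ok, zero_loss.item())
```

If the denominator can get close to zero during training, adding a small constant to it or switching to an additive form such as w1 * mse_loss + w2 * (1 - cos_sim) may be more stable, though I have not tested either on this data.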

Thank you, Puranjay!

The concern I have with using only the cosine similarity is that it will treat the 2nd and 3rd of the following vectors the same, even though the 3rd one is ‘closer’ to the 1st if we assume the 1st to be the target:
[1, 0, 0, 0, 0] [0, 0, 1, 0, 0] [0, 1, 0, 0, 0]

Cosine similarity will be insensitive to that.

On the other hand, the concern I have with using only the MSE loss is that multiple vectors can give you the same MSE loss. For example, the 2nd and 3rd vectors above would have the same MSE loss against the 1st.
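A quick numerical check of both concerns on these three vectors, as a minimal sketch (the variable names are only for illustration):

```python
import torch
import torch.nn.functional as F

target = torch.tensor([1., 0., 0., 0., 0.])
two_away = torch.tensor([0., 0., 1., 0., 0.])  # 2nd vector in the list
one_away = torch.tensor([0., 1., 0., 0., 0.])  # 3rd vector in the list

# Cosine similarity with the target is 0 for both candidates.
cos_two = F.cosine_similarity(two_away, target, dim=-1)
cos_one = F.cosine_similarity(one_away, target, dim=-1)

# MSE against the target is also identical: two elements differ by 1 in each case.
mse_two = F.mse_loss(two_away, target)
mse_one = F.mse_loss(one_away, target)

print(cos_two.item(), cos_one.item())
print(mse_two.item(), mse_one.item())
```

Neither measure can tell that the 3rd vector is only one position away from the target.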

I need to be able to predict which degrees (vector elements) will have the energy distributed over them for the future time steps. Hence, I thought to combine those two. Could there be a reasonable transformation I could apply to the vectors and then apply one of the conventional loss functions?
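One candidate transformation is the cumulative sum already used above, paired with a plain L1 distance. The sketch below is an assumption on my part, not something validated on this data, and the function name cdf_l1_loss is hypothetical. For non-negative vectors with the same total, the summed absolute difference between cumulative sums equals the 1-D earth mover's distance with unit spacing between elements, so it grows with how far the energy has to move. The real heatmaps contain negative values, so that interpretation may not hold strictly.

```python
import torch


def cdf_l1_loss(predicted, target):
    """Mean absolute difference between cumulative sums along the last dim."""
    cum_pred = torch.cumsum(predicted, dim=-1)
    cum_target = torch.cumsum(target, dim=-1)
    return (cum_pred - cum_target).abs().mean()


target = torch.tensor([1., 0., 0., 0., 0.])
two_away = torch.tensor([0., 0., 1., 0., 0.])
one_away = torch.tensor([0., 1., 0., 0., 0.])

# The loss scales with the distance between the peaks.
loss_one = cdf_l1_loss(one_away, target)
loss_two = cdf_l1_loss(two_away, target)
print(loss_one.item(), loss_two.item())

# It also works on batched tensors of shape (batch_size, preds, len_heatmap).
torch.manual_seed(0)
batch_pred = torch.rand(8, 10, 181, requires_grad=True)  # random stand-in data
batch_target = torch.rand(8, 10, 181)
batch_loss = cdf_l1_loss(batch_pred, batch_target)
batch_loss.backward()
```

Because the mean is taken over all elements, the values here are the summed distance divided by the vector length (5 in the toy example).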

Okay. Thank you for explaining it again.
My guess is that this custom function will not serve your purpose, and neither will the MSE loss or cosine similarity. These losses don’t capture the ‘closeness’ of values in your output sequence.
I’m not sure how one would write a custom loss function that does this, but you shouldn’t worry about it. An accurate model gains an understanding of the input data, so your final, low-loss model is going to give outputs that are similar to the label (similar in terms of closeness).
This is why the top-5 accuracy measure for image recognition models exists (one of many examples).

It would be interesting to come up with a loss like this though. Will give my input if I find something.

Thanks! I am also continuing to look for an appropriate loss function for this. Do let me know if you come across something.