Context:
I am using a transformer for time series prediction.
The target and predicted tensors are both of size (8, 10, 181) which represents (batch_size, number of predictions, no. of elements in vector), where the last dimension is a 181-element vector representing the energy distributed across angles from 0-180. It is basically a heatmap of energy spread across 180 degrees. Each timestamp in this series has a heatmap vector associated with it.
If the target vector is [0, 0, 1, 0, 0], then from the following two predicted vectors, the first one is more ‘similar’ to the target than the second.
1st prediction–> [0, 1, 0, 0, 0]
2nd prediction–> [1, 0, 0, 0, 0]
The actual vectors look like this:
0.27,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-27.858261076760662,-11.692805944935555,0.0,0.0
While the vectors here are sparse, this may not always be the case depending on how the energy is distributed. Moreover, I have found that the initial predictions by the transformer contains non-zero values in almost all the elements.
The predicted and target vectors need to be similar in terms of magnitude and position for the model to be able to predict the heatmap for future timestamps in the time series.
Dot product, cosine similarity, and MSE, won’t work for this use case by themselves, so I thought to combine them.
I am using a combination of MSE loss and cosine similarity as follows in a custom loss function with a goal to minimise the MSE loss and maximise the cosine similarity. The cumulative sum operation is being done to aid the calculation of the cosine similarity.
import torch
import torch.nn as nn
class Custom_Loss(nn.Module):
"""
Class containing an implementation of a custom loss function.
"""
# Define cosine similarity and MSE as class variables.
cos = nn.CosineSimilarity(dim=-1) # Dim = -1 as our tensor is (batch_size, preds, len_heatmap)
mse = torch.nn.MSELoss()
def __init__(self) -> None:
"""
Constructor for custom_loss class
"""
super(Custom_Loss, self).__init__()
def forward(self, predicted, target):
"""
Custom loss function that combines cosine similarity
and MSE as:
Loss = (w1 * MSE Loss) / (w2 * Cosine Similarity)
Where: Mean of MSE and COS_SIM are used over length=no_of_preds.
"""
# Define weights for functions for Cos and MSE.
w1 = 1
w2 = 10
# Apply cumulative sum to both tensors and calculate loss.
cos_sim = torch.abs(self.cos(torch.cumsum(predicted, dim=-1), torch.cumsum(target, dim=-1))).mean()
mse_loss = self.mse(torch.cumsum(predicted, dim=-1), torch.cumsum(target, dim=-1))
loss = (w1 * mse_loss) / (w2 * cos_sim)
return loss
Problem:
After trying to train the transformer based on this loss function, I found that the training loss was high but kept decreasing, while the validation loss remained high throughout the training run. Varying the batch size and learning rates has also not changed things.
Questions:
- Are there any flaws with how this is written?
- Is this a reasonable approach to writing a loss function for the goal stated above?
- Would autograd be able to work with this when I do loss.backward()?
- Finally, for the model/use case I have described, is this a good enough loss function? Any suggestions regarding a better loss function are also welcome.