I am playing around with the idea of doing an NLP-style sequence prediction for time series data. In my data, labels often cluster together: a ‘2’ label will often be followed by another ‘2’ label, and so forth. For that reason, instead of predicting each label in isolation, I thought it would be interesting to predict a sequence of labels where the output has access to the recent labels, as is common in translation tasks and NLP transformers. The issue I’m having is knowing which loss function would work.

In my case, I have 3 classes to predict, and I would like to predict a class at each of 24 timesteps into the future. Therefore, the input to the loss function would be something like (256, 24, 3) => (batch, predicted sequence length, class logits).

The PyTorch CrossEntropyLoss docs say the following: “The performance of this criterion is generally better when target contains class indices, as this allows for optimized computation.” I’m assuming that “class indices” means not one-hot encoding the targets, and just keeping them as integer indices (in my case 0, 1, or 2).
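To check my understanding of the class-indices form, here is a minimal single-step example (the shapes and values are just made up for illustration):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 3)               # (batch, num_classes)
targets = torch.tensor([0, 2, 1, 1])     # integer class indices, not one-hot
loss = F.cross_entropy(logits, targets)  # target shape is just (batch,)
print(loss.item())
```

This runs fine, which matches my reading of the docs for the non-sequence case.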

So that would mean my ground truth matrix should be of shape (256, 24) or, to match dimensions, (256, 24, 1). Where am I misunderstanding how CrossEntropyLoss works? To test this, I ran the following code.

```
import torch
from torch import nn

# Method 1: a sequence of predictions per example
x = torch.randn(2, 10, 3)  # (2 examples, 10 prediction length, 3 logits)
y = torch.tensor([[1, 2, 0, 0, 0, 1, 2, 0, 2, 1],
                  [1, 2, 0, 0, 2, 1, 2, 1, 2, 2]]).unsqueeze(dim=-1)

# Method 2: a single prediction per example (overwrites x and y above)
x = torch.randn(10, 3)
y = torch.tensor([1, 0, 0, 0, 2, 2, 2, 1, 1, 0])  # labels must lie in [0, 2]

class Myloss(nn.Module):
    def __init__(self):
        super().__init__()
        self.loss_function = nn.CrossEntropyLoss()

    def forward(self, y_pre, y_true):
        y_true = y_true.long()  # class indices must be integers, not floats
        loss = self.loss_function(y_pre, y_true)
        return loss

loss = Myloss()
loss(x, y)
```

I’m also curious as to why the second method works and the first doesn’t.
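For reference, based on my reading of the K-dimensional case in the docs, this is the variant I would expect to work for the sequence setup, with the class dimension permuted into position 1 (I may be misreading, and the targets here are random placeholders):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(256, 24, 3)          # (batch, seq, classes) -- my model's layout
targets = torch.randint(0, 3, (256, 24))  # (batch, seq) class indices, no trailing dim

# cross_entropy seems to expect (N, C, d1), i.e. classes in dim 1,
# so the logits would need to be permuted before calling it
loss = F.cross_entropy(logits.permute(0, 2, 1), targets)
print(loss.item())
```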