For example, I have a net that takes a tensor of shape [N, 7] (N is the number of samples) as input and produces a tensor of shape [N, 4] as output, where the 4 values represent the probabilities of the different classes.

The training labels are a tensor of shape [N], with values ranging from 0 to 3 (the ground-truth class).

Here’s my question: I’ve seen some demos that apply the loss function directly to the output tensor and the label tensor. I wonder why this works, since the two tensors have different sizes, and their sizes don’t seem to fit the “broadcasting semantics”.
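
To make the shape concern concrete, here is a quick check of whether the two shapes are broadcast-compatible. It assumes a PyTorch version that provides `torch.broadcast_shapes`; `can_broadcast` is just a helper name made up for this illustration.

```python
import torch

def can_broadcast(*shapes):
    """Return True if the given shapes are broadcast-compatible."""
    try:
        torch.broadcast_shapes(*shapes)
        return True
    except RuntimeError:
        # torch.broadcast_shapes raises RuntimeError for incompatible shapes
        return False

# Output shape [N, 4] vs. label shape [N], with N = 2 as in my case
print(can_broadcast((2, 4), (2,)))
# For contrast: a trailing dimension of 4 would line up with [N, 4]
print(can_broadcast((2, 4), (4,)))
```

So with N = 2 the label shape really does not broadcast against the output shape, which is why I doubt broadcasting is what makes the loss call work.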

Here’s the minimal demo.

```
import torch
import torch.nn as nn
import torch.optim as optim

if __name__ == '__main__':
    features = torch.randn(2, 7)   # input, shape [N, 7] with N = 2
    gt = torch.tensor([1, 1])      # ground-truth class indices, shape [N]
    model = nn.Sequential(
        nn.Linear(7, 4),
        nn.ReLU(),
        nn.Linear(4, 4)
    )
    optimizer = optim.SGD(model.parameters(), lr=0.005)
    f = nn.CrossEntropyLoss()
    for epoch in range(1000):
        optimizer.zero_grad()
        output = model(features)   # shape [N, 4]
        loss = f(output, gt)       # [N, 4] output against [N] labels
        loss.backward()
        optimizer.step()
```
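
For comparison, here is a rough sketch of what the loss seems to do with the [N] labels. It assumes the labels are used as class indices into the [N, 4] output rather than being broadcast against it, and that the default `mean` reduction applies. The names `logits`, `target`, `picked`, and `manual` are only illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(2, 4)       # [N, C] raw scores, like the model output
target = torch.tensor([1, 1])    # [N] class indices in 0..C-1

# Built-in loss: accepts [N, C] and [N] directly.
builtin = F.cross_entropy(logits, target)

# Manual version: log-softmax over the class dimension, then pick the
# log-probability of the true class for each sample by indexing.
log_probs = F.log_softmax(logits, dim=1)                       # [N, C]
picked = log_probs.gather(1, target.unsqueeze(1)).squeeze(1)   # [N]
manual = -picked.mean()          # average negative log-likelihood

print(builtin.item(), manual.item())
```

If the two printed values agree, that would suggest each label selects one column per row of the output, so no elementwise operation between the [N, 4] and [N] tensors is needed.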