Why can the loss function be applied to tensors of different sizes?

For example, I have a net that takes a tensor of shape [N, 7] (N is the number of samples) as input and produces a tensor of shape [N, 4] as output, where the “4” represents the probabilities of the different classes.
The labels of the training data are a tensor of shape [N], with values in the range 0 to 3 (representing the ground-truth class).

Here’s my question: I’ve seen some demos that directly apply the loss function to the output tensor and the label tensor. I wonder why this works, since the two tensors have different sizes, and their sizes don’t seem to fit the “broadcasting semantics”.
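
To make the shape question concrete, here is a small sketch of what nn.CrossEntropyLoss appears to do with an [N, 4] output and an [N] label tensor. As far as I understand, it is not an elementwise loss, so broadcasting does not come into it: each label is used as a class index into the matching row of the output. The sketch assumes the default settings (mean reduction, no class weights); the tensor values and the names `logits`, `labels`, `picked`, and `manual` are only illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(2, 4)     # [N, C] raw scores, one row per sample (illustrative)
labels = torch.tensor([1, 3])  # [N] class indices, each in 0..C-1

# Built-in loss: takes [N, C] scores and [N] integer class indices.
builtin = nn.CrossEntropyLoss()(logits, labels)

# Manual version under the same assumptions:
# log-softmax over the class dimension, then pick one entry per row.
log_probs = F.log_softmax(logits, dim=1)                   # [N, C]
picked = log_probs[torch.arange(logits.size(0)), labels]   # [N]
manual = -picked.mean()                                    # scalar

print(builtin.shape, torch.allclose(builtin, manual))
```

If the two values agree, the label tensor is acting as a row-wise index rather than as an operand of the same shape, which would explain why the sizes do not need to match or broadcast.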

Here’s a minimal demo.

import torch
import torch.nn as nn
import torch.optim as optim

if __name__ == '__main__':
    features = torch.randn(2, 7)  # input, shape [N, 7] with N = 2
    gt = torch.tensor([1, 1])     # labels, shape [N], class indices in 0..3
    model = nn.Sequential(
        nn.Linear(7, 4),
        nn.Linear(4, 4)
    )
    optimizer = optim.SGD(model.parameters(), lr=0.005)
    f = nn.CrossEntropyLoss()

    for epoch in range(1000):
        output = model(features)  # shape [N, 4]
        loss = f(output, gt)      # applied to [N, 4] and [N]