2D CrossEntropyLoss for one-hot targets?

From the definition of CrossEntropyLoss:

input has to be a 2D Tensor of size (minibatch, C).
This criterion expects a class index (0 to C-1) as the target for each value of a 1D tensor of size minibatch.

My last dense layer outputs shape (mini_batch, 23*N_classes), which I reshape to (mini_batch, 23, N_classes) and softmax along dim=2, so the output and target have the following shape:

I get a predicted output that has shape (minibatch, 23, N_classes), so a given outputted sample is basically the predicted one_hot vector (each of the 23 rows can only be of one class)

What I want to do, basically, is to use CrossEntropyLoss for each of these rows of each sample, i.e. a 2D CrossEntropyLoss.

For a single sample, a given output (after softmax) looks something like this. (Here is an example with smaller dimensions, (n=2, 3, 5), instead of (n, 23, 25).)

tensor([[[0, 1, 0, 0, 0],
         [0, 0, 0, 0, 1],
         [0, 1, 0, 0, 0]],

        [[1, 0, 0, 0, 0],
         [0, 0, 0, 1, 0],
         [0, 0, 0, 1, 0]]])

>>> x = torch.randn(2, 3, 5)
>>> out = F.softmax(x, dim=2)
>>> out
tensor([[[0.2093, 0.1281, 0.5016, 0.0836, 0.0773],
         [0.1146, 0.0575, 0.1194, 0.4064, 0.3021],
         [0.2026, 0.2265, 0.3767, 0.0556, 0.1385]],

        [[0.0473, 0.2789, 0.0782, 0.3650, 0.2306],
         [0.0054, 0.2677, 0.0643, 0.5199, 0.1427],
         [0.4490, 0.2397, 0.2088, 0.0787, 0.0237]]])

Is there a way to use CrossEntropyLoss with a 2D target (so the input would be 3D, (batchsize, dim1, dim2))? I.e.

criterion = nn.CrossEntropyLoss()
loss = criterion(out, target)

Hi Richie!

Yes. CrossEntropyLoss supports what it calls the “K-dimensional case.”

Note, pytorch’s CrossEntropyLoss does not accept a one-hot-encoded
target – you have to use integer class labels instead.

Let’s call your value 23 length. Your input (the prediction generated
by your network) should have shape
[mini_batch, N_classes, length]

Your target (the ground-truth labels) should have shape
[mini_batch, length] without an N_classes dimension and should
be integer class labels in [0, N_classes - 1], inclusive. (In particular,
they are not one-hot encoded.)

Last, do not pass the output of your network through softmax();
CrossEntropyLoss has, in effect, softmax() built in (and expects
its input to be raw-score logits).
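Putting those three points together, here is a minimal sketch (shapes taken from your post; the permute() is there because your network emits [mini_batch, length, N_classes]):

```python
import torch

mini_batch, length, N_classes = 4, 23, 25

# raw logits from the last dense layer, after your reshape:
# [mini_batch, length, N_classes] -- note: no softmax applied
out = torch.randn(mini_batch, length, N_classes)

# CrossEntropyLoss wants the class dimension second:
# [mini_batch, N_classes, length]
out = out.permute(0, 2, 1)

# integer class labels in [0, N_classes - 1], shape [mini_batch, length]
# (not one-hot encoded)
target = torch.randint(N_classes, (mini_batch, length))

criterion = torch.nn.CrossEntropyLoss()
loss = criterion(out, target)  # a single scalar, averaged over batch and length
```

If your ground truth starts out one-hot encoded, target = one_hot.argmax(dim=-1) recovers the integer labels.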


K. Frank


Thanks a lot for the reply!

I hadn’t realized that the ground truth couldn’t be one-hot encoded, or that the output of my network needed the switched shape [batch, N_classes, length] instead of [batch, length, N_classes].

But now, say my vector also has a positional encoding, so it is no longer of dimension [batch, length, N_classes] but [batch, length, N_classes + 4], and each row can take two positive values (one for the class, and one for the positional encoding), e.g.:

label = [0, 1, 0, 0, 0, ...]  # shape [1, 1, 21] for a single item and single row
pos = [0, 0, 1, 0]  # shape [1, 1, 4] for a single item and single row
final_encoding = torch.cat([label, pos], dim=2)  # shape [1, 1, 25] for a single item/row

So now my network predicts both the class AND the positional encoding. If I slice my tensor and then run a separate CrossEntropyLoss on each subtensor, will autograd still work?


Hi Richie!

Just to be clear, if you want to use pytorch’s CrossEntropyLoss, you
have to do it this way.

CrossEntropyLoss is written to take as its target integer class labels
rather than one-hot-encoded labels and doesn’t have a built-in feature
to use one-hot encoding. Similarly, CrossEntropyLoss expects the
second dimension of its input to be the class dimension, so that input
has shape [mini_batch, N_classes, d1, d2, ...], where d1, etc.,
are optional additional dimensions for the “K-dimensional case.”

(That’s just how it works. If you need something else, you would have
to write your own version or write a wrapper for CrossEntropyLoss.)
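As an illustration of such a wrapper (a hypothetical helper, not part of pytorch), you could recover integer labels from a one-hot target with argmax() before calling the built-in loss:

```python
import torch

def cross_entropy_one_hot(logits, one_hot_target):
    # Hypothetical wrapper: accept a one-hot target with the class
    # dimension second ([mini_batch, N_classes, d1, ...]) and convert it
    # to the integer class labels that cross_entropy expects.
    return torch.nn.functional.cross_entropy(
        logits, one_hot_target.argmax(dim=1))

# example shapes from the thread: [batch, classes, length]
logits = torch.randn(4, 25, 23)
one_hot = torch.nn.functional.one_hot(
    torch.randint(25, (4, 23)), num_classes=25)  # [batch, length, classes]
one_hot = one_hot.permute(0, 2, 1)               # class dim second
loss = cross_entropy_one_hot(logits, one_hot)
```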

Yes, doing this makes perfect sense.

I don’t understand your use case or your “positional encoding,” so let
me use an artificial example:

Let’s say you have a set of images and each image contains both a
picture of an animal – cat, dog, mouse – and a digit – 0 through 9. You
want your network to classify both at the same time.

The input to your network (not to the loss criterion) is a batch of images
with shape [mini_batch, height, width], and your ground-truth
labels have shape [mini_batch, 2]. The “animal” labels are
labels[:, 0], a vector of length mini_batch, and are integer class
labels with values 0 (cat), 1 (dog), and 2 (mouse). The digit labels
are labels[:, 1] (also, of course, of length mini_batch) and have
values 0 through 9.

So far, so good.

Your network architecture is whatever it is – perhaps some initial
convolutional layers because you’re classifying two-dimensional
images – followed by some fully-connected layers. But your final
fully-connected layer should have nAnimal = 3 plus nDigit = 10
output features so that the output of your network (the input to your
loss criteria) is a tensor of shape [mini_batch, nAnimal + nDigit].

Now you slice input and labels and apply CrossEntropyLoss twice:

criterion = torch.nn.CrossEntropyLoss()
lossAnimal = criterion(input[:, :nAnimal], labels[:, 0])
lossDigit = criterion(input[:, nAnimal:], labels[:, 1])
lossTotal = lossAnimal + lossDigit

It’s fully legitimate to build your total loss out of multiple losses – in this
case two cross-entropy losses – and it will work just fine with autograd.
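A runnable sketch of that two-headed setup (names nAnimal and nDigit as above; a single Linear layer stands in for the whole network):

```python
import torch

nAnimal, nDigit, mini_batch = 3, 10, 8

# stand-in for the real network; its final layer has nAnimal + nDigit features
net = torch.nn.Linear(32, nAnimal + nDigit)

features = torch.randn(mini_batch, 32)
labels = torch.stack([torch.randint(nAnimal, (mini_batch,)),   # animal labels
                      torch.randint(nDigit, (mini_batch,))],   # digit labels
                     dim=1)                                    # [mini_batch, 2]

output = net(features)                # [mini_batch, nAnimal + nDigit], raw logits
criterion = torch.nn.CrossEntropyLoss()
lossAnimal = criterion(output[:, :nAnimal], labels[:, 0])
lossDigit = criterion(output[:, nAnimal:], labels[:, 1])
lossTotal = lossAnimal + lossDigit

lossTotal.backward()                  # autograd flows through both slices
```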

As you train your network by backpropagating lossTotal, your network,
and, in particular, your final fully-connected layer, will "learn" that the first
nAnimal output features of the final layer are the predicted raw-score
logits for the animal in the image, and that the remaining nDigit output
features are the predicted logits for the digit in the image.


K. Frank


Ah, thank you very much for the example, it is much clearer now. In my head I was stuck on using softmax as the activation of my last layer (instead of leaving the raw logits), and because I absolutely wanted my network to apply the softmax, I couldn’t see how to predict two labels at the same time given the dimensions of the last layer.