# 2D CrossEntropyLoss for one-hot targets?

From the definition of CrossEntropyLoss:

input has to be a 2D Tensor of size (minibatch, C).
This criterion expects a class index (0 to C-1) as the target for each value of a 1D tensor of size minibatch.

My last dense layer outputs shape (mini_batch, 23*N_classes), which I reshape to (mini_batch, 23, N_classes) and softmax along dim=2.

So a single predicted sample is essentially a predicted one-hot matrix of shape (23, N_classes): each of the 23 rows can belong to only one class.

What I want to do, basically, is to use CrossEntropyLoss for each of these rows of each sample, i.e. a 2D CrossEntropyLoss.

For a single sample, the target and the output (after softmax) look something like this (a smaller example with shape (2, 3, 5) instead of (n, 23, 25)):

```
>>> target
tensor([[[0, 1, 0, 0, 0],
         [0, 0, 0, 0, 1],
         [0, 1, 0, 0, 0]],

        [[1, 0, 0, 0, 0],
         [0, 0, 0, 1, 0],
         [0, 0, 0, 1, 0]]])

>>> x = torch.randn((2, 3, 5))
>>> out = F.softmax(x, dim=2)
>>> out
tensor([[[0.2093, 0.1281, 0.5016, 0.0836, 0.0773],
         [0.1146, 0.0575, 0.1194, 0.4064, 0.3021],
         [0.2026, 0.2265, 0.3767, 0.0556, 0.1385]],

        [[0.0473, 0.2789, 0.0782, 0.3650, 0.2306],
         [0.0054, 0.2677, 0.0643, 0.5199, 0.1427],
         [0.4490, 0.2397, 0.2088, 0.0787, 0.0237]]])
```

Is there a way to use CrossEntropyLoss with a 2D target per sample (so the full target is 3D, (batchsize, dim1, dim2)), i.e.

```
criterion = nn.CrossEntropyLoss()
loss = criterion(out, target)
```

Hi Richie!

Yes. CrossEntropyLoss supports what it calls the “K-dimensional case.”

Note, pytorch’s `CrossEntropyLoss` does not accept a one-hot-encoded
`target` – you have to use integer class labels instead.

Let’s call your value `23` `length`. Your `input` (the prediction generated
by your network) should have shape
`[mini_batch, N_classes, length]`

Your `target` (the ground-truth labels) should have shape
`[mini_batch, length]` without an `N_classes` dimension and should
be integer class labels in `[0, N_classes - 1]`, inclusive. (In particular,
they are not one-hot encoded.)

Last, do not pass the output of your network through `softmax()`;
`CrossEntropyLoss` has, in effect, `softmax()` built in (and expects
its `input` to be raw-score logits).
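Putting those three points together for the shapes in the question, a minimal sketch (tensor names are illustrative):

```python
import torch
import torch.nn as nn

mini_batch, length, n_classes = 2, 3, 5

# raw logits straight from the network, shaped [mini_batch, length, n_classes]
# (no softmax -- CrossEntropyLoss applies log-softmax internally)
logits = torch.randn(mini_batch, length, n_classes)

# a one-hot target as in the question, shaped [mini_batch, length, n_classes]
one_hot = torch.zeros(mini_batch, length, n_classes)
one_hot[:, :, 0] = 1.0  # arbitrary example labels

# convert one-hot rows to integer class labels: shape [mini_batch, length]
target = one_hot.argmax(dim=2)

# move the class dimension to position 1: shape [mini_batch, n_classes, length]
input = logits.permute(0, 2, 1)

criterion = nn.CrossEntropyLoss()
loss = criterion(input, target)  # scalar loss over all rows of all samples
```

The `permute` is the "K-dimensional case" requirement: the class dimension must be dim 1 of `input`, with `length` as a trailing dimension.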

Best.

K. Frank


Thanks a lot for the reply!

I hadn’t realized that the ground truth couldn’t be one-hot encoded, nor that the output of my network needed the swapped shape [batch, N_classes, length] instead of [batch, length, N_classes].

But now, say my vector also has a positional encoding, so it is no longer of dimension [batch, length, N_classes] but [batch, length, N_classes + 4], and each row now has two nonzero entries (one for the class, one for the positional encoding), e.g.:

```
# shapes shown for a single item and single row
label = torch.zeros(1, 1, 21)
label[0, 0, 1] = 1                               # class one-hot, shape [1, 1, 21]
pos = torch.zeros(1, 1, 4)
pos[0, 0, 2] = 1                                 # positional one-hot, shape [1, 1, 4]
final_encoding = torch.cat([label, pos], dim=2)  # shape [1, 1, 25]
```

So now my network predicts both the class AND the positional encoding. If I slice my tensor and then run a separate CrossEntropyLoss on each of the subtensors, will autograd still work?

Best,
Richie

Hi Richie!

Just to be clear, if you want to use pytorch’s `CrossEntropyLoss`, you
have to do it this way.

`CrossEntropyLoss` is written to take as its `target` integer class labels
rather than one-hot-encoded labels and doesn’t have a built-in feature
to use one-hot encoding. Similarly, `CrossEntropyLoss` expects the
second dimension of its `input` to be the class dimension, so that `input`
has shape `[mini_batch, N_classes, d1, d2, ...]`, where `d1`, etc.,
are optional additional dimensions for the “K-dimensional case.”

(That’s just how it works. If you need something else, you would have
to write your own version or write a wrapper for `CrossEntropyLoss`.)

Yes, doing this makes perfect sense.

I don’t understand your use case or your “positional encoding,” so let
me use an artificial example:

Let’s say you have a set of images and each image contains both a
picture of an animal – cat, dog, mouse – and a digit – 0 though 9. You
want your network to classify both at the same time.

The input to your network (not to the loss criterion) is a batch of images
with shape `[mini_batch, height, width]`, and your ground-truth
`labels` have shape `[mini_batch, 2]`. The “animal” labels are
`labels[:, 0]`, a vector of length `mini_batch`, and are integer class
labels with values `0` (cat), `1` (dog), and `2` (mouse). The digit labels
are `labels[:, 1]` (also, of course, of length `mini_batch`) and have
values `0` through `9`.

So far, so good.

Your network architecture is whatever it is – perhaps some initial
convolutional layers because you’re classifying two-dimensional
images – followed by some fully-connected layers. But your final
fully-connected layer should have `nAnimal = 3` plus `nDigit = 10`
output features so that the output of your network (the `input` to your
loss criteria) is a tensor of shape `[mini_batch, nAnimal + nDigit]`.

Now you slice `input` and `labels` and apply `CrossEntropyLoss` twice:

```
criterion = torch.nn.CrossEntropyLoss()
lossAnimal = criterion(input[:, :nAnimal], labels[:, 0])
lossDigit = criterion(input[:, nAnimal:], labels[:, 1])
lossTotal = lossAnimal + lossDigit
lossTotal.backward()
```

It’s fully legitimate to build your total loss out of multiple losses – in this
case two cross-entropy losses – and it will work just fine with autograd.
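A quick self-contained check of the two-loss setup (a single linear layer stands in for the whole network here, just for illustration):

```python
import torch

nAnimal, nDigit, mini_batch = 3, 10, 4

# stand-in for the final fully-connected layer of the network
fc = torch.nn.Linear(8, nAnimal + nDigit)
features = torch.randn(mini_batch, 8)
input = fc(features)  # shape [mini_batch, nAnimal + nDigit], raw logits

# ground-truth integer labels: column 0 for the animal, column 1 for the digit
labels = torch.stack([
    torch.randint(0, nAnimal, (mini_batch,)),
    torch.randint(0, nDigit, (mini_batch,)),
], dim=1)  # shape [mini_batch, 2]

criterion = torch.nn.CrossEntropyLoss()
lossAnimal = criterion(input[:, :nAnimal], labels[:, 0])
lossDigit = criterion(input[:, nAnimal:], labels[:, 1])
lossTotal = lossAnimal + lossDigit
lossTotal.backward()  # gradients flow back through both slices into fc
```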

As you train your network by backpropagating `lossTotal`, your network –
and, in particular, your final fully-connected layer – will "learn" that the
first `nAnimal` output features of the final layer are the predicted raw-score
logits for the animal in the image, and that the remaining `nDigit` output
features are the predicted logits for the digit in the image.

Best.

K. Frank


Ah, thank you very much for the example, it is much clearer now. In my head I was stuck on using softmax as the activation of my last layer (instead of leaving the raw logits), and because I insisted that the network apply the softmax itself, I couldn’t see how to predict two labels at the same time given the dimension of the last layer.