How exactly should I understand the cross entropy loss function?

Hello. I know this question’s been asked quite a lot on a variety of communities but I’m still having trouble grasping it.

I’m currently implementing the continuous bag-of-words (CBOW) model using PyTorch. I’m facing some problems when implementing the cross entropy loss, though. Here’s the portion of code that’s causing the problem:

for idx, sample in enumerate(self.train_data):
    x = torch.tensor(sample[0], dtype=torch.long)
    y = np.zeros(shape=(self.vocab_size)) # self.vocab_size = 85,000
    y[int(sample[1])] = np.float64(1)
    y = torch.tensor(y, dtype=torch.long)

    if torch.cuda.is_available():
        x = x.cuda()
        y = y.cuda()


    output = self.model(x) # output's shape is the same as self.vocab_size
    loss = criterion(output, y)

To briefly explain my code, the model that I’ve implemented basically outputs the averaged embedding values of a context array and performs a linear projection to project them into a shape that’s identical to the size of the vocabulary. Then we run this array through a softmax function.

The contents of self.train_data are basically (context, target_word) pairs. y is a one-hot encoded array of the token.

I’m aware that the second input to nn.CrossEntropyLoss is C = # of classes, but I’m not sure where my code went wrong. The vocabulary size is 85,000, so isn’t the number of classes 85,000?

If I change the input to

loss = criterion(output, 85000)

I get the same error:

*** RuntimeError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

What am I doing wrong, and how should I understand the input to PyTorch’s cross entropy loss?


Let us take an example.
Suppose we have an input that can be classified into one of 10 classes, and we want to compute its CrossEntropyLoss.

inp = torch.randn(1, 10)

So the input looks something like this (in practice, it would be the output of our neural network):

tensor([[ 1.4775,  0.7022,  0.7499, -0.7535, -1.4983, -2.3193, -0.6166, -2.3302,
         -0.8847, -0.1915]])

Now suppose the target class, which we know to be the correct one, is 2.
Then we do:

loss_fn = nn.CrossEntropyLoss()
loss_fn(inp, torch.tensor([2]))

Internally, this takes the softmax of our input, which gives something like this:

sftmx = nn.Softmax(dim=1)
sftmx(inp)
tensor([[0.3918, 0.1804, 0.1892, 0.0421, 0.0200, 0.0088, 0.0483, 0.0087, 0.0369,
         0.0738]])

These are the probabilities that our input belongs to each of the 10 classes.
So the probability that our input corresponds to target class 2 is 0.1892,
and the loss is -log(0.1892), which is approximately 1.665.


So this is our CrossEntropyLoss. As we train, we want the model to change its outputs (the input we are giving to CrossEntropyLoss) so that this loss decreases.
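The two steps above (softmax, then the negative log of the target's probability) can be checked by hand. This is only a sketch in plain Python using the example values from this thread; the names logits, probs and loss are mine, and PyTorch computes the same quantity in a more numerically stable way.

```python
import math

# Logits copied from the example tensor above (one sample, 10 classes).
logits = [1.4775, 0.7022, 0.7499, -0.7535, -1.4983,
          -2.3193, -0.6166, -2.3302, -0.8847, -0.1915]
target = 2  # index of the correct class

# Softmax: exponentiate each logit and normalise so the values sum to 1.
exps = [math.exp(v) for v in logits]
total = sum(exps)
probs = [e / total for e in exps]

# Cross entropy for one sample is the negative log of the target's probability.
loss = -math.log(probs[target])

print(probs[target])  # close to 0.1892
print(loss)           # close to 1.665
```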

By default, the mean of the loss over all predictions is returned. If we instead specify

loss_fn = nn.CrossEntropyLoss(reduction='none')

then it returns one loss value per prediction instead of their mean.
So if we had the input

inp = torch.randn(2, 10)

and target as

target = torch.tensor([3, 4])

and then applied the loss function to them, it would give something like

tensor([3.6951, 2.6064])

which means that the loss for the first prediction is 3.6951 and the loss for the second prediction is 2.6064.
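To see how the two reduction settings relate, here is a small sketch. The seed is an assumption added only so the run is repeatable, so the printed numbers will differ from the 3.6951 and 2.6064 quoted above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)            # assumed seed, only to make the run repeatable
inp = torch.randn(2, 10)        # 2 predictions, 10 classes each
target = torch.tensor([3, 4])   # correct class index for each prediction

# reduction='none' keeps one loss per prediction; the default averages them.
per_sample = nn.CrossEntropyLoss(reduction='none')(inp, target)  # shape (2,)
mean_loss = nn.CrossEntropyLoss()(inp, target)                   # scalar

print(per_sample)
print(mean_loss)
print(torch.allclose(per_sample.mean(), mean_loss))
```

The default value is simply the average of the per-prediction losses.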


Hi, thanks for the reply. I’m still having trouble following how to understand the cross entropy loss. Judging by your answer, it seems that the C that we’re giving our loss function is the correct class?

However, in my case I still get the same runtime error if I do

loss(output, torch.tensor([3]))

where output.shape = 85997.

Also, if C = # of classes as the documentation says, shouldn’t C = torch.tensor([10]) in the example you gave?
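For what it is worth, here is a sketch of the shapes nn.CrossEntropyLoss expects, assuming a single prediction. C is never passed as an argument: it only appears as the size of the class dimension of the input, which has shape (N, C), while the target has shape (N,) and holds class indices between 0 and C-1. The error above appears consistent with passing a 1-D output that has no batch dimension, though I cannot confirm that from the thread alone. vocab_size and target_index are hypothetical stand-ins for the real values.

```python
import torch
import torch.nn as nn

vocab_size = 10                      # hypothetical stand-in for the real vocabulary size
criterion = nn.CrossEntropyLoss()

output = torch.randn(vocab_size)     # a single unbatched prediction, shape (vocab_size,)
target_index = 3                     # the target word's index, not a one-hot vector

logits = output.unsqueeze(0)         # add a batch dimension: shape (1, vocab_size)
target = torch.tensor([target_index], dtype=torch.long)  # shape (1,)

loss = criterion(logits, target)     # scalar tensor
print(logits.shape, target.shape, loss.item())
```

Note that nn.CrossEntropyLoss applies log-softmax itself, so the model should hand it raw scores rather than values that have already been through a softmax.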

Thank you for this clear explanation.

One doubt that I have is why we use

-log(probability of the correct class)

From what I understand, it has to do with the number of bits that would be required to represent that probability.
For example, if our model predicted a probability of 1 for the correct class, then

-log(1)

would be zero, so we do not need any bits: the model’s prediction is correct about which class our input belongs to.
If our model gives a probability of 0.5 for the correct class, then

-log(0.5)

would be 1 [assuming we take log base 2; since (log of input base e) is ((log of input base 2)/(log of e base 2)), changing the base only rescales the loss by a constant factor, so it does not change which prediction is better].

So it is interpreted as our model needing 1 bit to represent this probability; if the probability were 0.25, it would need 2 bits.

So we want to reduce the number of bits required to represent our prediction. I am not sure about this, though; I think there is something more behind the use of -log.
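The bit counts mentioned above can be checked with a few lines of plain Python. This is just a sketch of the arithmetic, with a helper name (bits) that I made up; it does not settle whether this is the full reason behind -log.

```python
import math

def bits(p):
    """Information content, in bits, of an outcome with probability p."""
    # log2(1 / p) is the same quantity as -log2(p).
    return math.log2(1.0 / p)

for p in (1.0, 0.5, 0.25):
    print(p, bits(p))

# The natural log gives the same quantity scaled by the constant ln(2),
# so the choice of base does not change which prediction has the lower loss.
p = 0.25
nats = -math.log(p)
print(math.isclose(nats, bits(p) * math.log(2)))
```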