Why would BCELoss be >1?

I am playing around with the BCELoss and BCEWithLogitsLoss, and I don’t quite understand why these losses (with `size_average=True`) can ever exceed 1.

A slightly modified version of the random example from the docs never runs into this problem:

```python
import torch
import torch.nn as nn
from torch import autograd

m = nn.Sigmoid()
loss = nn.BCELoss()
input = autograd.Variable(torch.randn(3, 3, 3, 3), requires_grad=True)
target = autograd.Variable(torch.FloatTensor(3, 3, 3, 3).random_(2))
output = loss(m(input), target)
print(output)
```

But if I change input to be

```python
input = autograd.Variable(torch.zeros(3, 3, 3, 3), requires_grad=True)
output = loss(input, target)
print(output)
```

or

```python
input = autograd.Variable(torch.ones(3, 3, 3, 3), requires_grad=True)
output = loss(input, target)
print(output)
```

I can get losses exceeding 1. This holds even if I add/subtract some eps=0.01 to/from the input before evaluating the loss (since a sigmoid never truly outputs exact 0’s and 1’s).

How can this be the case? It’s an average entropy, and entropy is bounded on the [0,1] interval. This happens with both BCELoss (where I skip the sigmoid for the 0’s-and-1’s inputs) and BCEWithLogitsLoss.

I’m starting from a random initialization, so I’d expect largely random predictions (e.g. a loss near 0.5).

Is that true? BCE for one element `x` and target `t` is defined as
`-(t log(x) + (1-t) log(1-x))`
If we take `t -> 1` (from below) and `x -> 0` (from above), this becomes arbitrarily large.
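To see this numerically, here is a minimal pure-Python sketch of that per-element formula (the helper name `bce` is just for illustration, not a PyTorch API):

```python
import math

def bce(x, t):
    """Per-element binary cross-entropy: -(t*log(x) + (1-t)*log(1-x))."""
    return -(t * math.log(x) + (1 - t) * math.log(1 - x))

# A confident but wrong prediction easily exceeds 1 -- this is exactly the
# eps-shifted "all zeros" input from above, scored against a target of 1:
print(bce(0.01, 1.0))  # -log(0.01) ≈ 4.605
```

So the mean over many elements also exceeds 1 whenever enough predictions confidently disagree with their targets.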

That’s a good point. I was conflating Shannon entropy (`-p log p`) with the cross-entropy loss.
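To make that distinction concrete, here is a small sketch (in nats, i.e. natural log; the helper names are mine) comparing the two quantities for a Bernoulli distribution:

```python
import math

def shannon_entropy(p):
    """H(p) = -p*log(p) - (1-p)*log(1-p); maximized at p = 0.5, where it equals log(2)."""
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def cross_entropy(p, q):
    """H(p, q) = -p*log(q) - (1-p)*log(1-q); unbounded as q moves away from p."""
    return -p * math.log(q) - (1 - p) * math.log(1 - q)

print(shannon_entropy(0.5))       # log(2) ≈ 0.693 -- the maximally "random" loss, not 0.5
print(cross_entropy(0.99, 0.01))  # ≈ 4.56 -- far above 1
```

So a purely random predictor (x = 0.5 everywhere) lands near log(2) ≈ 0.693 rather than 0.5, and cross-entropy, unlike Shannon entropy, has no upper bound.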