PyTorch distribution.mean returns nan

I’m playing around with torch.distributions (specifically Categorical) and I noticed that if I initialize a categorical distribution and try to get its mean with distribution.mean it returns nan. This also happens if you try to get distribution.variance.

Simple code example:

import torch as th
from torch.distributions import Categorical

probs = th.tensor([0.2, 0.3, 0.5])
m = Categorical(probs=probs)
print("probs =", m.probs)
print("Mean using the formula =", th.sum(probs * th.arange(len(probs))))
print("distribution.mean =", m.mean)          # prints nan
print("distribution.variance =", m.variance)  # prints nan

Is there a reason that this happens?


Hello Adamo!

Let me speculate:

If you understand your categories – 0, 1, 2, … – to be numerical
values, the mean and standard deviation of the Categorical
distribution make perfectly good sense.

But I imagine people might have been thinking of the categories
as being more generally non-numerical – cat, dog, bird, … – for which
the mean and standard deviation don’t really make sense. (What do
you get when you average a cat and a dog? Answer: A smart dog.)

Looked at this way I could see not implementing mean, etc., or
returning nan for it. (I would prefer that mean, etc., be implemented
because they do make sense, at least for some use cases.)

Perhaps some forum participant could replace my idle speculation
with the real history and the reason …


K. Frank


Thanks KFrank! While that does make some sense to me, I still find it confusing: when you sample from the distribution you get numerical values (0 to C-1, where C is the number of categories). So if I can sample numerical values from the distribution, I should be able to get its mean whenever it’s straightforward to compute, as it is for a Categorical distribution.

I think it comes down to what @KFrank explained.
The mean and stddev of a categorical distribution don’t make sense, as there is no innate ordering on the values.
The cat/dog/bird example makes this clear. Alternatively, we could use an example with words:
if we assign each word from a text to a label, what would the “mean word” be, and what would the standard deviation mean in this context?

I see your point, but I don’t really agree. For example, in PyTorch I can get the mean and variance of a binomial distribution. However, I could make the same argument: the support of the binomial distribution, while represented by a subset of the natural numbers {0, 1, 2, …, n}, is actually defined over a finite set of things that have no inherent order, so taking the mean doesn’t make sense there either. I get that this is a silly way to think about binomial distributions, but it’s consistent with what you said. I think it would make sense to define the mean of the categorical distribution as sum_x x * p(x). We assume an ordering on the support for all of the other distributions; it makes sense to me to do it for the categorical distribution too.

Either way I guess I can always just write my own subclass of Categorical that defines mean and variance in the way that I expect >:)
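For what it’s worth, a minimal sketch of what such a subclass could look like, assuming the support 0, …, K-1 is treated as numeric (the class name OrdinalCategorical and the implementation details are my own, not an official PyTorch API):

```python
import torch
from torch.distributions import Categorical

class OrdinalCategorical(Categorical):
    """Categorical that treats its support 0..K-1 as numeric values."""

    @property
    def _support_values(self):
        # 0, 1, ..., K-1 as floats, matching the dtype of probs
        return torch.arange(self.probs.size(-1), dtype=self.probs.dtype)

    @property
    def mean(self):
        # E[X] = sum_x x * p(x)
        return torch.sum(self.probs * self._support_values, dim=-1)

    @property
    def variance(self):
        # Var[X] = E[X^2] - E[X]^2
        ex2 = torch.sum(self.probs * self._support_values ** 2, dim=-1)
        return ex2 - self.mean ** 2

d = OrdinalCategorical(probs=torch.tensor([0.2, 0.3, 0.5]))
print(d.mean)      # 0*0.2 + 1*0.3 + 2*0.5 = 1.3
print(d.variance)  # (0.3 + 4*0.5) - 1.3**2 = 0.61
```

Since mean and variance are plain properties on the base Distribution class, overriding them in a subclass like this is straightforward and everything else (sampling, log_prob, etc.) keeps working unchanged.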

That sounds like a valid assumption, so let’s ask a mathematician, as I could be completely wrong.
CC @tom Could you help us out here, please? :slight_smile:

Maybe it would make sense to have a distributions.Multinoulli :smiley:

When you take the mean or the variance, you are implicitly using distances between events, and minimizing the (squared) distance has to have a meaning.

The binomial distribution is a distribution over events that are counts. Counting gives a very natural notion of distance between events: when @ptrblck has 6 heads out of 10 coin tosses, I have 5, and Sarah has 4, he has been much luckier than she was and a bit luckier than I was. It makes sense to say that 6 is at distance 2 from 4 and at distance 1 from 5, that 5 is closer to 6 than 4 is, and that the expected number of heads is 5.
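For comparison, PyTorch’s Binomial distribution does implement mean and variance, since counts carry this natural ordering. A quick check (the values follow n·p and n·p·(1-p)):

```python
import torch
from torch.distributions import Binomial

# Counts have a natural ordering, so Binomial defines mean and variance.
b = Binomial(total_count=10, probs=torch.tensor(0.5))
print(b.mean)      # n * p = 10 * 0.5 = 5.0
print(b.variance)  # n * p * (1 - p) = 10 * 0.5 * 0.5 = 2.5
```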

Now for the categorical distribution, the events are the categories, and you have no notion of distance. It makes no sense to say that a daisy (985) is much closer to a volcano (980) than to a great white shark (1). You would not say that the average outcome lies somewhere between a chopping knife (499) and a cliff dwelling (500).
So by refusing to give you mean and variance, the Categorical distribution reminds you of this fact.

It’s closely related to why we don’t use square distance on class numbers. It just doesn’t work well.

That said, you know your use case best and if you have in fact more structure than just categorical values, taking the mean might make sense for you.

Best regards


P.S.: Predicting word vectors for imagenet is much more fun and gives you back metrics. But then it’s not the categorical distribution anymore. :slight_smile:


Hi. I understand what you mean, but I strongly recommend opening up the possibility of getting the mean and variance from a Categorical distribution, because it can be useful in some cases, such as ordinal regression.