When you take the mean or the variance, you are implicitly using a notion of distance between the events and assuming that minimizing the (squared) distance is meaningful.
The binomial distribution is a distribution over events that are counts. Counting gives a very natural notion of distance between events – when @ptrblck gets 6 heads out of 10 coin tosses, I get 5, and Sarah gets 4, he is much luckier than her and a bit luckier than me. It makes sense to say that 6 is at distance 2 from 4 and at distance 1 from 5, that 5 is closer to 6 than 4 is, and that the expected number of heads (for a fair coin) is 5.
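For illustration, here is a minimal sketch with `torch.distributions` showing that `Binomial` happily reports a mean and variance (the values in the comments are what you'd expect for a fair coin):

```python
import torch
from torch.distributions import Binomial

# For counts, mean and variance are meaningful: with a fair coin and
# 10 tosses, the expected number of heads is 5.
d = Binomial(total_count=10, probs=torch.tensor(0.5))
print(d.mean)      # tensor(5.)
print(d.variance)  # tensor(2.5000), i.e. n * p * (1 - p)
```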
Now for the categorical distribution, the events are the categories, and you have no notion of distance. It makes no sense to say that a daisy (ImageNet class 985) is much closer to a volcano (980) than to a great white shark (1). Nor would you say that the average outcome lies between a chopping knife (499) and a cliff dwelling (500), or is the average of the two.
So the Categorical distribution refusing to give you a mean and variance reminds you of this fact.
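You can see this refusal directly. Depending on your PyTorch version, asking a `Categorical` for its mean either raises `NotImplementedError` or hands you back NaN – either way it declines to pretend that class indices live in a metric space:

```python
import torch
from torch.distributions import Categorical

# A uniform distribution over 1000 ImageNet-style classes.
d = Categorical(probs=torch.full((1000,), 1.0 / 1000))
try:
    print(d.mean)  # tensor(nan) on recent versions...
except NotImplementedError:
    print("no mean defined")  # ...or NotImplementedError on older ones
```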
This is closely related to why we don’t use squared distance on class numbers as a loss – it just doesn’t work well.
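A two-line illustration with the class indices from above: squared distance claims the volcano is an almost-correct answer for the daisy and the shark a catastrophically bad one, even though both are simply wrong classes.

```python
daisy, volcano, shark = 985, 980, 1  # ImageNet class indices from above

# Squared distance on class numbers invents a ranking of errors
# that has nothing to do with the actual classes:
print((daisy - volcano) ** 2)  # 25
print((daisy - shark) ** 2)    # 968256
```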
That said, you know your use case best, and if you in fact have more structure than just categorical values, taking the mean might make sense for you.
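For instance, if your categories are really ordered bins of some quantity, you can compute the expectation yourself over the bin values (the numbers below are made up for illustration):

```python
import torch

# Hypothetical example: the "categories" are bins of a physical
# quantity, so each bin carries a value that gives you the needed
# metric structure.
probs = torch.tensor([0.1, 0.2, 0.4, 0.2, 0.1])       # categorical probabilities
bin_values = torch.tensor([10., 15., 20., 25., 30.])  # value attached to each bin

expected_value = (probs * bin_values).sum()  # E[value] = sum_i p_i * v_i
print(expected_value)  # tensor(20.)
```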
Best regards
Thomas
P.S.: Predicting word vectors for ImageNet is much more fun and gives you a metric back. But then it’s not the categorical distribution anymore.
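In case it helps, a minimal sketch of what I mean, with random stand-ins for the pretrained word vectors and the model output:

```python
import torch
import torch.nn.functional as F

# Random stand-ins; in practice class_vectors would be pretrained word
# vectors for the 1000 ImageNet labels, and prediction the model output.
emb_dim, num_classes = 300, 1000
class_vectors = F.normalize(torch.randn(num_classes, emb_dim), dim=1)
prediction = F.normalize(torch.randn(emb_dim), dim=0)

# Cosine similarity between the prediction and each label vector gives
# you a metric back: the predicted class is the nearest label vector.
sims = class_vectors @ prediction  # shape (1000,)
predicted_class = sims.argmax()
```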