Past questions here have shown that, under the hood, the cross_entropy calculation uses the natural log rather than log_2. Does anyone know why that choice was made?
On the one hand, the loss distributions are still similarly shaped, although they are now scaled slightly differently. On the other hand, using base 2 fits more with the information-theoretic definition of cross-entropy.
Was there a specific reason the natural log is used?
Is it computationally more efficient, or simpler?
Does it matter at all in anyone’s practical experience?
That is indeed the only difference. To get the log-2 version of cross entropy
from PyTorch's cross entropy, multiply by 1 / log(2), which is about 1.4427.
This has almost no effect on training. For example, with plain-vanilla SGD as
your optimizer, this rescaling can be absorbed into the learning rate.
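
Here is a rough sketch of both points in plain Python (standard library only, so it stands in for PyTorch's cross entropy rather than calling it). The helper names `softmax`, `cross_entropy_nats`, and `grad_nats` and the logit values are made up for illustration. It checks that dividing the natural-log loss by log(2) gives the base-2 loss, and that an SGD step on the base-2 loss is the same as a step on the natural-log loss with the learning rate divided by log(2).

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_nats(logits, target):
    """Single-sample cross entropy using the natural log (units: nats)."""
    return -math.log(softmax(logits)[target])

def grad_nats(logits, target):
    """Gradient of the natural-log loss w.r.t. the logits: softmax - one_hot."""
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

# Hypothetical example values.
logits = [2.0, 0.5, -1.0]
target = 0
lr = 0.1

# Conversion: natural-log loss times 1 / log(2) equals the base-2 loss.
loss_nats = cross_entropy_nats(logits, target)
loss_bits = loss_nats / math.log(2)
loss_bits_direct = -math.log2(softmax(logits)[target])

# SGD: the gradient of the base-2 loss is the natural-log gradient / log(2),
# so stepping on it with rate lr matches stepping on the natural-log loss
# with rate lr / log(2).
g = grad_nats(logits, target)
step_on_bits_loss = [z - lr * (gi / math.log(2)) for z, gi in zip(logits, g)]
step_on_nats_loss = [z - (lr / math.log(2)) * gi for z, gi in zip(logits, g)]

print(loss_nats, loss_bits, loss_bits_direct)
```

With adaptive optimizers such as Adam, a constant factor on the loss is largely normalized away by the update rule, so the choice of base matters even less there.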
My guess is that it's because log() (i.e., the “natural log”) is the most
“standard” version of log.
It really doesn’t matter in the least. It’s just a modest rescaling by a
constant factor of order one (1 / log(2) is about 1.44).