I’m working on an image classification problem with 12 classes. I’d like to use test-time augmentation, averaging the probabilities of each class over the augmented test images, but I am uncertain as to the correct formula.
Currently, I just take the output of the final fc layer, treating it as unnormalized output, and apply a softmax across classes, interpreting the result as the probability of the class. This is what I average over, taking argmax after averaging.
The loss function is cross-entropy, which involves log(softmax), so my question is: am I interpreting the probabilities in my scheme above properly, or should I do something different?
Grateful for any advice.
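For reference, the scheme described in the question (softmax per augmented view, arithmetic mean over views, then argmax) can be sketched as follows. This is a minimal illustration in plain Python, not the poster's actual code: the helper names `softmax` and `tta_arithmetic` and the logit values are made up for the example, and it uses 4 classes instead of 12 to stay short.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def tta_arithmetic(logits_per_view):
    """Arithmetic mean of the per-view softmax probabilities."""
    probs = [softmax(v) for v in logits_per_view]
    n_views = len(probs)
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / n_views for c in range(n_classes)]

# Hypothetical final-layer outputs: 3 augmented views of one image, 4 classes.
views = [
    [2.0, 0.5, -1.0, 0.1],
    [1.5, 1.0, -0.5, 0.0],
    [0.2, 1.8, -1.2, 0.3],
]

avg_probs = tta_arithmetic(views)  # still sums to 1, no re-normalization needed
pred = max(range(len(avg_probs)), key=avg_probs.__getitem__)  # argmax after averaging
```

The arithmetic mean of valid probability vectors is itself a valid probability vector, so the averaged output can be read directly as class probabilities.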
In my experience, the geometric mean of the ‘probabilities’ from the softmax layer provides the best result for this. If you’ve got the output of log-softmax, you can just take the (arithmetic) mean, as it equals the log of the geometric mean of the non-log values and so gives the same argmax. Then yes, take the argmax of either of those…
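A rough sketch of the log-softmax route mentioned above: averaging log-probabilities over views gives the log of the geometric mean, and exponentiating and re-normalizing turns it back into probabilities. The function names `log_softmax` and `tta_geometric` and the logit values are hypothetical, chosen only for this illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def tta_geometric(logits_per_view):
    """Return (mean log-probs, re-normalized geometric-mean probs)."""
    log_probs = [log_softmax(v) for v in logits_per_view]
    n_views = len(log_probs)
    n_classes = len(log_probs[0])
    # Arithmetic mean of log-probabilities == log of the geometric mean.
    mean_log = [sum(lp[c] for lp in log_probs) / n_views for c in range(n_classes)]
    # Exponentiate (shifted by the max for stability) and re-normalize to sum to 1.
    m = max(mean_log)
    unnorm = [math.exp(v - m) for v in mean_log]
    total = sum(unnorm)
    return mean_log, [u / total for u in unnorm]

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical final-layer outputs: 3 augmented views, 4 classes.
views = [
    [2.0, 0.5, -1.0, 0.1],
    [1.5, 1.0, -0.5, 0.0],
    [0.2, 1.8, -1.2, 0.3],
]

mean_log, gm_probs = tta_geometric(views)
pred_from_logs = argmax(mean_log)   # argmax of the mean log-probabilities
pred_from_probs = argmax(gm_probs)  # same class, since exp and scaling are monotonic
```

If only the predicted class is needed, the re-normalization step can be skipped, because it does not change the argmax.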
Do you know if there is a fundamental argument for using the geometric mean rather than the arithmetic mean, or is it more empirical?
If I assume that the prediction is a noisy estimate of the “true PD” with sufficiently nice noise, my intuition would lead me to an arithmetic estimate. Also, the geometric mean of probabilities doesn’t sum to 1 (in general), so you’d need to re-normalize:
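import torch
a = torch.randn(2, 4)  # logits: 2 predictions (e.g. augmented views), 4 classes
p = a.softmax(1)  # per-prediction class probabilities, each row sums to 1
p_gm = p[0]**0.5 * p[1]**0.5  # geometric mean across the two predictions
print(p_gm.sum())  # at most 1, and strictly less unless the two rows are identical
p_gm = p_gm / p_gm.sum()  # re-normalize so the result sums to 1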
@tom My usage is based on empirical results: the geometric mean of the probabilities for ensembling predictions from a NN has produced better results for me on many occasions.
I searched for theory once, as the intuition didn’t sit well, especially when thinking about the case where one ensemble member has a near-0 output. In practice, though, it works. I suppose a ‘try both’ answer might be more appropriate.
In my theory search, the papers that provided some insight in the context of NNs were about dropout.
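The near-0 concern can be made concrete with a toy comparison. This is only an illustration with made-up probabilities, not a result from any model: when one ensemble member assigns almost no mass to a class, the geometric mean lets that member effectively veto the class, while the arithmetic mean does not. The helper names are hypothetical, and the geometric version assumes strictly positive probabilities (as softmax outputs are).

```python
import math

def arithmetic_mean(dists):
    """Class-wise arithmetic mean of several probability vectors."""
    n = len(dists)
    return [sum(d[c] for d in dists) / n for c in range(len(dists[0]))]

def geometric_mean_normalized(dists):
    """Class-wise geometric mean, re-normalized to sum to 1 (needs p > 0)."""
    n = len(dists)
    gm = [math.exp(sum(math.log(d[c]) for d in dists) / n)
          for c in range(len(dists[0]))]
    total = sum(gm)
    return [g / total for g in gm]

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

# Made-up predictions from two ensemble members over 3 classes.
# Member 1 is confident in class 0; member 2 gives class 0 almost zero mass.
dists = [
    [0.98, 0.01, 0.01],
    [1e-6, 0.50, 0.499999],
]

am = arithmetic_mean(dists)
gm = geometric_mean_normalized(dists)
pred_am = argmax(am)  # class 0: the confident member dominates the average
pred_gm = argmax(gm)  # class 1: the near-0 member suppresses class 0
```

Which behaviour is preferable depends on whether a near-0 output from one member should be trusted as strong evidence against a class, which is consistent with the ‘try both’ suggestion above.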
Thanks for the answers. Very helpful.