I’m working on an image classification problem with 12 classes. I’d like to use test-time augmentation by averaging the probabilities of each class over the augmented test images, but I am uncertain about the correct formula.
Currently, I just take the output of the final fc layer, treat it as unnormalized scores, and apply a softmax across classes, interpreting the result as the class probabilities. These are what I average over the augmented images, taking the argmax after averaging.
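Concretely, what I do is roughly the following sketch (model and aug_batches are just placeholder names here):

import torch

def tta_predict(model, aug_batches):
    # aug_batches: a list of augmented versions of the test batch, each of shape [B, C, H, W]
    probs = [model(x).softmax(dim=1) for x in aug_batches]   # per-augmentation class probabilities
    mean_probs = torch.stack(probs).mean(dim=0)              # arithmetic mean over the augmentations
    return mean_probs.argmax(dim=1)                          # predicted class per test image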
The loss function is cross-entropy, which involves log(softmax), so my question is: am I interpreting the probabilities properly in the scheme above, or should I do something different?
The geometric mean of the ‘probabilities’ from the softmax layer provides the best result for this. If you’ve got the output of log-softmax, you can just take the (arithmetic) mean, since the mean of the logs is the log of the geometric mean of the non-log values. Then yes, take the argmax of either of those…
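To make the equivalence concrete, here is a quick sketch with random logits standing in for the network outputs:

import torch

torch.manual_seed(0)
logits = torch.randn(5, 2, 12)                # 5 augmented views, batch of 2, 12 classes
log_p = logits.log_softmax(dim=-1)

gm = log_p.exp().prod(dim=0) ** (1.0 / 5)     # geometric mean of the softmax probabilities
am_log = log_p.mean(dim=0)                    # arithmetic mean of the log-softmax outputs

print(torch.allclose(am_log, gm.log()))                       # mean of logs == log of geometric mean
print(torch.equal(am_log.argmax(dim=-1), gm.argmax(dim=-1)))  # so the argmax agrees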
Do you know if there is a fundamental argument for using the geometric mean rather than the arithmetic mean, or is it more empirical?
If I assume that the prediction is a noisy estimate of the “true” probability distribution with sufficiently nice noise, my intuition would lead me to the arithmetic mean. Also, the geometric mean of probabilities doesn’t sum to 1 (in general), so you’d need to re-normalize:
import torch

a = torch.randn(2, 4)              # two "ensemble members", four classes
p = a.softmax(1)                   # each row sums to 1
p_gm = p[0]**0.5 * p[1]**0.5       # elementwise geometric mean of the two rows
print(p_gm.sum())                  # generally less than 1
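If you do want to stick with the geometric mean, re-normalizing is a one-liner continuing the snippet above (and since it only divides by a positive scalar, the argmax is unchanged):

p_gm = p_gm / p_gm.sum()           # rescale so the combined 'probabilities' sum to 1
print(p_gm.sum())                  # 1.0 (up to floating point)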
@tom My usage is based on empirical results: the geometric mean of the probabilities for ensembling predictions from a NN has produced better results for me on many occasions.
I searched for theory once, as the intuition didn’t sit well, especially when thinking about the case when one ensemble member has a near-zero output (see the toy example at the end of this post). In practice though, it works. I suppose a ‘try both’ answer might be more appropriate.
In my theory search, the papers that provided some insight in the context of NNs were about dropout.
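To illustrate the near-zero case with made-up numbers (a toy example, not from a real model):

import torch

p1 = torch.tensor([0.98, 0.01, 0.01])   # member 1 is very confident in class 0
p2 = torch.tensor([1e-6, 0.60, 0.40])   # member 2 gives class 0 essentially zero probability

am = (p1 + p2) / 2                       # arithmetic mean still picks class 0
gm = (p1 * p2).sqrt()                    # geometric mean is "vetoed" by the near-zero entry
print(am.argmax().item(), gm.argmax().item())   # 0 1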