Hi guys. Let’s say I want to predict three values (they add up to 1). The labels are also three values (they add up to 1).
I regard it as a distribution learning problem (not sure if I am right), so I am currently using F.kl_div as my loss function. But I found that I can also use the MAE of each output as the loss (a simple regression problem with 3 heads). The experimental results are similar.
>>> import torch.nn.functional as F
>>> logits # the output of my model
tensor([-0.1398, -0.1189, -0.9538])
>>> target # the label
tensor([0.1500, 0.2500, 0.6000])
# using kl_div
>>> F.kl_div(F.log_softmax(logits.unsqueeze(0), dim=1), target.unsqueeze(0), reduction='batchmean')
# using MAE
>>> F.l1_loss(F.softmax(logits.unsqueeze(0), dim=1), target.unsqueeze(0), reduction='mean')
I wonder if I did it right. Which loss function should I use for this kind of problem? Or is there another, more correct way to do this?
From the perspective of the optimization problem, little would change. Both will try to match your ground truth (beware that the L1 norm is OK but L2 is not). The difference lies in the interpretation you give to the loss values and in how they optimise your model.
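To make the "how they optimise your model" part concrete, here is a minimal sketch that compares the gradient each loss sends back to the logits. It reuses the numbers from the question; the helper name `grad_wrt_logits` is made up for illustration, and the batch size of 1 is an assumption carried over from the `unsqueeze(0)` calls.

```python
import torch
import torch.nn.functional as F

# Values copied from the question, with an explicit batch dimension of 1.
target = torch.tensor([[0.15, 0.25, 0.60]])

def grad_wrt_logits(loss_fn):
    """Hypothetical helper: return (loss value, gradient w.r.t. the logits)."""
    logits = torch.tensor([[-0.1398, -0.1189, -0.9538]], requires_grad=True)
    loss = loss_fn(logits)
    loss.backward()
    return loss.item(), logits.grad.squeeze(0)

# KL divergence: F.kl_div expects log-probabilities as input.
kl_loss, kl_grad = grad_wrt_logits(
    lambda z: F.kl_div(F.log_softmax(z, dim=1), target, reduction='batchmean'))

# MAE: F.l1_loss is applied to probabilities.
mae_loss, mae_grad = grad_wrt_logits(
    lambda z: F.l1_loss(F.softmax(z, dim=1), target, reduction='mean'))

print('KL  loss:', kl_loss, 'grad:', kl_grad)
print('MAE loss:', mae_loss, 'grad:', mae_grad)
```

With a single sample and a target that sums to 1, the KL gradient should come out as `softmax(logits) - target`, so it shrinks as the prediction approaches the target. The MAE gradient depends on the sign of each error rather than its size, which is one way the two losses can behave differently during training even when the final results look similar.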
Keeping it short: concerning the interpretation, KL divergence gives you the amount of information lost when the distribution predicted from your logits is used to approximate your target (one of many interpretations). That is much more meaningful than the L1 distance, which gives you the mean absolute error you commit when approximating the target with the softmax of the logits. The two are related through the total variation distance (see "Total variation distance of probability measures" on Wikipedia).
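As a rough illustration of that relation, the sketch below recomputes both quantities in plain Python (no PyTorch needed) for the numbers in the question. The total variation distance between two discrete distributions is half the summed L1 distance, and Pinsker's inequality bounds it by sqrt(KL / 2) when KL is measured in nats. The variable names are my own, not anything from the thread.

```python
import math

# Values copied from the question.
logits = [-0.1398, -0.1189, -0.9538]
target = [0.15, 0.25, 0.60]

# Numerically stable softmax to turn logits into probabilities.
m = max(logits)
exps = [math.exp(z - m) for z in logits]
total = sum(exps)
pred = [e / total for e in exps]

# KL(target || pred) in nats, the same quantity F.kl_div computes here.
kl = sum(t * math.log(t / p) for t, p in zip(target, pred))

# Summed L1 distance, its mean (the MAE), and the total variation distance.
l1_sum = sum(abs(p - t) for p, t in zip(pred, target))
mae = l1_sum / len(target)
tv = 0.5 * l1_sum

# Pinsker's inequality: TV <= sqrt(KL / 2).
pinsker_bound = math.sqrt(kl / 2)

print(f'KL = {kl:.4f}, MAE = {mae:.4f}, TV = {tv:.4f}, bound = {pinsker_bound:.4f}')
```

So a small KL divergence forces a small MAE (MAE = 2 * TV / 3 for three outputs), but not the other way round, which fits with the two losses giving similar results in practice.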