Why should the entropy term of KLDivLoss be calculated? Isn't it meaningless when computing gradients?

Since the first term of KLDivLoss[1] (the negative entropy of the ground truth, y_true * log(y_true)) is constant with respect to the prediction, it contributes nothing to the gradient.
I also verified in my notebook[2] that the gradients computed by KLDivLoss and CrossEntropyLoss are equal (Figure 1).
So, what is the use case for calculating the entropy term of KLDivLoss? Isn't it meaningless when computing gradients?

Figure 1: Comparison between the gradients of CELoss and KLDivLoss.
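
The check in the notebook boils down to something like the following minimal sketch (my own reproduction, not the notebook itself; it assumes PyTorch ≥ 1.10 so that cross_entropy accepts probability targets):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10, requires_grad=True)
y_true = torch.softmax(torch.randn(4, 10), dim=-1)  # soft ground-truth distribution

# KLDivLoss expects log-probabilities as input and probabilities as target.
kl = F.kl_div(F.log_softmax(logits, dim=-1), y_true, reduction="batchmean")
grad_kl, = torch.autograd.grad(kl, logits)

# cross_entropy accepts probability (soft) targets since PyTorch 1.10.
ce = F.cross_entropy(logits, y_true)
grad_ce, = torch.autograd.grad(ce, logits)

print(torch.allclose(grad_kl, grad_ce, atol=1e-6))  # True: identical gradients
```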


  1. KLDivLoss — PyTorch 2.1 documentation

  2. kl_divergence.ipynb · GitHub

My second question: if they are equivalent, why are these two functions implemented separately? Would it be simple to merge them into a single function?
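
For what it's worth, a quick numeric sketch (my own, not from the docs) of why the two are not interchangeable as *values*, even though their gradients match: the loss values differ by exactly the entropy of the target.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)
y_true = torch.softmax(torch.randn(4, 10), dim=-1)

kl = F.kl_div(F.log_softmax(logits, dim=-1), y_true, reduction="batchmean")
ce = F.cross_entropy(logits, y_true)

# Per-batch mean entropy of the target distribution.
entropy = -(y_true * y_true.log()).sum() / y_true.shape[0]

print(torch.allclose(kl + entropy, ce))  # True: CE = KL + H(y_true)
```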

So, what is the use case for calculating the entropy term of KLDivLoss?

I came up with a use case: contrastive learning, where the ground truth is itself the prediction of a model and gradients are also back-propagated through the ground-truth labels. A sketch of this scenario is below.
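
A rough sketch of that scenario (the teacher/student names are hypothetical; the point is only that the target itself requires grad). With respect to the prediction the two losses still agree, but with respect to the target they diverge, because y_true * log(y_true) is no longer constant:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Both sets of logits are trainable in this setup.
teacher_logits = torch.randn(4, 10, requires_grad=True)
student_logits = torch.randn(4, 10, requires_grad=True)

y_true = torch.softmax(teacher_logits, dim=-1)
log_pred = F.log_softmax(student_logits, dim=-1)

kl = F.kl_div(log_pred, y_true, reduction="batchmean")
ce = -(y_true * log_pred).sum() / y_true.shape[0]  # cross-entropy, batch mean

# Gradients w.r.t. the student (the prediction) are identical...
g_kl_s, = torch.autograd.grad(kl, student_logits, retain_graph=True)
g_ce_s, = torch.autograd.grad(ce, student_logits, retain_graph=True)
print(torch.allclose(g_kl_s, g_ce_s, atol=1e-6))  # True

# ...but w.r.t. the teacher (the target) they differ, because the
# y_true * log(y_true) term now carries a gradient of its own.
g_kl_t, = torch.autograd.grad(kl, teacher_logits, retain_graph=True)
g_ce_t, = torch.autograd.grad(ce, teacher_logits)
print(torch.allclose(g_kl_t, g_ce_t))  # False
```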