Why should the entropy term of KLDivLoss be calculated? Isn't it meaningless when computing gradients?

Since the first term of KLDivLoss[1] (the negative entropy of the ground truth, y_true * log(y_true)) is constant with respect to the prediction, it contributes nothing to the gradient.
I also verified in my notebook[2] that the gradients computed by KLDivLoss and CrossEntropyLoss are equal (Figure 1).
So, what is the use case for calculating the entropy term of KLDivLoss? Isn't it meaningless when computing gradients?

Figure 1: Comparison between the gradients of CELoss and KLDivLoss.
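
The check in the notebook boils down to something like the following minimal sketch (my own reproduction, not the notebook itself; it assumes PyTorch ≥ 1.10 so that cross_entropy accepts probability targets):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10, requires_grad=True)
y_true = torch.softmax(torch.randn(4, 10), dim=-1)  # soft ground-truth distribution

# KLDivLoss expects log-probabilities as input and probabilities as target.
kl = F.kl_div(F.log_softmax(logits, dim=-1), y_true, reduction="batchmean")
grad_kl, = torch.autograd.grad(kl, logits)

# cross_entropy accepts probability (soft) targets since PyTorch 1.10.
ce = F.cross_entropy(logits, y_true)
grad_ce, = torch.autograd.grad(ce, logits)

print(torch.allclose(grad_kl, grad_ce, atol=1e-6))  # True: identical gradients
```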


  1. KLDivLoss — PyTorch 2.1 documentation

  2. kl_divergence.ipynb · GitHub

My second question: if they are equivalent, why are these two functions implemented separately? Would it be simple to merge them into a single function?
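
For what it's worth, a quick numeric sketch (my own, not from the docs) of why the two are not interchangeable as *values*, even though their gradients match: the loss values differ by exactly the entropy of the target.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)
y_true = torch.softmax(torch.randn(4, 10), dim=-1)

kl = F.kl_div(F.log_softmax(logits, dim=-1), y_true, reduction="batchmean")
ce = F.cross_entropy(logits, y_true)

# Per-batch mean entropy of the target distribution.
entropy = -(y_true * y_true.log()).sum() / y_true.shape[0]

print(torch.allclose(kl + entropy, ce))  # True: CE = KL + H(y_true)
```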

So, what is the use case for calculating the entropy term of KLDivLoss?

I came up with a use case: contrastive learning, where the ground truth is itself the prediction of a model and gradients are also back-propagated through the ground-truth labels. A sketch of this scenario is below.
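
A rough sketch of that scenario (the teacher/student names are hypothetical; the point is only that the target itself requires grad). With respect to the prediction the two losses still agree, but with respect to the target they diverge, because y_true * log(y_true) is no longer constant:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Both sets of logits are trainable in this setup.
teacher_logits = torch.randn(4, 10, requires_grad=True)
student_logits = torch.randn(4, 10, requires_grad=True)

y_true = torch.softmax(teacher_logits, dim=-1)
log_pred = F.log_softmax(student_logits, dim=-1)

kl = F.kl_div(log_pred, y_true, reduction="batchmean")
ce = -(y_true * log_pred).sum() / y_true.shape[0]  # cross-entropy, batch mean

# Gradients w.r.t. the student (the prediction) are identical...
g_kl_s, = torch.autograd.grad(kl, student_logits, retain_graph=True)
g_ce_s, = torch.autograd.grad(ce, student_logits, retain_graph=True)
print(torch.allclose(g_kl_s, g_ce_s, atol=1e-6))  # True

# ...but w.r.t. the teacher (the target) they differ, because the
# y_true * log(y_true) term now carries a gradient of its own.
g_kl_t, = torch.autograd.grad(kl, teacher_logits, retain_graph=True)
g_ce_t, = torch.autograd.grad(ce, teacher_logits)
print(torch.allclose(g_kl_t, g_ce_t))  # False
```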