Question about the usage of kl_divergence

When I want to use KL divergence, I find several different and confusing use cases.
The formula for KL divergence is

$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$,

where P should be the target distribution and Q the input distribution.

According to the API doc of `F.kl_div`, I assume the first argument `input` should be Q, and the second argument `target` should be P.

However, when I asked GPT, its answer seems reversed. I have also seen some implementations that use the target distribution (P) as the first argument. This confuses me.

Why does the first argument need to be passed through `.log()`? If this is necessary, why not implement the `.log()` inside the `kl_div` function itself?
Should the first or the second argument be the true target distribution (P)?
Moreover, I also see there is an argument named `log_target`. I still do not understand why we need this argument.

Appreciate any reply!


To avoid underflow issues when computing this quantity, this loss expects the argument `input` in log space. The argument `target` may also be provided in log space if `log_target=True`:

if not log_target:
    loss = target * (target.log() - input)
else:
    loss = target.exp() * (target - input)
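
For example, here is a minimal sketch (my own, assuming a recent PyTorch version that has the `log_target` argument) comparing `F.kl_div` against the textbook formula:

import torch
import torch.nn.functional as F

P = torch.tensor([0.1, 0.2, 0.7])  # target ("true") distribution
Q = torch.tensor([0.3, 0.3, 0.4])  # input (predicted) distribution

# input is expected as log-probabilities, target as plain probabilities
kl = F.kl_div(Q.log(), P, reduction="sum")

# textbook KL(P || Q) = sum_x P(x) * log(P(x) / Q(x))
manual = (P * (P / Q).log()).sum()

# with log_target=True, the target is also given in log space
kl_log_target = F.kl_div(Q.log(), P.log(), reduction="sum", log_target=True)

print(kl, manual, kl_log_target)  # all three print the same value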

Hi Zhen Qiang,

Thanks for your reply, but I am still confused.

When P is the true distribution and Q is the predicted distribution,
may I say that
it should be F.kl_div(Q, P), because Q is the input and P is the target?
And in practice, to avoid the underflow issue, the input is passed in log space, i.e. F.kl_div(Q.log(), P).

Is my understanding correct?


Yes, there is no problem with your understanding. See F.kl_div and torch.nn.KLDivLoss for more details.
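
For example, a small sketch (my addition, not from the docs; assuming torch.distributions is available) showing that F.kl_div(Q.log(), P) matches torch.distributions.kl_divergence, which takes the true distribution P as its first argument:

import torch
import torch.nn.functional as F
from torch.distributions import Categorical, kl_divergence

P = torch.tensor([0.1, 0.2, 0.7])  # true distribution
Q = torch.tensor([0.3, 0.3, 0.4])  # predicted distribution

# F.kl_div: predicted distribution first, in log space
kl_functional = F.kl_div(Q.log(), P, reduction="sum")

# torch.distributions.kl_divergence: true distribution P first
kl_dist = kl_divergence(Categorical(probs=P), Categorical(probs=Q))

print(kl_functional, kl_dist)  # same value, despite the different argument order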