Question of usage of kl_divergence

Rizhao_Cai · September 20, 2024, 6:30am

While trying to use KL divergence, I came across several different, and somewhat strange, usages.
The formulation of KL divergence is

$$D_{KL}(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$

where P is the target distribution and Q is the input distribution.

According to the API doc, I assume the first argument, `input`, should be Q, and the second argument, `target`, should be P.

However, when I asked GPT, the answer it gave seemed reversed. I have also seen implementations that pass the target distribution (P) as the first argument. This confuses me.

Why does the first argument need the .log() call? If it is necessary, why isn't .log() applied inside `kl_div` itself?
Which argument, the first or the second, should be the true target distribution (P)?
Also, there is an argument named log_target. I still do not understand why it is needed.

Appreciate any reply!

zhen0qiang · September 20, 2024, 11:31am

To avoid underflow issues when computing this quantity, this loss expects the argument `input` in log-space. The argument `target` may also be provided in log-space if `log_target=True`.
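As a rough illustration of the underflow point (a sketch with made-up logits and target, not taken from the docs): if the model produces logits, calling `log_softmax` keeps everything in log-space, whereas `softmax` followed by `.log()` can turn a tiny probability into exactly 0 and then into `-inf`. That is presumably also why the log is left to the caller rather than applied inside `kl_div`.

```python
import torch
import torch.nn.functional as F

# Made-up logits with a large gap: the second class has a tiny but nonzero probability.
logits = torch.tensor([0.0, -200.0])
target = torch.tensor([0.5, 0.5])  # illustrative target distribution P

# Route 1: probabilities first, then log.
# exp(-200) underflows to 0 in float32, so its log is -inf and the loss is inf.
q = torch.softmax(logits, dim=0)
loss_naive = F.kl_div(q.log(), target, reduction="sum")

# Route 2: stay in log-space the whole time.
log_q = torch.log_softmax(logits, dim=0)
loss_stable = F.kl_div(log_q, target, reduction="sum")

print(loss_naive)   # not finite
print(loss_stable)  # finite
```

With a `.log()` hidden inside the function, Route 2 would not be available to the caller.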

if not log_target:  # default: target holds probabilities
    loss = target * (target.log() - input)
else:  # target holds log-probabilities
    loss = target.exp() * (target - input)
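To check this against the actual function, here is a small sketch with arbitrary random distributions (the shapes and seed are just for illustration). It compares the two pointwise formulas with `F.kl_div(..., reduction="none")` and shows that `log_target` only changes how the target is passed in, not the result.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Arbitrary distributions over 5 classes for a batch of 3.
log_q = torch.log_softmax(torch.randn(3, 5), dim=1)  # input, already in log-space
p = torch.softmax(torch.randn(3, 5), dim=1)          # target as probabilities

# log_target=False (default): target is given as probabilities.
manual = p * (p.log() - log_q)
builtin = F.kl_div(log_q, p, reduction="none", log_target=False)

# log_target=True: target is given as log-probabilities.
log_p = p.log()
manual_log = log_p.exp() * (log_p - log_q)
builtin_log = F.kl_div(log_q, log_p, reduction="none", log_target=True)

print(torch.allclose(manual, builtin, atol=1e-6))
print(torch.allclose(manual_log, builtin_log, atol=1e-6))
print(torch.allclose(builtin, builtin_log, atol=1e-6))
```

So `log_target=True` is mainly useful when the target also comes from a `log_softmax`, for example another network's output, and you do not want to exponentiate it yourself.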

Rizhao_Cai · September 20, 2024, 2:49pm

Hi Zhen Qiang,

Thanks for your reply, but I am still confused.

When P is the true distribution and Q is the predicted distribution,
may I say that
it should be F.kl_div(Q, P), because Q is the input and P is the target?
But since the input must be in log-space to avoid the underflow issue, F.kl_div(Q.log(), P) is what is used in practice.

Is my understanding correct?

zhen0qiang · September 22, 2024, 4:05am

Yes, there is no problem with your understanding. See F.kl_div and torch.nn.KLDivLoss for more details.
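For a quick sanity check on the argument order, here is a sketch with two made-up distributions. It compares `F.kl_div` against the textbook sum written out by hand: with the log of the prediction Q as `input` and the true distribution P as `target`, the result is KL(P || Q), and swapping the roles gives the reverse direction KL(Q || P).

```python
import torch
import torch.nn.functional as F

# Made-up distributions: P is the true/target distribution, Q the prediction.
P = torch.tensor([0.7, 0.2, 0.1])
Q = torch.tensor([0.5, 0.3, 0.2])

# Textbook definitions, written out by hand.
kl_pq = (P * (P / Q).log()).sum()  # KL(P || Q)
kl_qp = (Q * (Q / P).log()).sum()  # KL(Q || P)

# input = log Q, target = P  ->  KL(P || Q)
a = F.kl_div(Q.log(), P, reduction="sum")
# input = log P, target = Q  ->  KL(Q || P), the reverse direction
b = F.kl_div(P.log(), Q, reduction="sum")

print(torch.allclose(a, kl_pq, atol=1e-6))  # first call matches KL(P || Q)
print(torch.allclose(b, kl_qp, atol=1e-6))  # swapped call matches KL(Q || P)
```

`reduction="sum"` is used here only so the numbers line up with the hand-written sums. For batched inputs the documentation recommends `reduction="batchmean"`, since the default `"mean"` averages over all elements rather than over the batch.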