Hi, I have a quick question about KL divergence loss in PyTorch.

Is it okay to use sigmoid instead of softmax for the input? In most cases I have seen, a softmax probability distribution is used, but I wonder whether it makes sense to use KL divergence loss for a multi-label target.

I also want to use KL divergence loss with an input and target that have more than two dimensions. In that case, does KLDivLoss compare the distributions along the last dimension of input and target, or does it compare across all dimensions except the batch dimension?

KLDivLoss doesn't care what any of the dimensions, including
the batch dimension, are. It simply performs an element-wise
computation and, by default, takes the mean over all elements
(but see its documentation for how reduction = 'batchmean' and
the deprecated reduce and size_average constructor arguments
affect its treatment of the batch dimension).

>>> import torch
>>> torch.__version__
'1.9.0'
>>> _ = torch.manual_seed (2021)
>>> input = -torch.randn (3, 5).abs()
>>> target = torch.rand (3, 5)
>>> torch.nn.KLDivLoss() (input, target)
<path_to_pytorch_install>\torch\nn\functional.py:2742: UserWarning: reduction: 'mean' divides the total loss by both the batch size and the support size.'batchmean' divides only by the batch size, and aligns with the KL div math definition.'mean' will be changed to behave the same as 'batchmean' in the next major release.
"reduction: 'mean' divides the total loss by both the batch size and the support size."
tensor(0.3705)
>>> torch.nn.KLDivLoss() (input.T, target.T)
tensor(0.3705)
>>> torch.nn.KLDivLoss (reduction = 'none') (input, target).mean()
tensor(0.3705)

Whether the result of KLDivLoss represents a proper Kullback–Leibler
divergence across certain dimensions depends on whether your input and
target are themselves proper (log-)probability distributions
across those dimensions.
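As a rough sketch of what "proper distributions across those dimensions" could look like in practice, the example below covers two cases. The first normalizes with softmax along the last dimension of a three-dimensional tensor and sums the pointwise terms over that dimension. The second is one possible way (my assumption, not something KLDivLoss does for you) to handle sigmoid outputs for a multi-label target: treat each label as a two-outcome distribution (p, 1 - p) and stack the two outcomes along a new last dimension. The shapes and variable names are made up for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(2021)

# Case 1: distributions over the last dimension of a (batch, seq, classes) tensor.
logits_q = torch.randn(2, 4, 5)
logits_p = torch.randn(2, 4, 5)
log_q = F.log_softmax(logits_q, dim=-1)   # input: log-probabilities
p = F.softmax(logits_p, dim=-1)           # target: probabilities

# Summing the pointwise terms over the class dimension gives one
# KL divergence per (batch, seq) position.
kl_per_dist = F.kl_div(log_q, p, reduction="none").sum(dim=-1)   # shape (2, 4)

# 'batchmean' divides the grand total by the size of dim 0 only,
# so here it is the sum over seq positions, averaged over the batch.
kl_batchmean = F.kl_div(log_q, p, reduction="batchmean")

# Case 2 (assumed multi-label setup): each label is its own Bernoulli.
logits = torch.randn(3, 5)
soft_targets = torch.rand(3, 5)           # per-label target probabilities

# log(sigmoid(x)) and log(1 - sigmoid(x)) = log(sigmoid(-x)).
log_q2 = torch.stack([F.logsigmoid(logits), F.logsigmoid(-logits)], dim=-1)
p2 = torch.stack([soft_targets, 1 - soft_targets], dim=-1)

# One Bernoulli KL divergence per (sample, label) pair.
kl_per_label = F.kl_div(log_q2, p2, reduction="none").sum(dim=-1)  # shape (3, 5)
```

If you feed raw sigmoid outputs in directly, without the complementary (1 - p) term, KLDivLoss will still return a number, but it won't be a Kullback–Leibler divergence along any dimension.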