Torch.nn.functional.kl_div doesn't work as expected

daehwa001210 · November 19, 2022, 11:24am

with torch.no_grad():
        original_outputs = self.classification_model(original_images.to(device))
        changed_outputs = self.classification_model(changed_images.to(device))

# first(pytorch function)
rewards = torch.sum(
        torch.nn.functional.kl_div(
            changed_outputs.log(), original_outputs, size_average = None, reduction="none"
        ),
        dim=1,
)
print((rewards < 0).sum())

# second(mine)
k = 0
for i in range(batch_size):
        if((original_outputs[i]*(original_outputs[i]/changed_outputs[i]).log()).sum() < 0):
            k += 1
print(k)

And my classification model has SIGMOID at the end.

Both outputs(original, changed) add up to 1 in 1 dimension… However, when I f.kl_div the two results (original_outputs, changed_ouputs), negative numbers often came out, and it was the same when I made a function.

KFrank · November 19, 2022, 6:07pm

Hi Daehwa!

This looks a bit odd. It would be difficult to pass a tensor through sigmoid()
and still have a given dimension sum to one.

When given “legal” arguments, kl_div() should, indeed, only return tensors
that sum along the probability-distribution dimension to non-negative values.

Could you verify that both original_outputs and changed_outputs are
valid probability distributions along dim = 1? All of the values should be
between zero and one and, as you’ve noted, they should sum to one along
dim = 1.

If you are still having your issue, please save a copy of the original_outputs
and changed_outputs that lead to negative values for the sums of kl_div()
and post a fully-self-contained, runnable script that reproduces your issue
(with hard-coded values for those two tensors), together with its output.

Best.

K. Frank

daehwa001210 · November 20, 2022, 2:50am

print(torch.sum(original_outputs,dim=1))    # first code
print((torch.sum(original_outputs, dim=1)!= 1).sum())    # second code

My first code prints 1.000 or 1. by batch_size.

However, the second code is often False when compared to 1.

It’s probably because of the floating point that the error is caused by not adding to 1 perfectly.

It seems that kl_div is also confused with these floating points.

tensor([4.6966e-10, 7.7465e-25, 2.4687e-21, 1.1027e-20, 6.1598e-15, 3.7311e-14, 1.0000e+00, 1.0739e-20, 1.8067e-15, 7.9641e-18], device='cuda:0')    # original output
tensor([4.2339e-10, 7.5789e-25, 7.2592e-21, 6.8198e-21, 5.8756e-15, 2.1480e-14, 1.0000e+00, 3.6700e-21, 9.5929e-16, 3.5948e-18], device='cuda:0')    # changed output

here is my hard-coded two tensor.

I always want kl_div to be output as a positive number. But I think this result is due to the error of sigmoid’s floating point. How should I solve it?

In addition, it’s not related to my problem, but won’t it unintentionally interfere with machine learning if kl_div comes out as a negative number?

Thank you for your help

daehwa001210 · November 21, 2022, 4:13am

my CNN model!

sorry. I misspelled softmax as sigmoid in the post above.
Still, the problem remains.

KFrank · November 22, 2022, 7:15pm

Hi Daehwa!

daehwa001210:

It seems that kl_div is also confused with these floating points.

tensor([4.6966e-10, 7.7465e-25, 2.4687e-21, 1.1027e-20, 6.1598e-15, 3.7311e-14, 1.0000e+00, 1.0739e-20, 1.8067e-15, 7.9641e-18], device='cuda:0')    # original output
tensor([4.2339e-10, 7.5789e-25, 7.2592e-21, 6.8198e-21, 5.8756e-15, 2.1480e-14, 1.0000e+00, 3.6700e-21, 9.5929e-16, 3.5948e-18], device='cuda:0')    # changed output

Yes, I can reproduce your issue using these tensors.

I do believe that this is an issue of (expected) floating-point error. Because
of round-off error, your probability-distributions tensors don’t sum to exactly
one (although you have to perform the sum in double precision to see this).
Therefore they are not exactly “legal” probability distributions and the
property that kl_div() will always be non-negative won’t necessarily hold
exactly.

Note, although we get a result that is negative, it is quite close to zero, so
the violation of non-negativity is very small – and, indeed, on the order of
round-off error.

Consider this illustration that uses the tensors you posted:

>>> import torch
>>> print (torch.__version__)
1.13.0
>>>
>>> original = torch.tensor([4.6966e-10, 7.7465e-25, 2.4687e-21, 1.1027e-20, 6.1598e-15, 3.7311e-14, 1.0000e+00, 1.0739e-20, 1.8067e-15, 7.9641e-18])
>>> changed = torch.tensor([4.2339e-10, 7.5789e-25, 7.2592e-21, 6.8198e-21, 5.8756e-15, 2.1480e-14, 1.0000e+00, 3.6700e-21, 9.5929e-16, 3.5948e-18])
>>>
>>> changed.sum() - 1            # sums to one in single precision
tensor(0.)
>>> changed.double().sum() - 1   # but not quite exactly one, so not quite a "legal" probability distribution
tensor(4.2342e-10, dtype=torch.float64)
>>>
>>> torch.equal (changed, original)      # not strictly equal
False
>>> torch.allclose (changed, original)   # equal to within some sort of numerical error, so kl_div should be about zero
True
>>>
>>> kld = torch.nn.functional.kl_div (original.log(), changed, reduction = 'none').sum()
>>> kld                                        # negative, but not by much
tensor(-4.3925e-11)
>>> torch.allclose (kld, torch.tensor (0.0))   # zero to within some sort of numerical error
True

As an aside, it isn’t really useful posting screen-shots of textual information.
If someone wanted to use this in a test or example, they wouldn’t be able
to copy-paste it.

As discussed above, yes, I think this is a result of round-off error. If you
really need the result of kl_div() to be strictly non-negative – and I doubt
that you do – you could clamp the result to be greater than or equal to zero.

No, a negative value for a loss will cause no problem.

It is true that many loss functions are non-negative and only become zero
when the predictions are “perfect” (including kl-div() in the absence of
round-off error), but the optimization and training don’t care.

Consider training a regression with MSELoss() – strictly non-negative and
only zero for perfect predictions. Then consider the same training with
MSELoss() - 17.0, which can now become negative (and is equal to
-17.0 for perfect predictions). The gradients are the same and the training
is the same. The optimizer seeks to push the loss function to algebraically
smaller numbers – including more negative numbers that are larger in
absolute value – rather than pushing the loss specifically toward zero.

Best.

K. Frank