# Own BCELoss implementation gradients deviate slightly from pytorch version

Hi, I have a question regarding a custom loss function. I am trying to design a loss function that suits my purpose (it's related to adversarial attacks). I want to create a function based on BCELoss(), so I started off with my own BCELoss() implementation:

```python
import torch
import torch.nn as nn

class MlaLoss(nn.Module):
    def __init__(self, weight=None, size_average=True):
        super(MlaLoss, self).__init__()

    def forward(self, x, y):
        # Clamp the log terms at -100, as nn.BCELoss does, to avoid -inf
        positive_loss = torch.clamp(torch.log(x), min=-100)
        negative_loss = torch.clamp(torch.log(1 - x), min=-100)
        loss = -torch.mean(y * positive_loss + (1 - y) * negative_loss)
        return loss
```

When I compare the loss to PyTorch's own BCELoss() I get the same values; the backpropagated gradients, however, deviate slightly, which degrades my attack performance significantly (if not drastically):

```python
loss1 = nn.BCELoss()
loss2 = MlaLoss()

# Independent leaf copy so each loss accumulates its own gradient
input2 = input1.clone().detach().requires_grad_(True)

cost1 = loss1(input1, target)
cost2 = loss2(input2, target)

cost1.backward()
cost2.backward()

print(torch.sum(cost1 - cost2))              # loss difference
print(torch.sum(input1.grad - input2.grad))  # gradient difference
```

The outputs are as follows:

```
tensor(0., grad_fn=<SumBackward0>)
tensor(-5.5235e-08)
```
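One way to probe whether this is just float32 rounding is to rerun the comparison in float64 with fixed values (a sketch; the probabilities and targets here are made up):

```python
import torch
import torch.nn as nn

# Made-up float64 probabilities, kept away from 0 and 1
x1 = torch.tensor([0.1, 0.3, 0.5, 0.7, 0.9], dtype=torch.float64, requires_grad=True)
x2 = x1.clone().detach().requires_grad_(True)
y = torch.tensor([0.0, 1.0, 0.0, 1.0, 1.0], dtype=torch.float64)

# Built-in BCE
nn.BCELoss()(x1, y).backward()

# Hand-written BCE, as in the custom module above
(-torch.mean(y * torch.clamp(torch.log(x2), min=-100)
             + (1 - y) * torch.clamp(torch.log(1 - x2), min=-100))).backward()

# In float64 the gap shrinks to around machine epsilon,
# which suggests the float32 deviation is pure rounding error
print(torch.max(torch.abs(x1.grad - x2.grad)))
```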

How do I fix this deviation? Any suggestions or help would be very much appreciated!

Hi,
I'm not sure, but you should check the backward method of BCE in the PyTorch C++ source. There are two implementations:

Both of them use `d(L)/d(x) = -w * (y - x) / (x - x^2)` as the backward formula, but they differ in how they apply an epsilon.
The first one uses `d(L)/d(x) = -w * (y - x) / max(x - x^2, EPS)`,
and for the second one the derivative is `d(L)/d(x) = -w * (y - x) / ((1 - x + EPS) * (x + EPS))`.
I'd go with the first one.
BTW, if you want to use the GPU, you should check what the exact implementation is for it.
Still, I don't think this small difference causes your problem; it's within floating-point error range.
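For reference, both variants can be checked against autograd on a plain (unclamped) BCE, as long as the inputs stay away from 0 and 1; a quick sketch (the `EPS` value here is illustrative, not necessarily the one PyTorch uses):

```python
import torch

EPS = 1e-12  # illustrative epsilon, implementation-defined in practice
w = 1.0      # uniform weight

x = torch.tensor([0.2, 0.5, 0.8], requires_grad=True)
y = torch.tensor([0.0, 1.0, 1.0])
n = x.numel()

# Autograd gradient of a plain BCE with mean reduction
loss = -torch.mean(y * torch.log(x) + (1 - y) * torch.log(1 - x))
loss.backward()

with torch.no_grad():
    # Variant 1: -w*(y - x) / max(x - x^2, EPS), divided by n for the mean
    grad1 = -w * (y - x) / torch.clamp(x - x * x, min=EPS) / n
    # Variant 2: -w*(y - x) / ((1 - x + EPS) * (x + EPS)), divided by n for the mean
    grad2 = -w * (y - x) / ((1.0 - x + EPS) * (x + EPS)) / n

    # Away from 0 and 1 both agree with autograd up to float rounding
    print(torch.max(torch.abs(x.grad - grad1)))
    print(torch.max(torch.abs(x.grad - grad2)))
```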

Thank you for your reply. I'm using torch, so the second one applies to my case. Then I guess the backward passes of the operations used in my loss function do not align with the backward implemented here: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/Loss.cpp#L272. Does the backward of a division also use an epsilon?
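One way to probe this directly is to differentiate a bare division with autograd and compare against the exact quotient rule (a minimal check with made-up numbers):

```python
import torch

a = torch.tensor(1.0)
b = torch.tensor(0.25, requires_grad=True)

(a / b).backward()

# Quotient rule: d(a/b)/db = -a / b**2 = -1 / 0.0625 = -16
# (exact even in float32, since all values here are powers of two)
print(b.grad)  # tensor(-16.)
```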

¯\_(ツ)_/¯
Probably yes!