# BCELoss Backward When The Input is Zero

Hello,

I want to check whether my model is calculating the gradient correctly.

I have used nn.BCELoss for the loss function. Based on the PyTorch website, it is mentioned that the output of the loss will be clamped to -100 if it is smaller than that value.

My question is: what will happen in the backward pass? If `y` is zero, it appears in the denominator of the gradient, so the gradient would be `inf`. The site has no explanation of the backward pass. Can you explain how I should calculate it? Should I add an `epsilon` to it? Or …

Also, I have seen that one way to get the gradient of the loss with respect to the prediction is to use a `hook`, but the examples I found are small and I do not know how to do it. Can someone explain the procedure?

Thanks,

Hi Amirali!

Just to be clear, the BCELoss documentation says that the individual
`log()` terms in `BCELoss` will be clamped to be `>= -100`, not the loss
itself.

My understanding of this (but I haven’t checked the code) is that
pytorch uses the “clamped-log” version of `BCELoss` rather than the
“true” `BCELoss`, and that the gradient produced by `backward()`
is the gradient of this modified loss function. That is, you modify
the loss function and then take the modified function’s gradient,
rather than modifying the loss function and its gradient separately.
(The derivative of the clamped-log term is well defined, and diverges
nowhere. It is zero in the clamped regime, but this doesn’t cause
any divergence.)

(As an aside, the documentation uses `x` for the `input` (prediction)
and `y` for the target. It is `x` that appears in the `log()` terms, so it is
`log (x)` that gets clamped. `backward()` also computes the gradient
with respect to `x` (not `y`).)
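(A quick way to see this in practice, as a minimal sketch: with `input = 0` and `target = 0`, the `log (x)` term is clamped, so `backward()` produces a finite gradient rather than `inf`.)

```python
import torch
import torch.nn as nn

# At x = 0 the "true" BCE gradient, -y/x + (1-y)/(1-x), would involve 1/0,
# but the clamped-log version keeps backward() finite.
x = torch.zeros(1, requires_grad=True)
y = torch.zeros(1)
loss = nn.BCELoss()(x, y)
loss.backward()
print(x.grad)  # finite, no inf or nan
```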

Best.

K. Frank


Hello Frank,

Thanks for your answer. Yes, you are right about the `clamp-log`. I have implemented it, and I am getting the same loss values. In case someone needs to check, this is the code:

``````import numpy as np

def BCE_Loss(x, y, weight):
    # clamp the log() terms from below at -100, as the docs describe
    clamp_log_x = np.log(x)
    clamp_log_x[clamp_log_x < -100] = -100

    clamp_log_1_x = np.log(1 - x)
    clamp_log_1_x[clamp_log_1_x < -100] = -100

    cal_loss = np.multiply(y, clamp_log_x) + np.multiply(1 - y, clamp_log_1_x)
    cal_loss = np.multiply(cal_loss, -1 * weight).sum()

    return cal_loss
``````

The above function calculates the weighted BCE loss.
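A quick way to sanity-check a hand-written loss like this (a minimal, self-contained sketch with made-up inputs kept away from 0 and 1, so the clamp never fires) is to compare the weighted sum against `nn.BCELoss` directly:

```python
import numpy as np
import torch
import torch.nn as nn

rng = np.random.default_rng(0)
x = rng.uniform(0.01, 0.99, size=8).astype(np.float32)
y = rng.integers(0, 2, size=8).astype(np.float32)
w = np.full(8, 0.0033, dtype=np.float32)

# hand-written weighted BCE (no clamp needed: log(x) > -100 on [0.01, 0.99])
mine = float((-w * (y * np.log(x) + (1 - y) * np.log(1 - x))).sum())

# PyTorch reference with the same per-element weighting
ref = (nn.BCELoss(reduction='none')(torch.from_numpy(x), torch.from_numpy(y))
       * torch.from_numpy(w)).sum().item()
assert abs(mine - ref) < 1e-5
```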

I implemented the gradient of the loss myself.
I compared my gradient for the layers with PyTorch's, and because it was different, I used `autograd` to write a model with my own backward:

``````@staticmethod

``````

Before this model, I did not know how to access the gradient of the loss with respect to the output of my previous model, so I had to calculate the model's weights and compare those instead. Now, with the above function, I can access the gradient of the loss, so I tried to compare it to mine. The interesting thing is that the two values are not equal.

My loss is a weighted BCELoss, computed with this code:

``````criterion = nn.BCELoss(reduction='none')
loss = criterion(predicted, heatmaps)
weight_loss = loss * weights
sum_loss = torch.sum(weight_loss, dim=1)
avg_loss = torch.mean(sum_loss)
``````

and in the backward, I just call:

``````avg_loss.backward()
``````

I am computing the gradient of the BCELoss with this function:

``````def grad_BCEloss(x, y, weight):
    # 1/x where log(x) >= -100; 0 where the clamp is active
    temp_x = np.array([1 / num if np.log(num) >= -100 else 0 for num in x])
    # 1/(1-x) where log(1-x) >= -100; 0 where the clamp is active
    temp_1_x = np.array([1 / num if np.log(num) >= -100 else 0 for num in 1 - x])

    dloss_dx = weight * (-np.multiply(y, temp_x) + np.multiply(1 - y, temp_1_x))

    return dloss_dx
``````

The strange thing is that the values from my function and PyTorch's are not the same. These are the input values:

``````weights: tensor([[0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033,
0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033,
0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033,
0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033,
0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033,
0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033,
0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033,
0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033,
0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033,
0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033,
0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033,
0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033,
0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033,
0.0033, 0.0033, 0.0033, 0.0033]])

X: tensor([[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0674,
0.0766, 0.0763, 0.0668, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0769, 0.1063, 0.1312, 0.1304, 0.1048, 0.0755, 0.0000, 0.0000, 0.0000,
0.0000, 0.0674, 0.1063, 0.1725, 0.2489, 0.2460, 0.1684, 0.1038, 0.0659,
0.0000, 0.0000, 0.0000, 0.0766, 0.1312, 0.2489, 0.5474, 0.4377, 0.2405,
0.1273, 0.0747, 0.0000, 0.0000, 0.0000, 0.0763, 0.1304, 0.2460, 0.4377,
0.4289, 0.2378, 0.1265, 0.0744, 0.0000, 0.0000, 0.0000, 0.0668, 0.1048,
0.1684, 0.2405, 0.2378, 0.1645, 0.1023, 0.0653, 0.0000, 0.0000, 0.0000,
0.0000, 0.0755, 0.1038, 0.1273, 0.1265, 0.1023, 0.0742, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0659, 0.0747, 0.0744, 0.0653, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,

Y: tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0.]])
``````

Each of these is a tensor of size [batch_size, 121, 1]. This is my output:

``````dloss_dy: [[0.00334308 0.00334308 0.00334308 0.00334308 0.00334308 0.00334308
0.00334308 0.00334308 0.00334308 0.00334308 0.00334308 0.00334308
0.00334308 0.00334308 0.00334308 0.00334308 0.00334308 0.00334308
0.00334308 0.00334308 0.00334308 0.00334308 0.00334308 0.00334308
0.00334308 0.00334308 0.00358469 0.00362035 0.00361928 0.00358226
0.00334308 0.00334308 0.00334308 0.00334308 0.00334308 0.00334308
0.00362143 0.00374088 0.00384777 0.00384422 0.00373434 0.0036161
0.00334308 0.00334308 0.00334308 0.00334308 0.00358469 0.00374088
0.00403994 0.00445096 0.00443402 0.0040201  0.00373009 0.00357909
0.00334308 0.00334308 0.00334308 0.00362035 0.00384777 0.00445096
0.00738696 0.00594555 0.00440164 0.00383053 0.00361301 0.00334308
0.00334308 0.00334308 0.00361928 0.00384422 0.00443402 0.00594555
0.00585394 0.00438616 0.00382722 0.00361199 0.00334308 0.00334308
0.00334308 0.00358226 0.00373434 0.0040201  0.00440164 0.00438616
0.00400135 0.00372389 0.00357677 0.00334308 0.00334308 0.00334308
0.00334308 0.0036161  0.00373009 0.00383053 0.00382722 0.00372389
0.00361098 0.00334308 0.00334308 0.00334308 0.00334308 0.00334308
0.00334308 0.00357909 0.00361301 0.00361199 0.00357677 0.00334308
0.00334308 0.00334308 0.00334308 0.00334308 0.00334308 0.00334308
0.00334308 0.00334308 0.00334308 0.00334308 0.00334308 0.00334308
0.00334308]]
``````

This makes sense: when `x` and `y` are `0`, the gradient equals the weight. But this is PyTorch's result:

``````tensor([[[-0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000,
-0.0000, -0.0000, -0.0000],
[-0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000,
-0.0000, -0.0000, -0.0000],
[-0.0000, -0.0000, -0.0000, -0.0000, 0.0036, 0.0036, 0.0036, 0.0036,
-0.0000, -0.0000, -0.0000],
[-0.0000, -0.0000, -0.0000, 0.0036, 0.0037, 0.0038, 0.0038, 0.0037,
0.0036, -0.0000, -0.0000],
[-0.0000, -0.0000, 0.0036, 0.0037, 0.0040, 0.0045, 0.0044, 0.0040,
0.0037, 0.0036, -0.0000],
[-0.0000, -0.0000, 0.0036, 0.0038, 0.0045, 0.0074, 0.0059, 0.0044,
0.0038, 0.0036, -0.0000],
[-0.0000, -0.0000, 0.0036, 0.0038, 0.0044, 0.0059, 0.0059, 0.0044,
0.0038, 0.0036, -0.0000],
[-0.0000, -0.0000, 0.0036, 0.0037, 0.0040, 0.0044, 0.0044, 0.0040,
0.0037, 0.0036, -0.0000],
[-0.0000, -0.0000, -0.0000, 0.0036, 0.0037, 0.0038, 0.0038, 0.0037,
0.0036, -0.0000, -0.0000],
[-0.0000, -0.0000, -0.0000, -0.0000, 0.0036, 0.0036, 0.0036, 0.0036,
-0.0000, -0.0000, -0.0000],
[-0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000,
-0.0000, -0.0000, -0.0000]]])
``````

Two strange things happen:

1. Where `x` is not `0`, they have approximately the same value.
2. Where `x` is `0`, PyTorch's gradient is zero. (The first term is zero because of the `clamp-log`, but the second term is `(1-y)/(1-x) = 1/1`, which is multiplied by the weight, so it should have the value of the weight.)

Can anyone explain why this is happening?
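(A minimal way to reproduce this in isolation, as a sketch with a single element where `x = 0` and `y = 0`; `w` stands in for one entry of the weight tensor:)

```python
import torch
import torch.nn as nn

w = 0.0033  # one entry of the weight tensor
x = torch.zeros(1, requires_grad=True)
y = torch.zeros(1)
loss = (nn.BCELoss(reduction='none')(x, y) * w).sum()
loss.backward()
print(x.grad)  # zero, matching the -0.0000 entries above, not w
```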

Also, printing more values from this function: when the total loss is below `1`, I get the same value as `nn.BCELoss`, but for values above `1` there is a large difference. For example, with these values:

``````x: tensor([[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0587, 0.0732,
0.0820, 0.0792, 0.0668, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0587,
0.0850, 0.1194, 0.1449, 0.1363, 0.1033, 0.0718, 0.0000, 0.0000, 0.0000,
0.0000, 0.0732, 0.1194, 0.2001, 0.2839, 0.2528, 0.1588, 0.0949, 0.0599,
0.0000, 0.0000, 0.0000, 0.0820, 0.1449, 0.2839, 0.5244, 0.4029, 0.2074,
0.1103, 0.0657, 0.0000, 0.0000, 0.0000, 0.0792, 0.1363, 0.2528, 0.4029,
0.3431, 0.1903, 0.1052, 0.0639, 0.0000, 0.0000, 0.0000, 0.0668, 0.1033,
0.1588, 0.2074, 0.1903, 0.1317, 0.0845, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0718, 0.0949, 0.1103, 0.1052, 0.0845, 0.0622, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0599, 0.0657, 0.0639, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,

y: tensor([[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.1667, 0.2857, 0.5000, 0.6667,
0.5000, 0.2857, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.1818, 0.3333,
0.6667, 1.0000, 0.6667, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.1667, 0.2857, 0.5000, 0.6667, 0.5000, 0.2857, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.1333, 0.2000, 0.2857, 0.3333, 0.2857, 0.2000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.1333, 0.1667, 0.1818, 0.1667,
0.1333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000]])
``````

nn.BCELoss gives `118.65484619140625`, but my function gives `424.10596`. I do not know why I get the right answer in some places and not in others.

Can you help with this? @ptrblck @albanD

I’m not completely sure what you are trying to achieve, but you might be saturating the sigmoid with very high or low input values, as shown in this post.

The reason I am doing all this is that I apply some changes after my model's output. PyTorch calculates the gradient, but I have implemented those changes in another way and the gradient value is different (with the same initialization, same input, etc.). Based on this, I think PyTorch may not be calculating the gradient correctly, so I used `autograd` to define my own backward as well.

I am trying to check BCELoss's gradient and output. I use a sigmoid before these layers, but I do not think the sigmoid is the problem here.

I have `x` and `y`, and I am trying to compute the loss value and its gradient.

The loss value is the same 99% of the time, but the gradient is never the same. I have put the code and the values I am getting here.

In the PyTorch implementation, is `clamp-log` used for the backward too? I mean, is this statement right?

if `log(x)` is smaller than `-100`, the gradient is `zero`
if `log(x)` is greater than or equal to `-100`, the gradient is `1/x`

If yes, why am I not getting the same gradient as PyTorch?
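(For reference, the statement above written out as a small NumPy sketch; the condition `log(x) >= -100` is equivalent to `x >= exp(-100)`, which avoids taking `log(0)`:)

```python
import numpy as np

def grad_clamped_log(x):
    """Gradient of max(log(x), -100): 1/x where log(x) >= -100, else 0."""
    x = np.asarray(x, dtype=np.float64)
    g = np.zeros_like(x)
    mask = x >= np.exp(-100.0)  # same condition as log(x) >= -100
    g[mask] = 1.0 / x[mask]
    return g

print(grad_clamped_log([0.0, 0.5, 1.0]))  # [0. 2. 1.]
```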