BCELoss Backward When The Input is Zero

Hello,

I want to check whether my model is calculating the gradient correctly or not.

I have used nn.BCELoss as the loss function. The PyTorch documentation mentions that the output of the loss will be clipped to -100 if it is smaller than that value.

My question is: what happens in the backward pass? If y is zero, it ends up in the denominator of the gradient, and the gradient would be inf. There is no explanation of the backward pass in the documentation. Can you explain how I should calculate it? Should I add an epsilon to it? Or …

Also, I have seen that one way to get the gradient of the loss with respect to the prediction is to use a hook, but the examples I found are very small and I do not know how to apply them. Can someone explain the procedure?
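
From the small examples I have seen, the pattern looks roughly like this (just my sketch of the idea, using a toy placeholder model and made-up names, not code from my project), but I am not sure it is correct:

import torch
import torch.nn as nn

# a tiny placeholder model and data, only to show the hook pattern
model = nn.Sequential(nn.Linear(10, 1), nn.Sigmoid())
inp = torch.randn(4, 10)
target = torch.randint(0, 2, (4, 1)).float()

pred = model(inp)
pred.register_hook(lambda grad: print("dLoss/dPred:", grad))  # gradient of the loss w.r.t. pred

loss = nn.BCELoss()(pred, target)
loss.backward()  # the hook fires here and prints the gradient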

Thanks,

Hi Amirali!

Just to be clear, the BCELoss documentation says that the individual
log() terms in BCELoss will be clamped to be >= -100, not the loss
itself.

My understanding of this (but I haven’t checked the code) is that
pytorch uses the “clamped-log” version of BCELoss rather than the
“true” BCELoss, and that the gradient produced by backward()
is the gradient of this modified loss function. That is, you modify
the loss function and then take the modified function’s gradient,
rather than modifying the loss function and its gradient separately.
(The derivative of the clamped-log term is well defined, and diverges
nowhere. It is zero in the clamped regime, but this doesn’t cause
any divergence.)
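
As a quick illustration of what I mean (a toy check of a clamped-log term on its own, not of BCELoss's actual implementation):

import torch

# use double precision so that 1e-50 does not underflow to zero
x = torch.tensor([1e-50], dtype=torch.double, requires_grad=True)

# gradient of -log(x) at a point where log(x) < -100: it blows up
(-torch.log(x)).sum().backward()
print(x.grad)    # about -1e+50

# gradient of the clamped version at the same point: it is zero
x.grad = None
(-torch.clamp(torch.log(x), min=-100)).sum().backward()
print(x.grad)    # zero -- the clamp is active, so the term is flat here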

(As an aside, the documentation uses x for the input (prediction)
and y for the target. It is x that appears in the log() terms, so it is
log (x) that gets clamped. backward() also computes the gradient
with respect to x (not y).)

Best.

K. Frank


Hello Frank,

Thanks for your answer. Yes, you are right about the clamped log. I have implemented it, and I am getting the same loss values. In case someone needs to check, this is the code:

import numpy as np

def BCE_Loss(x, y, weight):

    # clamped log(x): values below -100 are set to -100
    clamp_log_x = np.log(x)
    clamp_log_x[clamp_log_x < -100] = -100

    # clamped log(1 - x)
    clamp_log_1_x = np.log(1 - x)
    clamp_log_1_x[clamp_log_1_x < -100] = -100

    # weighted BCE, summed over all elements
    cal_loss = np.multiply(y, clamp_log_x) + np.multiply(1 - y, clamp_log_1_x)
    cal_loss = np.multiply(cal_loss, -1 * weight).sum()

    return cal_loss

The above function calculates the weighted BCE loss.
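
To sanity-check it, I compare it against nn.BCELoss roughly like this (a sketch with made-up tensors of the same shape as mine; my real x, y, and weights are printed further down):

import torch
import torch.nn as nn

x_t = torch.rand(1, 121)                     # made-up predictions in [0, 1)
y_t = torch.randint(0, 2, (1, 121)).float()  # made-up binary targets
w_t = torch.full((1, 121), 0.0033)           # made-up per-element weights

torch_loss = (nn.BCELoss(reduction='none')(x_t, y_t) * w_t).sum()
numpy_loss = BCE_Loss(x_t.numpy(), y_t.numpy(), w_t.numpy())
print(torch_loss.item(), numpy_loss)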

I also implemented the gradient of the loss.
I compared my layer gradients with PyTorch's, and because they were different, I used autograd (a custom Function) to write a model with my own backward:

@staticmethod
def backward(ctx, grad_output):
    print("grad:", grad_output.shape)
    print(grad_output)

Before this, I did not know how to access the gradient of the loss with respect to the output of my previous model, so I had to calculate the model's weight gradients and compare those instead. Now, with the above function, I am able to access the gradient of the loss directly, so I tried to compare it with mine. The interesting thing is that these two values are not equal.
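
For context, the pattern I am using looks roughly like this minimal sketch (an identity Function that only prints the incoming gradient; names like InspectGrad and raw_predictions are just for illustration, and my real model also applies my own changes in forward):

import torch

class InspectGrad(torch.autograd.Function):
    @staticmethod
    def forward(ctx, inp):
        return inp.view_as(inp)      # pass the prediction through unchanged

    @staticmethod
    def backward(ctx, grad_output):
        # grad_output is dLoss/d(prediction) flowing back from the loss
        print("grad:", grad_output.shape)
        print(grad_output)
        return grad_output           # pass the gradient through unchanged

# usage: predicted = InspectGrad.apply(raw_predictions)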

My loss is a weighted BCELoss, computed with this code:

criterion = nn.BCELoss(reduction='none')   # per-element BCE loss
loss = criterion(predicted, heatmaps)
weight_loss = loss * weights               # apply the per-element weights
sum_loss = torch.sum(weight_loss, dim=1)   # sum over dim 1
avg_loss = torch.mean(sum_loss)            # mean over the batch

and for the backward pass, I just call:

avg_loss.backward()

I am computing the gradient of BCELoss with this function:

def grad_BCEloss(x, y, weight):

    # derivative of the clamped log terms: 1/x where log(x) >= -100, else 0
    temp_x = np.array([1/num if np.log(num) >= -100 else 0 for num in x])
    temp_1_x = np.array([1/num if np.log(num) >= -100 else 0 for num in 1 - x])

    # dL/dx = weight * (-y/x + (1 - y)/(1 - x)), with the clamped terms zeroed out
    dloss_dx = weight*(-np.multiply(y, temp_x) + np.multiply(1 - y, temp_1_x))

    return dloss_dx
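
As another cross-check, PyTorch's gradient of the loss with respect to the prediction can also be pulled out directly with torch.autograd.grad (a sketch that reuses predicted, heatmaps, and weights from the loss code above, with batch size 1 so the mean over the batch does not add an extra factor; my function expects flat 1-D arrays here):

torch_grad = torch.autograd.grad(avg_loss, predicted, retain_graph=True)[0]   # PyTorch's dLoss/dPrediction

my_grad = grad_BCEloss(predicted.detach().numpy().flatten(),                  # my NumPy gradient on the same values
                       heatmaps.numpy().flatten(),
                       weights.numpy().flatten())

print(torch_grad)
print(my_grad)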

The strange thing is that the values from my function and PyTorch's are not the same. These are the input values:

weights: tensor([[0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033,
         0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033,
         0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033,
         0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033,
         0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033,
         0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033,
         0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033,
         0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033,
         0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033,
         0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033,
         0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033,
         0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033,
         0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033, 0.0033,
         0.0033, 0.0033, 0.0033, 0.0033]])


X: tensor([[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
         0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
         0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0674,
         0.0766, 0.0763, 0.0668, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
         0.0769, 0.1063, 0.1312, 0.1304, 0.1048, 0.0755, 0.0000, 0.0000, 0.0000,
         0.0000, 0.0674, 0.1063, 0.1725, 0.2489, 0.2460, 0.1684, 0.1038, 0.0659,
         0.0000, 0.0000, 0.0000, 0.0766, 0.1312, 0.2489, 0.5474, 0.4377, 0.2405,
         0.1273, 0.0747, 0.0000, 0.0000, 0.0000, 0.0763, 0.1304, 0.2460, 0.4377,
         0.4289, 0.2378, 0.1265, 0.0744, 0.0000, 0.0000, 0.0000, 0.0668, 0.1048,
         0.1684, 0.2405, 0.2378, 0.1645, 0.1023, 0.0653, 0.0000, 0.0000, 0.0000,
         0.0000, 0.0755, 0.1038, 0.1273, 0.1265, 0.1023, 0.0742, 0.0000, 0.0000,
         0.0000, 0.0000, 0.0000, 0.0000, 0.0659, 0.0747, 0.0744, 0.0653, 0.0000,
         0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
         0.0000, 0.0000, 0.0000, 0.0000]], grad_fn=<ViewBackward>)


Y: tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0.]])

Each of these is a tensor of size [batch_size, 121, 1]. This is my output:

dloss_dy: [[0.00334308 0.00334308 0.00334308 0.00334308 0.00334308 0.00334308
  0.00334308 0.00334308 0.00334308 0.00334308 0.00334308 0.00334308
  0.00334308 0.00334308 0.00334308 0.00334308 0.00334308 0.00334308
  0.00334308 0.00334308 0.00334308 0.00334308 0.00334308 0.00334308
  0.00334308 0.00334308 0.00358469 0.00362035 0.00361928 0.00358226
  0.00334308 0.00334308 0.00334308 0.00334308 0.00334308 0.00334308
  0.00362143 0.00374088 0.00384777 0.00384422 0.00373434 0.0036161
  0.00334308 0.00334308 0.00334308 0.00334308 0.00358469 0.00374088
  0.00403994 0.00445096 0.00443402 0.0040201  0.00373009 0.00357909
  0.00334308 0.00334308 0.00334308 0.00362035 0.00384777 0.00445096
  0.00738696 0.00594555 0.00440164 0.00383053 0.00361301 0.00334308
  0.00334308 0.00334308 0.00361928 0.00384422 0.00443402 0.00594555
  0.00585394 0.00438616 0.00382722 0.00361199 0.00334308 0.00334308
  0.00334308 0.00358226 0.00373434 0.0040201  0.00440164 0.00438616
  0.00400135 0.00372389 0.00357677 0.00334308 0.00334308 0.00334308
  0.00334308 0.0036161  0.00373009 0.00383053 0.00382722 0.00372389
  0.00361098 0.00334308 0.00334308 0.00334308 0.00334308 0.00334308
  0.00334308 0.00357909 0.00361301 0.00361199 0.00357677 0.00334308
  0.00334308 0.00334308 0.00334308 0.00334308 0.00334308 0.00334308
  0.00334308 0.00334308 0.00334308 0.00334308 0.00334308 0.00334308
  0.00334308]]

It makes sense that when x and y are 0 the gradient equals the weight, but this is PyTorch's result:

tensor([[[-0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000,
          -0.0000, -0.0000, -0.0000],
         [-0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000,
          -0.0000, -0.0000, -0.0000],
         [-0.0000, -0.0000, -0.0000, -0.0000, 0.0036, 0.0036, 0.0036, 0.0036,
          -0.0000, -0.0000, -0.0000],
         [-0.0000, -0.0000, -0.0000, 0.0036, 0.0037, 0.0038, 0.0038, 0.0037,
          0.0036, -0.0000, -0.0000],
         [-0.0000, -0.0000, 0.0036, 0.0037, 0.0040, 0.0045, 0.0044, 0.0040,
          0.0037, 0.0036, -0.0000],
         [-0.0000, -0.0000, 0.0036, 0.0038, 0.0045, 0.0074, 0.0059, 0.0044,
          0.0038, 0.0036, -0.0000],
         [-0.0000, -0.0000, 0.0036, 0.0038, 0.0044, 0.0059, 0.0059, 0.0044,
          0.0038, 0.0036, -0.0000],
         [-0.0000, -0.0000, 0.0036, 0.0037, 0.0040, 0.0044, 0.0044, 0.0040,
          0.0037, 0.0036, -0.0000],
         [-0.0000, -0.0000, -0.0000, 0.0036, 0.0037, 0.0038, 0.0038, 0.0037,
          0.0036, -0.0000, -0.0000],
         [-0.0000, -0.0000, -0.0000, -0.0000, 0.0036, 0.0036, 0.0036, 0.0036,
          -0.0000, -0.0000, -0.0000],
         [-0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000, -0.0000,
          -0.0000, -0.0000, -0.0000]]])

Two strange things happen:

  1. Where x is not 0, they have (approximately) the same value.
  2. Where x is 0, PyTorch's gradient is zero. (The first term is zero because of the clamped log, but the second term is (1-y)/(1-x) --> 1/1, so it would be multiplied by the weight and should therefore have the value of the weight.)

Can anyone explain why this is happening?

Also, I printed more values for this function: when the total loss is below 1, I get the same value as nn.BCELoss, but for values above 1 there is a large difference. For example, with these values:

x: tensor([[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
         0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
         0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0587, 0.0732,
         0.0820, 0.0792, 0.0668, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0587,
         0.0850, 0.1194, 0.1449, 0.1363, 0.1033, 0.0718, 0.0000, 0.0000, 0.0000,
         0.0000, 0.0732, 0.1194, 0.2001, 0.2839, 0.2528, 0.1588, 0.0949, 0.0599,
         0.0000, 0.0000, 0.0000, 0.0820, 0.1449, 0.2839, 0.5244, 0.4029, 0.2074,
         0.1103, 0.0657, 0.0000, 0.0000, 0.0000, 0.0792, 0.1363, 0.2528, 0.4029,
         0.3431, 0.1903, 0.1052, 0.0639, 0.0000, 0.0000, 0.0000, 0.0668, 0.1033,
         0.1588, 0.2074, 0.1903, 0.1317, 0.0845, 0.0000, 0.0000, 0.0000, 0.0000,
         0.0000, 0.0718, 0.0949, 0.1103, 0.1052, 0.0845, 0.0622, 0.0000, 0.0000,
         0.0000, 0.0000, 0.0000, 0.0000, 0.0599, 0.0657, 0.0639, 0.0000, 0.0000,
         0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
         0.0000, 0.0000, 0.0000, 0.0000]], grad_fn=<ViewBackward>)


y: tensor([[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.1667, 0.2857, 0.5000, 0.6667,
         0.5000, 0.2857, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.1818, 0.3333,
         0.6667, 1.0000, 0.6667, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
         0.1667, 0.2857, 0.5000, 0.6667, 0.5000, 0.2857, 0.0000, 0.0000, 0.0000,
         0.0000, 0.0000, 0.1333, 0.2000, 0.2857, 0.3333, 0.2857, 0.2000, 0.0000,
         0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.1333, 0.1667, 0.1818, 0.1667,
         0.1333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
         0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
         0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
         0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
         0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
         0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
         0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
         0.0000, 0.0000, 0.0000, 0.0000]])

The nn.BCELoss value is 118.65484619140625, but my function gives 424.10596. I do not know why I get the right answer in some places and not in others.

Can you help with this? @ptrblck @albanD

I’m not completely sure what you are trying to achieve, but you might be saturating the sigmoid with very high or very low values, as shown in this post.
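
For example, large-magnitude logits already give outputs of exactly 0.0 and 1.0 in float32 (a quick illustration):

import torch

logits = torch.tensor([-200.0, 0.0, 200.0])
print(torch.sigmoid(logits))   # prints 0., 0.5, and 1. -- saturated at both ends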

Thanks for your reply.

The reason I am doing all this is that I apply some changes after my model's output. PyTorch calculates the gradient, but the problem is that I have implemented those changes in another way and the gradient values are different (with the same initialization, the same input, and …). Based on this, I think PyTorch is not calculating the gradient correctly, so I used autograd to define my own backward as well.

I am trying to check BCELoss's gradient and output. I do use a sigmoid before these layers, but I do not think the sigmoid is the problem here.

I have x and y, and I am trying to compute the loss value and its gradient.

The loss value is the same 99% of the time, but the gradient is never the same. I have put the code and the values in question here in the thread.

In the PyTorch implementation, is the clamped log used for the backward pass too? I mean, is this statement right?

if log(x) is smaller than -100, the gradient is zero
if log(x) is greater than or equal to -100, the gradient is 1/x

If yes, why am I not getting the same gradient as PyTorch?
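
Here is the single-element check I am looking at (a minimal sketch for x = 0 and y = 0):

import torch
import torch.nn as nn

x = torch.tensor([0.0], requires_grad=True)
y = torch.tensor([0.0])

loss = nn.BCELoss()(x, y)
loss.backward()
print(loss.item(), x.grad)   # by my formula above, the gradient for this element should be (1 - y)/(1 - x) = 1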