BCEWithLogitsLoss gives a different loss than BCELoss(torch.sigmoid())

Hi Guys,

So I am struggling with my model: my loss sometimes explodes and I don’t understand why.

This is the model that goes crazy:


import torch
import torch.nn as nn
import torch.nn.functional as F


class MLPgocrazy(nn.Module):
    def __init__(self, layersize, dropout):
        super(MLPgocrazy, self).__init__()
        self.f1 = nn.Linear(layersize[0], layersize[1])
        self.f2 = nn.Linear(layersize[1], layersize[2])
        self.f3 = nn.Linear(layersize[2], layersize[3])
        self.f4 = nn.Linear(layersize[3], layersize[4])
        self.f5 = nn.Linear(layersize[4], layersize[5])
        self.bn1 = nn.BatchNorm1d(layersize[1])
        self.bn2 = nn.BatchNorm1d(layersize[2])
        self.bn3 = nn.BatchNorm1d(layersize[3])
        self.bn4 = nn.BatchNorm1d(layersize[4])
        self.bn5 = nn.BatchNorm1d(layersize[5])

        self.dropout = dropout

    def forward(self, x):
        x = F.relu(self.f1(x))
        # x = F.dropout(x, self.dropout, training=self.training)
        # x = self.bn1(x)

        x = F.relu(self.f2(x))
        # x = F.dropout(x, self.dropout, training=self.training)
        # x = self.bn2(x)

        x = F.relu(self.f3(x))
        # x = F.dropout(x, self.dropout, training=self.training)
        # x = self.bn3(x)

        x = F.relu(self.f4(x))
        # x = F.dropout(x, self.dropout, training=self.training)
        # fingerprint = x = self.bn4(x)
        fingerprint = 1
        x = self.f5(x)
        return x, fingerprint

I wanted to see what happens: in most cases I get a decent loss, but sometimes the loss explodes and is unreasonably high, so I thought something was wrong with my loss function.

If I run this example:


import torch.optim as optim

torch.manual_seed(1234)
model = MLPgocrazy([2048, 1000, 1000, 1000, 1024, 1], 0.3)
model.cuda()

loss_function = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=0.1)
loss_function2 = nn.BCELoss()

eloss = []
eloss2 = []
for i in range(5):
    optimizer.zero_grad()
    output, fp = model(cv_train_x[0][i].cuda())

    loss = loss_function(output, cv_train_y[0][i].cuda())
    eloss2.append(loss_function2(torch.sigmoid(output), cv_train_y[0][i].cuda()).detach().cpu().numpy())

    loss.backward()
    optimizer.step()
    eloss.append(loss.detach().cpu().numpy())

Then my eloss2 (BCELoss) looks like this:

[array(0.69223803, dtype=float32),
 array(12.088572, dtype=float32),
 array(11.225102, dtype=float32),
 array(9.498163, dtype=float32),
 array(17.269388, dtype=float32)]

But my eloss (BCEWithLogitsLoss) looks like this:

[array(0.69223803, dtype=float32),
 array(8991448., dtype=float32),
 array(1325363., dtype=float32),
 array(998557.4, dtype=float32),
 array(9826264., dtype=float32)]

I have no idea what causes that behavior.

I would recommend e.g. lowering the learning rate to avoid the exploding loss.

Your outputs might be saturating, so that sigmoid + nn.BCELoss runs into its numerical limits.
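
To illustrate with a minimal sketch (just a single hand-picked logit, not your model): once sigmoid saturates to exactly 1.0 in float32, nn.BCELoss has nothing meaningful left to work with, while nn.BCEWithLogitsLoss computes the loss from the raw logit:

import torch
import torch.nn.functional as F

logit = torch.tensor([50.0])    # a saturated output
target = torch.tensor([0.0])

p = torch.sigmoid(logit)
print(p)  # tensor([1.]) -- sigmoid has saturated in float32

# BCELoss now has to evaluate log(1 - p) = log(0), so it can only return a clamped value,
print(F.binary_cross_entropy(p, target))

# while BCEWithLogitsLoss works on the logit directly (log-sum-exp trick) and
# returns the mathematically correct value of ~50.
print(F.binary_cross_entropy_with_logits(logit, target))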


Okay, thanks.

I purposely set it that high for hyperopt, but I guess I have to conclude that that is way too high.
Do such problems also affect the performance of batchnorm layers? I have also had issues where my lr was 0.1, and using model.eval() also led to an exploding validation loss.
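
What I mean is something like this hypothetical check, assuming the batchnorm lines in forward() are enabled and model is the trained MLP from above:

# Hypothetical check: exploding activations during training also push the
# running statistics that model.eval() switches to.
bn = model.bn1
print(bn.running_mean.abs().max(), bn.running_var.max())

If those running statistics end up very large, I would expect eval-mode normalization to behave very differently from train-mode normalization, which could explain the exploding validation loss.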

Hi Janosch!

This looks odd to me. I am not surprised that you get different results for
eloss (BCEWithLogitsLoss()) and eloss2 (BCELoss (sigmoid())),
and I agree with @ptrblck that BCELoss (sigmoid()) is prone to running
into numerical instabilities. But your result looks backwards, in that the
BCEWithLogitsLoss() looks “worse.”

(Just to make sure we’re on the same page, yes, the two versions of
the loss are mathematically equivalent – they differ in their numerical
errors, with BCELoss (sigmoid()) being systematically worse.)
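
For concreteness, here is a small numerical sketch of that equivalence in a current pytorch version; the max / abs rewrite below is the usual log-sum-exp formulation, not necessarily the exact kernel pytorch uses:

import torch

x = torch.tensor([-30.0, -5.0, 0.0, 5.0, 30.0])   # logits
y = torch.tensor([1.0, 1.0, 1.0, 0.0, 0.0])       # targets

# "naive" version: sigmoid followed by the textbook cross-entropy formula
naive = -(y * torch.log(torch.sigmoid(x)) + (1 - y) * torch.log(1 - torch.sigmoid(x)))

# numerically stable rewrite (log-sum-exp trick):
#     loss = max(x, 0) - x * y + log(1 + exp(-|x|))
stable = torch.clamp(x, min=0) - x * y + torch.log1p(torch.exp(-x.abs()))

print(naive)   # the last entry is inf: sigmoid(30) saturates to 1.0 in float32
print(stable)  # stays finite and accurate
print(torch.nn.functional.binary_cross_entropy_with_logits(x, y, reduction='none'))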

Here is a (version 0.3.0) script that simply runs a single number through
both versions of the loss:

import torch
torch.__version__

for  t in range (2):
    print ('t =', t)
    tf = torch.autograd.Variable (torch.FloatTensor ([t]), requires_grad = False)
    td = torch.autograd.Variable (torch.DoubleTensor ([t]), requires_grad = False)
    for  i in range (-50, 51, 5):
        pf = torch.autograd.Variable (torch.FloatTensor ([i]), requires_grad = False)
        pd = torch.autograd.Variable (torch.DoubleTensor ([i]), requires_grad = False)
        f1 = torch.nn.functional.binary_cross_entropy_with_logits (pf, tf)
        d1 = torch.nn.functional.binary_cross_entropy_with_logits (pd, td)
        f2 = torch.nn.functional.binary_cross_entropy (torch.nn.functional.sigmoid (pf), tf)
        d2 = torch.nn.functional.binary_cross_entropy (torch.nn.functional.sigmoid (pd), td)
        print (i, f1.data[0], d1.data[0], f2.data[0], d2.data[0])


And here is the output:

>>> torch.__version__
'0.3.0b0+591e73e'
...
t = 0
-50 0.0 0.0 -1.000088900582341e-12 -1.000088900581841e-12
-45 0.0 0.0 -1.000088900582341e-12 -1.000088900581841e-12
-40 0.0 0.0 -1.000088900582341e-12 -1.000088900581841e-12
-35 0.0 6.661338147750937e-16 -9.99422766767566e-13 -9.994227667670665e-13
-30 0.0 9.348077867343381e-14 -9.063860773039778e-13 -9.06386077303567e-13
-25 0.0 1.388800185954457e-11 1.2887912959058667e-11 1.2887912959141716e-11
-20 0.0 2.0611536900435727e-09 2.0601538253117724e-09 2.060153605389284e-09
-15 3.576278118089249e-07 3.0590227379725525e-07 3.0590126698371023e-07 3.0590127369244887e-07
-10 4.541770613286644e-05 4.5398899216870535e-05 4.5398901420412585e-05 4.539889821679735e-05
-5 0.006715348921716213 0.006715348489117967 0.006715348456054926 0.006715348488111341
0 0.6931471824645996 0.6931471805599453 0.6931471824645996 0.6931471805579453
5 5.006715297698975 5.006715348489118 5.00671911239624 5.006715348339723
10 10.000045776367188 10.000045398899218 9.99958610534668 10.000045376872722
15 15.0 15.000000305902274 14.843770027160645 14.999997036667164
20 20.0 20.000000002061153 27.63102149963379 19.9995149186454
25 25.0 25.000000000013888 27.63102149963379 24.930465471679888
30 30.0 30.000000000000092 27.63102149963379 27.54165513275042
35 35.0 35.0 27.63102149963379 27.630355203882424
40 40.0 40.0 27.63102149963379 27.631021115928547
45 45.0 45.0 27.63102149963379 27.631021115928547
50 50.0 50.0 27.63102149963379 27.631021115928547
t = 1
-50 50.0 50.0 27.63102149963379 27.631021115735674
-45 45.0 45.0 27.63102149963379 27.631021087303363
-40 40.0 40.0 27.631017684936523 27.631016867583316
-35 35.0 35.0 27.630390167236328 27.63039080294151
-30 30.0 30.000000000000092 27.541568756103516 27.541567845574413
-25 25.0 25.000000000013888 24.930469512939453 24.930469367097377
-20 20.0 20.000000002061153 19.999515533447266 19.999514954519324
-15 15.0 15.000000305902274 14.99999713897705 14.999997036889244
-10 10.000045776367188 10.000045398899218 10.000045776367188 10.000045376871752
-5 5.006715297698975 5.006715348489118 5.006715297698975 5.006715348339704
0 0.6931471824645996 0.6931471805599453 0.6931471824645996 0.6931471805579453
5 0.006715348921716213 0.006715348489117967 0.006715324241667986 0.006715348488111229
10 4.541770613286644e-05 4.5398899216870535e-05 4.541976886685006e-05 4.539889821679735e-05
15 3.576278118089249e-07 3.0590227379725525e-07 3.5762693073593255e-07 3.059012738034712e-07
20 0.0 2.0611536900435727e-09 -1.000088900582341e-12 2.0601537164115865e-09
25 0.0 1.388800185954457e-11 -1.000088900582341e-12 1.288802398144418e-11
30 0.0 9.348077867343381e-14 -1.000088900582341e-12 -9.066081219084919e-13
35 0.0 6.661338147750937e-16 -1.000088900582341e-12 -9.994227667670665e-13
40 0.0 0.0 -1.000088900582341e-12 -1.000088900581841e-12
45 0.0 0.0 -1.000088900582341e-12 -1.000088900581841e-12
50 0.0 0.0 -1.000088900582341e-12 -1.000088900581841e-12

As I would expect, BCELoss (sigmoid()) goes wrong (e.g., gives a
negative value) for values for which BCEWithLogitsLoss() remains
correct. So I’m surprised that your BCELoss (sigmoid()) results look
more sensible.

(I will note that in my test there are a few cases in which
BCELoss (sigmoid()) better agrees with the Double version
of BCEWithLogitsLoss() than does the Float version of
BCEWithLogitsLoss(). I don’t know what to make of this.)

Best.

K. Frank

Hi Frank,

thanks for the elaborate answer. I guess my takeaway is to lower the lr and not use BCELoss :slight_smile: