Custom loss functions

Don’t use the .data attribute, as it might yield these unwanted side effects.
Is your code working now or are you still stuck?
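
To illustrate the kind of side effect meant above, here is a minimal sketch (the tensors are made up for this example and are not taken from the code in this thread). An in-place change made through .data is invisible to autograd and silently yields wrong gradients, while the same change through .detach() is detected:

```python
import torch

# Using .data: the in-place change is invisible to autograd.
a = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
out = a.sigmoid()
out.data.zero_()                 # overwrites the values sigmoid saved for backward
out.sum().backward()             # no error is raised ...
grad_via_data = a.grad.clone()   # ... but the gradient is silently all zeros

# Using .detach(): the same in-place change is detected.
b = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
out = b.sigmoid()
out.detach().zero_()             # shares the version counter with `out`
try:
    out.sum().backward()
    detach_raised = False
except RuntimeError:
    detach_raised = True         # autograd reports the in-place modification
```

If a tensor really has to be manipulated outside of autograd, .detach() or a torch.no_grad() block is usually the safer choice.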

In the second run, the output of
y = autoencoder(x_t,judge=True)
becomes NaN:
y:tensor([[[nan, nan, nan, …, nan, nan, nan]]], grad_fn=)

Check if the normalization inside the model (in particular the torch.std(x) output) yields valid values and no Infs or NaNs.
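
A rough sketch of such a check, assuming the model standardizes its input by dividing by torch.std(x) (the actual normalization code is not shown in this thread, and safe_normalize is a hypothetical helper name):

```python
import torch

def safe_normalize(x, eps=1e-8):
    # Hypothetical helper: standardize x and fail loudly on invalid values.
    std = torch.std(x)
    if not torch.isfinite(std):
        raise ValueError(f"std is not finite: {std.item()}")
    out = (x - x.mean()) / (std + eps)   # eps avoids a division by zero
    if not torch.isfinite(out).all():
        raise ValueError("normalized output contains Inf or NaN")
    return out

constant = torch.ones(4)                   # std is 0 for a constant input
unsafe = constant / torch.std(constant)    # 1 / 0 -> inf
safe = safe_normalize(constant)            # stays finite thanks to eps
```

A constant (or single-element) input is a typical way to end up with a zero or invalid standard deviation.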

I don’t think that is the problem. In the second run, I debugged the code and found that the weights of all the nn.Conv1d and nn.ConvTranspose1d layers had become NaN.

ipdb> self.enc1.weight.data
tensor([[[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan]],
But in the first run, all the weights hold valid values.

Correction: “run” here actually means epoch, i.e. “in the second epoch” and “in the first epoch”.
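
To narrow down when the weights turn invalid, a small helper like the following could be called after every optimizer step. This is only a sketch: nonfinite_parameters is a hypothetical helper, and the tiny model below merely stands in for the autoencoder from this thread.

```python
import torch
import torch.nn as nn

def nonfinite_parameters(model):
    # Return the names of all parameters that contain NaN or Inf.
    return [name for name, p in model.named_parameters()
            if not torch.isfinite(p).all()]

model = nn.Sequential(nn.Conv1d(1, 4, 3), nn.ConvTranspose1d(4, 1, 3))
clean = nonfinite_parameters(model)        # nothing reported for a fresh model

# Simulate a bad update (e.g. caused by a NaN gradient).
with torch.no_grad():
    model[0].weight.fill_(float("nan"))
broken = nonfinite_parameters(model)       # reports the corrupted weight
```

Enabling torch.autograd.set_detect_anomaly(True) during debugging can additionally point to the backward operation that first produced a NaN.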

Hi ptrblck,

Sorry, I need to use a p-Sigmoid function.
I wrote it below. Could you please help me improve it? I think removing the for loops would make it faster.
I appreciate your help.


    def forward(self,input):
        Sigma=self.Sigma
        Beta=self.Beta
        Alpha=self.Alpha
        Out=torch.zeros(input.shape)
        for ii in range(input.shape[0]):
            for ii1 in range(input.shape[1]):
                for ii2 in range(input.shape[2]):
                    for ii3 in range(input.shape[3]):
                        vv=input[ii,ii1,ii2,ii3]
   
                        Out[ii,ii1,ii2,ii3]=Sigma*(1/(1+torch.exp((-1*(vv*Alpha))+Beta)))

        return Out

I don’t know the shapes of the used tensors, but assuming Sigma, Beta, and Alpha are all scalar tensors, you can directly use the mentioned formula without the loops:

input = torch.randn(10, 10, 10, 10)
Sigma = torch.randn(1)
Beta = torch.randn(1)
Alpha = torch.randn(1)
Out = torch.zeros(input.shape)
for ii in range(input.shape[0]):
    for ii1 in range(input.shape[1]):
        for ii2 in range(input.shape[2]):
            for ii3 in range(input.shape[3]):
                vv=input[ii,ii1,ii2,ii3]
                Out[ii,ii1,ii2,ii3]=Sigma*(1/(1+torch.exp((-1*(vv*Alpha))+Beta)))

out_fast = Sigma*(1/(1+torch.exp((-1*(input*Alpha))+Beta)))
print((out_fast == Out).all())
> True

Many thanks. Indeed, I want to customize the Sigmoid, so I wrote my own class.

I would appreciate it if you could take a look, as I need to be sure about the code.

class SigmoiidLearn(nn.Module):
    def __init__(self):
        super(SigmoiidLearn, self).__init__()
        
        self.Alpha = nn.Parameter(torch.ones(1), requires_grad=True)
        self.Beta = nn.Parameter(torch.zeros(1),requires_grad=True)   
        self.Sigma = nn.Parameter(torch.ones(1),requires_grad=True)

    def forward(self,input):
        Sigma=self.Sigma
        Beta=self.Beta
        Alpha=self.Alpha
        Out=torch.zeros(input.shape)
        for ii in range(input.shape[0]):
            for ii1 in range(input.shape[1]):
                for ii2 in range(input.shape[2]):
                    for ii3 in range(input.shape[3]):
                        vv=input[ii,ii1,ii2,ii3]

                        Out[ii,ii1,ii2,ii3]=Sigma*(1/(1+torch.exp((-1*(vv*Alpha))+Beta)))

        return Out

class Generator(nn.Module):
    def __init__(self,ngpu,nz,ngf):
        super(Generator, self).__init__()
        self.ngpu=ngpu
        self.nz=nz
        self.ngf=ngf
           
        self.l1= nn.Sequential( nn.ConvTranspose2d(self.nz, self.ngf * 8, 3, 1, 0, bias=False),
            nn.BatchNorm2d(self.ngf * 8),
            nn.ReLU(True))
        self.l2=nn.Sequential(nn.ConvTranspose2d(self.ngf * 8, self.ngf * 4, 3, 1, 0, bias=False),
            nn.BatchNorm2d(self.ngf * 4),
            nn.ReLU(True))      # state size. (ngf*4) x 8 x 8
        self.l3=nn.Sequential( nn.ConvTranspose2d( self.ngf * 4, self.ngf * 2, 3, 1, 0, bias=False),
            nn.BatchNorm2d(self.ngf * 2),
            nn.ReLU(True))
        self.l4= nn.Sequential( nn.ConvTranspose2d( self.ngf*2, 1, 3, 1, 0, bias=False), nn.BatchNorm2d(1))

        self.l5=nn.Sequential(SigmoiidLearn())

    # forward method
    def forward(self, input1):
        x = self.l1(input1)
        x =  self.l2(x)
        x = self.l3(x)
        x = self.l4(x)
        x = self.l5(x)
        return x
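
Applying the loop-free formula from the earlier reply to this module could look like the sketch below. SigmoidLearnVectorized is a hypothetical name, and the sketch assumes Alpha, Beta, and Sigma stay scalar parameters as in the class above. It uses the identity Sigma / (1 + exp(-Alpha * x + Beta)) = Sigma * sigmoid(Alpha * x - Beta).

```python
import torch
import torch.nn as nn

class SigmoidLearnVectorized(nn.Module):
    # Hypothetical vectorized variant of SigmoiidLearn (same three parameters).
    def __init__(self):
        super().__init__()
        self.Alpha = nn.Parameter(torch.ones(1))
        self.Beta = nn.Parameter(torch.zeros(1))
        self.Sigma = nn.Parameter(torch.ones(1))

    def forward(self, input):
        # Broadcasting applies the scalar parameters to every element at once.
        return self.Sigma * torch.sigmoid(self.Alpha * input - self.Beta)

layer = SigmoidLearnVectorized()
x = torch.randn(2, 1, 9, 9)
out = layer(x)
out.sum().backward()   # gradients reach Alpha, Beta, and Sigma
```

Besides being faster, this version keeps the output on the same device as the input, whereas torch.zeros(input.shape) in the loop version always allocates on the default device.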

This is great, @ptrblck. Can you share a similar dummy function for binary cross entropy (BCELoss)? Thanks a lot :slightly_smiling_face:

Sure, here is the raw implementation rewritten directly from the docs as well as the stable internal implementation, which doesn’t overflow for large values:

import torch
import torch.nn as nn
import torch.nn.functional as F

def my_bce_with_logits_loss(x, y):
    loss = -1.0 * (y * F.logsigmoid(x) + (1 - y) * torch.log(1 - torch.sigmoid(x)))
    loss = loss.mean()
    return loss

def my_bce_with_logits_loss_stable(x, y):
    max_val = (-x).clamp_min_(0)
    loss = (1 - y) * x + max_val + torch.log(torch.exp(-max_val) + torch.exp(-x - max_val))
    loss = loss.mean()
    return loss


criterion = nn.BCEWithLogitsLoss()

batch_size = 5
nb_classes = 1

# small values
x = torch.randn(batch_size, nb_classes, requires_grad=True)
y = torch.empty(batch_size, nb_classes).uniform_(0, 1)

loss_reference = criterion(x, y)
loss = my_bce_with_logits_loss(x, y)
loss_stable = my_bce_with_logits_loss_stable(x, y)

print(loss_reference)
> tensor(1.0072, grad_fn=<BinaryCrossEntropyWithLogitsBackward>)

print(loss_reference - loss)
> tensor(0., grad_fn=<SubBackward0>)

print(loss_reference - loss_stable)
> tensor(0., grad_fn=<SubBackward0>)


# large values
x = torch.randn(batch_size, nb_classes, requires_grad=True) * 100
y = torch.empty(batch_size, nb_classes).uniform_(0, 1)

loss_reference = criterion(x, y)
loss = my_bce_with_logits_loss(x, y)
loss_stable = my_bce_with_logits_loss_stable(x, y)

print(loss_reference)
> tensor(12.1431, grad_fn=<BinaryCrossEntropyWithLogitsBackward>)

print(loss_reference - loss)
> tensor(-inf, grad_fn=<SubBackward0>)

print(loss_reference - loss_stable)
> tensor(0., grad_fn=<SubBackward0>)

Thanks a lot, thank you so much.

I have a question on this, if you don’t mind: the stable version uses the logsumexp trick. Is that like using softmax with 2 classes?

My other question is about identifying which part of the stable version corresponds to (1 - y) * torch.log(1 - torch.sigmoid(x)). In other words, is (1 - y) * torch.log(1 - torch.sigmoid(x)) stable on its own, or is there a better way to implement it?
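
Both points can be checked numerically. The sketch below uses made-up inputs and hard 0/1 targets (the comparison with cross entropy only holds for hard labels, not for the soft targets used in the example above). It suggests that torch.log(1 - torch.sigmoid(x)) is not stable on its own for large x, that F.logsigmoid(-x) is a stable equivalent, and that the sigmoid can be seen as a 2-class softmax over the logits [0, x]:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-100.0, -1.0, 0.0, 1.0, 100.0])
y = torch.tensor([0.0, 1.0, 0.0, 1.0, 0.0])

# Naive term: sigmoid(100) rounds to 1 in float32, so log(1 - 1) = -inf.
naive_term = torch.log(1 - torch.sigmoid(x))
# Stable equivalent, since 1 - sigmoid(x) == sigmoid(-x).
stable_term = F.logsigmoid(-x)

# Sigmoid as a 2-class softmax: softmax([0, x])[1] == sigmoid(x).
two_class_logits = torch.stack([torch.zeros_like(x), x], dim=1)
ce = F.cross_entropy(two_class_logits, y.long())
bce = F.binary_cross_entropy_with_logits(x, y)
```

In the stable implementation, the term max_val + log(exp(-max_val) + exp(-x - max_val)) is the shifted form of log(1 + exp(-x)), i.e. a logsumexp over the two values 0 and -x.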

Hi @ptrblck, I have two questions:

  1. When I multiply the standard MSE loss by a negative value and then call backward, will training increase the distance between pred and label, or will it still minimize it?

dummy code:

loss = torch.nn.MSELoss(reduction="none")
loss_v = -10 * loss(pred,label).mean()
loss_v.backward()

I mean, will the .backward() function automatically ignore the -10 outside the loss during backpropagation?

  2. I have two different models whose outputs are pred1 and pred2, with labels label1 and label2. When I apply a loss function (such as torch.nn.CrossEntropyLoss()) to them and call backward on each loss separately, do I need to instantiate the loss class twice, or can they share the same instance without affecting each other?

dummy code:

pred1 = model1(input1)
pred2 = model2(input2)

# option1
metric = torch.nn.CrossEntropyLoss()
loss1 = metric(pred1,label1)
loss2 = metric(pred2,label2)

# option2
metric1 = torch.nn.CrossEntropyLoss()
metric2 = torch.nn.CrossEntropyLoss()
loss1 = metric1(pred1,label1)
loss2 = metric2(pred2,label2)

loss1.backward()
loss2.backward()

Should I use option1 or option2? If both models share the same metric function during backpropagation, will they affect each other’s gradients?

I look forward to your reply, thanks.

  1. The training will try to lower the loss, so if you negate the loss value, the model will diverge and will try to make the loss “more negative”. Here is a simple example, which shows that the expected parameter will just get more negative values, as it decreases the loss:
torch.manual_seed(1234)

# standard approach
x = torch.ones(1, 1)
y = torch.ones(1, 1) * 2

lin = nn.Linear(1, 1, bias=False)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(lin.parameters(), lr=1e-3)

for epoch in range(6000):
    optimizer.zero_grad()
    out = lin(x)
    loss = criterion(out, y)
    loss.backward()
    optimizer.step()
    print('epoch {}, loss {}'.format(epoch, loss.item()))
    print(out, y)


# negative loss
x = torch.ones(1, 1)
y = torch.ones(1, 1) * 2

lin = nn.Linear(1, 1, bias=False)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(lin.parameters(), lr=1e-3)

for epoch in range(6000):
    optimizer.zero_grad()
    out = lin(x)
    loss = criterion(out, y)
    loss = -1. * loss
    loss.backward()
    optimizer.step()
    print('epoch {}, loss {}'.format(epoch, loss.item()))
    print(out, y)
  2. If the criterion doesn’t contain any internal states (such as buffers etc.), then you could just reuse it (as is the case for nn.CrossEntropyLoss).
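
As a small additional check on the first question (a sketch with made-up tensors), the constant factor is part of the computation graph, so backward does not ignore it: the gradient is scaled by -10 and therefore points in the opposite direction. The same criterion instance is reused for both calls without any interference.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
label = torch.randn(4, 3)
criterion = nn.MSELoss(reduction="none")   # stateless, so it can be shared

pred_pos = torch.zeros(4, 3, requires_grad=True)
criterion(pred_pos, label).mean().backward()

pred_neg = torch.zeros(4, 3, requires_grad=True)
(-10 * criterion(pred_neg, label).mean()).backward()

# The gradient of the scaled loss is exactly -10 times the original gradient.
same_up_to_factor = torch.allclose(pred_neg.grad, -10 * pred_pos.grad)
```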

Hi @ptrblck, thanks for your clear and quick reply.
In training, I did what you described in answer 1. Actually, I have 3 losses, combined as loss1+loss2-loss3.

dummy code:

loss1_ = AverageMeter()
loss2_ = AverageMeter()
loss3_ = AverageMeter()

criterion1 = torch.nn.CrossEntropyLoss()
criterion2 = torch.nn.MSELoss(reduction="none")

model = ResNet18() # backbone
fc = nn.Linear() # classification layer connected to the backbone

for loop:
    feature1 = model(input1) # input a pair of images to the same backbone, input1 and input2 are two images in a same category
    feature2 = model(input2) 

    pred1 = fc(feature1) # classification layer is also shared
    pred2 = fc(feature2) 

    loss1_v = criterion1(pred1,label)
    loss2_v = criterion1(pred2,label)
    
    loss1_v.backward(retain_graph=True) # here, i can't find a way to avoid the use of retain_graph=True, because i need to reuse the graph later, do you have any suggestions?
    loss2_v.backward(retain_graph=True)

    loss3_v = -10 * criterion2(feature1,feature2).mean() # if i maximize the extracted feature difference
    loss3_v.backward()

    loss1_.update(loss1_v, 1)
    loss2_.update(loss2_v, 1)
    loss3_.update(loss3_v, 1)

However, during training, loss1 and loss2 keep getting smaller, but loss3 keeps getting bigger (‘less negative’). I don’t know why, because you also mentioned that the model will try to make the loss “more negative”.

Here is a snippet of the printed losses:

Training: 2021-05-10 19:47:32,997-Speed 1291.78 samples/sec   Loss1 38.0934   Loss2 38.0780   Loss3 -0.0298   Epoch: 0   Global Step: 500   Required: 5 hours
Training: 2021-05-10 19:56:11,611-Speed 918.99 samples/sec   Loss1 26.1710   Loss2 26.1691   Loss3 -0.0248   Epoch: 1   Global Step: 1000   Required: 5 hours
Training: 2021-05-10 20:04:47,075-Speed 1287.83 samples/sec   Loss1 20.0595   Loss2 20.0388   Loss3 -0.0190   Epoch: 2   Global Step: 1500   Required: 5 hours
Training: 2021-05-10 20:13:04,078-Speed 1288.24 samples/sec   Loss1 18.3962   Loss2 18.3840   Loss3 -0.0158   Epoch: 2   Global Step: 2000   Required: 5 hours
Training: 2021-05-10 20:21:21,709-Speed 1290.39 samples/sec   Loss1 17.2669   Loss2 17.2503   Loss3 -0.0138   Epoch: 3   Global Step: 2500   Required: 5 hours
Training: 2021-05-10 20:22:11,298-Speed 1290.64 samples/sec   Loss1 17.1906   Loss2 17.1985   Loss3 -0.0137   Epoch: 3   Global Step: 2550   Required: 5 hours
Training: 2021-05-10 20:30:17,556-Speed 1286.20 samples/sec   Loss1 16.4898   Loss2 16.4976   Loss3 -0.0125   Epoch: 4   Global Step: 3000   Required: 5 hours
Training: 2021-05-10 20:38:35,820-Speed 1287.99 samples/sec   Loss1 15.7010   Loss2 15.6975   Loss3 -0.0115   Epoch: 5   Global Step: 3500   Required: 5 hours
Training: 2021-05-10 20:46:52,820-Speed 1289.09 samples/sec   Loss1 15.9737   Loss2 15.9601   Loss3 -0.0107   Epoch: 5   Global Step: 4000   Required: 5 hours
Training: 2021-05-10 20:55:48,491-Speed 1290.52 samples/sec   Loss1 15.7084   Loss2 15.6894   Loss3 -0.0102   Epoch: 6   Global Step: 4500   Required: 5 hours

So my questions are:

  1. When the model is shared and is optimized to enlarge the feature difference, why does the loss keep getting bigger?
  2. When I want to use a 3rd loss to constrain the feature difference rather than the prediction, is there a way to avoid retain_graph=True for more efficient computation?

Thank you!

  1. I can’t comment on the use case and if the general usage of the loss functions is appropriate, but reducing one loss (and updating the parameters with these gradients) might increase another loss.

  2. You would need to use retain_graph=True if you are calling backward multiple times in order to avoid deleting the intermediate tensors, which are needed to compute the gradients. In case you could create a single loss (e.g. by adding all separate losses) and could thus call backward once, you wouldn’t need to use retain_graph=True.
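
A minimal sketch of the single-backward approach, with small linear layers standing in for the ResNet18 backbone and the classification layer (all shapes and data here are made up). The summed loss can contain terms computed from different layers' outputs, since autograd routes each term's gradient to the parameters it depends on:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
backbone = nn.Linear(8, 4)   # stand-in for the ResNet18 backbone
fc = nn.Linear(4, 3)         # shared classification layer
ce = nn.CrossEntropyLoss()
mse = nn.MSELoss()

input1, input2 = torch.randn(5, 8), torch.randn(5, 8)
label = torch.randint(0, 3, (5,))

feature1, feature2 = backbone(input1), backbone(input2)
loss1 = ce(fc(feature1), label)          # depends on fc and backbone
loss2 = ce(fc(feature2), label)          # depends on fc and backbone
loss3 = -10 * mse(feature1, feature2)    # depends on the backbone only

total = loss1 + loss2 + loss3
total.backward()             # one backward pass, no retain_graph needed
```

The resulting gradients match what the three separate backward calls with retain_graph=True would accumulate, and the individual loss values can still be logged separately.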

Thank you @ptrblck! But in my case, regarding your answer 2,

I can’t directly sum losses 1, 2, and 3 because, as the dummy code shows, loss1 and loss2 constrain the fc layer’s output, while loss3 constrains the backbone’s output.

Do you mean that I can sum the different loss and then backward once, even if they constrain the output of different layers?

Hi @ptrblck, I want to add a learnable parameter that is multiplied with the CELoss. How can I do this? :pray:

You could create an nn.Parameter and directly multiply it with the loss output.
However, note that scaling the loss with a trainable parameter might make the parameter just negative to “reduce the loss” while the actual loss value could be increasing, so you have to double check your use case and make sure the training cannot “cheat” using this parameter.

Thank you @ptrblck, that is a very kind reminder :grin:

About the nn.Parameter, I am not clear on how to register this new parameter with the optimizer?
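
One possible way, sketched here with a made-up model and data (loss_weight is a hypothetical name), is to pass the new parameter to the optimizer together with the model parameters:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 3)
loss_weight = nn.Parameter(torch.ones(1))   # hypothetical learnable loss scale
criterion = nn.CrossEntropyLoss()

# Hand the extra parameter to the optimizer alongside the model parameters.
optimizer = torch.optim.SGD(list(model.parameters()) + [loss_weight], lr=0.1)

x, target = torch.randn(8, 4), torch.randint(0, 3, (8,))
optimizer.zero_grad()
raw_loss = criterion(model(x), target)
loss = (loss_weight * raw_loss).sum()       # scaled loss as a scalar
loss.backward()
optimizer.step()

weight_after = loss_weight.item()           # smaller than the initial 1.0
```

Alternatives are optimizer.add_param_group({"params": [loss_weight]}) for an existing optimizer, or defining the parameter as an attribute of an nn.Module so that it shows up in .parameters(). Note that the single step above already shrinks the weight, which is the "cheating" behavior mentioned earlier.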