Build your own loss function in PyTorch

I can’t agree more

Writing a loss function is no different from writing a neural network, or an autograd function.

Here’s an example of writing a mean-square-error loss function:

def mse_loss(input, target):
    # Mean of the squared element-wise differences.
    return ((input - target) ** 2).sum() / input.numel()

@smth But will this version be able to backpropagate? I think we need to perform those operations on autograd Variables?

Yes, smth’s function is taking Variables as input. So you will be able to backpropagate.
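
To make that concrete, here is a minimal sketch that checks the gradient produced by a hand-written MSE loss. It assumes a current PyTorch release (0.4 or later), where Variable has been merged into Tensor, so a tensor created with requires_grad=True plays the role of the Variable. The toy values in pred and target are made up for illustration.

```python
import torch

def mse_loss(input, target):
    # Mean of the squared element-wise differences.
    return ((input - target) ** 2).sum() / input.numel()

# Hypothetical toy data: `pred` stands in for a network output.
pred = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
target = torch.tensor([1.0, 0.0, 0.0])

loss = mse_loss(pred, target)
loss.backward()  # autograd differentiates through the plain Python function

# Analytic gradient of the mean squared error: 2 * (pred - target) / N
expected_grad = 2 * (pred.detach() - target) / pred.numel()
print(torch.allclose(pred.grad, expected_grad))
```

No custom backward function is needed here, because every operation inside mse_loss is already differentiable by autograd.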

Hi all,

I struggled with this myself, so I’ve started building a tutorial for such stuff in PyTorch. You can find a section on custom losses there too (Section 5). Github link -

I wrote this up quickly in my free time so it must have some typos etc. If you think there’s things you would like to see there but are missing, feel free to create an issue on GitHub to make suggestions. Hope this helps!


I have already implemented my own loss in Python, but it is too slow. Are there any tutorials that can teach me how to speed it up? (There is a for loop in my loss.)

If the individual loss for a sample in a batch can be positive or negative depending on some conditions, how do I sum the loss over the samples? It could cancel out to zero if I sum over all the samples within a batch.

Excuse me, have you figured this out? So is it necessary to write a custom backward function and return the gradient yourself?

I would like to use Euclidean loss in PyTorch. I tried writing the formula myself, but it is not working. Is this loss function already available in the PyTorch library? How do I use Euclidean loss in a network? Thank you.

In pytorch it is called MSELoss:

Thank you very much for the help.

In this paper, they have used Euclidean loss for translation and orientation. Can I use the same loss function by using MSELoss for the regression problem?

Hi, if you want the Euclidean loss between x1 and x2,

loss = torch.norm(x1 - x2, 2)

seems like a proper implementation.
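
For anyone comparing the two suggestions, here is a small sketch of how torch.norm(x1 - x2, 2) relates to nn.MSELoss. The tensors are made-up toy values. The norm is the square root of the summed squared differences, while MSELoss (with its default mean reduction) is the summed squared differences divided by the number of elements, so the two differ in scale and in the square root.

```python
import torch
import torch.nn as nn

# Toy tensors chosen only for illustration.
x1 = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
x2 = torch.zeros(2, 2)

# Euclidean (L2) distance over all elements: sqrt(1 + 4 + 9 + 16)
euclidean = torch.norm(x1 - x2, 2)

# Mean squared error with the default reduction: (1 + 4 + 9 + 16) / 4
mse = nn.MSELoss()(x1, x2)

# The squared Euclidean distance divided by the element count equals the MSE.
print(torch.allclose(euclidean ** 2 / x1.numel(), mse))
```

Both have their minimum at x1 == x2, but the gradients are scaled differently, so the choice can affect how the term should be weighted against other losses.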

Hi Adam,
I have read this post several times, but I don’t understand some of the terminology, such as “re-wrap the .data in a new Variable”, “.data unpacking”, and “.data repacking”. Would you mind showing some examples? Thank you so much.

In addition, I have a special requirement for center_loss: I need to set a different weight for each class. So I took the code (shown in the PyTorch examples) and reimplemented this loss function myself.

In the examples:
def get_center_loss(centers, features, target, alpha, num_classes):
    batch_size = target.size(0)
    features_dim = features.size(1)

    target_expand = target.view(batch_size,1).expand(batch_size,features_dim)
    centers_var = Variable(centers)
    centers_batch = centers_var.gather(0,target_expand)
    criterion = nn.MSELoss()
    center_loss = criterion(features,  centers_batch)

    diff = centers_batch - features
    unique_label, unique_reverse, unique_count = np.unique(target.cpu().data.numpy(), return_inverse=True, return_counts=True)
    appear_times = torch.from_numpy(unique_count).gather(0,torch.from_numpy(unique_reverse))
    appear_times_expand = appear_times.view(-1,1).expand(batch_size,features_dim).type(torch.FloatTensor)
    diff_cpu = diff.cpu().data / appear_times_expand.add(1e-6)
    diff_cpu = alpha * diff_cpu
    for i in range(batch_size):
        centers[[i]] -= diff_cpu[i].type(centers.type())

    return center_loss, centers

The call to this function:
center_loss, self.model._buffers['centers'] = get_center_loss(self.model._buffers['centers'], self.model.features, target_var, self.alpha, self.model.num_classes)
softmax_loss = self.criterion(output, target_var)
loss = self.center_loss_weight*center_loss + softmax_loss

My refinement:

self.centers = torch.zeros(num_classes, embedding_size).type(torch.FloatTensor) # 2d tensor
x = self.fc2(x)
self.features = F.relu(x) # 2D tensor

def get_center_loss(self, target, class_weight, alpha):
    batch_size = target.size(0)
    features_dim = self.features.size(1)

    target_expand = target.view(batch_size,1).expand(batch_size,features_dim)

    centers_var = Variable(self.centers)
    centers_batch = centers_var.gather(0,target_expand).cuda()

    abnormal_loss = Variable(torch.FloatTensor([0]), requires_grad=True)
    normal_loss = Variable(torch.FloatTensor([0]), requires_grad=True)
    for i in range(batch_size):
        if[i] == 0:
            #abnormal_loss += torch.sum(([i,:] -[i,:]) **2)
            abnormal_loss = abnormal_loss.clone() + ([i,:] -[i,:]).pow(2).sum()
            #normal_loss += torch.sum(([i,:] -[i,:]) **2)
            normal_loss = normal_loss.clone() + ([i,:] -[i,:]).pow(2).sum()
    center_loss = class_weight[0] * abnormal_loss + class_weight[1] * normal_loss
    center_loss = center_loss/features_dim/batch_size

    diff = centers_batch - self.features

    unique_label, unique_reverse, unique_count = np.unique(target.cpu().data.numpy(), return_inverse=True, return_counts=True)

    appear_times = torch.from_numpy(unique_count).gather(0,torch.from_numpy(unique_reverse))

    appear_times_expand = appear_times.view(-1,1).expand(batch_size,features_dim).type(torch.FloatTensor)

    diff_cpu = diff.cpu().data / appear_times_expand.add(1e-6)
    #∆c_j =(sum_i=1^m δ(yi = j)(c_j - x_i)) / (1 + sum_i=1^m δ(yi = j))
    diff_cpu = alpha * diff_cpu

    for i in range(batch_size):
        #Update the parameters c_j for each j by c^(t+1)_j = c^t_j − α · ∆c^t_j
        self.centers[[i]] -= diff_cpu[i].type(self.centers.type())

    return center_loss, self.centers, colorization_loss/features_dim/batch_size, normal_loss/features_dim/batch_size

Here, class_weight = torch.FloatTensor([10, 100]) # two weight for center_loss (binary classification)

The call:
criterion = nn.CrossEntropyLoss().cuda()
prediction = model(data_var)

center_loss, xx, abnormal_loss, normal_loss = model.get_center_loss(target_var, class_weight, args.alpha)
classfier_loss = criterion(prediction.cuda(), target_var.cuda())
loss = center_loss.cuda() + classfier_loss
# compute gradient and update weights

Is this code right? Can it backpropagate to change the weights of the model?

The .clone() here is unnecessary. The addition operation clones the Variable, so you don’t have to do so explicitly. In fact if you do, you just add an extra copy in memory. That said, if you were to do an inplace addition +=, then using .clone() might be necessary, but even then, I would wait until PyTorch complained about the inplace operation.

If I understand correctly, the actual loss that needs to be backpropagated is center_loss.
Now center_loss = weighted sum of abnormal_loss and normal_loss so gradients can flow back up to abnormal_loss and normal_loss.
But both of those are calculated from Tensors, not from Variables, so the gradients will go no further. Try this instead…

abnormal_loss = abnormal_loss + (self.features[i,:] - centers_batch[i,:]).pow(2).sum()

Same for normal_loss

normal_loss = normal_loss + (self.features[i,:] - centers_batch[i,:]).pow(2).sum()
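
As a rough illustration of the two points above, the sketch below accumulates a loss in a loop with plain out-of-place addition (no .clone()) and then repeats the computation on detached tensors. It assumes a current PyTorch release, where .detach() is the usual replacement for reading .data. The names features and centers_batch are stand-ins for self.features and the gathered centers, filled with made-up values.

```python
import torch

torch.manual_seed(0)
features = torch.randn(4, 3, requires_grad=True)  # stand-in for self.features
centers_batch = torch.zeros(4, 3)                 # stand-in for the gathered centers

# Accumulate with out-of-place addition; no .clone() is needed.
loss = torch.zeros(())
for i in range(features.size(0)):
    loss = loss + (features[i, :] - centers_batch[i, :]).pow(2).sum()
loss.backward()
grad_connected = features.grad.clone()  # equals 2 * features, since the centers are zero

# Same arithmetic on detached tensors: the result is cut off from the graph,
# so there is nothing for backward() to propagate into `features`.
detached_loss = torch.zeros(())
for i in range(features.size(0)):
    detached_loss = detached_loss + (features.detach()[i, :] - centers_batch[i, :]).pow(2).sum()

print(loss.requires_grad, detached_loss.requires_grad)
```

The two losses have the same value, but only the first one is connected to features, which is why the model weights would never be updated through the detached version.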

I want to define a loss function that is defined piece-wise while both pieces “touch” each other. However, the gradient at the point where they touch is not the same. Is this possible or must the gradient be identical from both directions? If yes, I could possibly also modify the definition of the function, but if it’s not required I would avoid this

Thanks, that’s helpful. It would be nice for a beginner like me if you could show the usage of this custom loss function in an example. I wanted to cross-check how the inputs x and y are used when Regress_Loss is called.

If I just define the loss function as shown here, I am not able to send it to CUDA, i.e. mse_loss.cuda() fails with the traceback: AttributeError: 'function' object has no attribute 'cuda'
I am clearly doing something wrong… I guess I am not defining something correctly. Could anyone please help? I am quite a beginner in pytorch, so I could learn a lot from this.

You don’t have to push it to GPU.
This is only required for modules having an internal state (and tensors). The provided function is a plain python function and works on GPU as long as both input tensors are on (the same) GPU.
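
A minimal sketch of what that looks like in practice: the loss function itself is never moved anywhere, only the tensors are. The device selection line is an assumption added so the snippet also runs on a machine without a GPU.

```python
import torch

def mse_loss(input, target):
    # Plain Python function: it has no state, so there is nothing to move to the GPU.
    return ((input - target) ** 2).mean()

# Falls back to CPU when CUDA is not available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Both tensors are created on the same device.
input = torch.ones(2, 3, device=device, requires_grad=True)
target = torch.zeros(2, 3, device=device)

loss = mse_loss(input, target)
loss.backward()
print(loss.device, input.grad.device)
```

The resulting loss and the gradient live on whatever device the inputs were on.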


I am facing a backprop issue in a custom loss function. I probably know the reason, but I am not sure how to solve it. I implemented the quadratic weighted kappa loss (I ported this code to PyTorch).

class WeightedKappaLoss(torch.nn.Module):
    def __init__(self):
    def forward(self, preds, true,nb_classes = None):
        nb_classes = preds.shape[1]
        _,pred = torch.max(preds.view(1,-1).type(torch.cuda.FloatTensor),1)
        pred = pred.type(torch.cuda.FloatTensor)
        pred.requires_grad = True
        confusion_matrix = torch.empty([nb_classes, nb_classes],requires_grad = True)
        for t, p in zip(pred.view(-1), true.view(-1)):
            confusion_matrix[p.long(), t.long()] += 1
        weights = torch.empty([nb_classes,nb_classes],requires_grad = True)
        for i in range(len(weights)):
            for j in range(len(weights)):
                weights[i][j] = float(((i-j)**2)/(len(weights)-1)**2)
        true_hist= torch.empty([nb_classes],requires_grad = True)
        for item in true: 
        pred_hist=torch.empty([nb_classes],requires_grad = True)
        for item in pred: 
        E = torch.ger(true_hist,pred_hist)
        E = E/E.sum()
        confusion_matrix = confusion_matrix/confusion_matrix.sum()
        num = (confusion_matrix*weights).sum()
        den = (E*weights).sum()
        return num/den

The code gives this error: leaf variable has been moved into the graph interior.
I think the error occurs because the confusion matrix has no direct relation to the outputs of the neural network. Any ideas on how to solve it?


The problem is that _,pred = torch.max(preds.view(1,-1).type(torch.cuda.FloatTensor),1) is not a differentiable operation. So you cannot have gradients flowing back from pred to preds.
In general, if you have to set the requires_grad=True flag by hand on an intermediary value it means that an operation before was not differentiable and so you won’t get the gradients you want!

You can look around (Google or other posts on this forum) for differentiable functions to replace the argmax indices returned by torch.max, but they are all quite heuristic.
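
To illustrate the difference, here is a small sketch with made-up logits. The indices returned by torch.max are integers with no connection to the graph, whereas a softmax-based "expected class index" stays differentiable. The soft version is just one possible heuristic relaxation chosen for illustration, not a recommended replacement for the kappa loss above.

```python
import torch

# Toy logits for 2 samples and 3 classes, chosen only for illustration.
logits = torch.tensor([[2.0, 0.5, 0.1], [0.2, 0.3, 3.0]], requires_grad=True)

# Hard prediction: torch.max returns (values, indices); the indices are
# integer class labels and carry no gradient information.
_, hard_pred = torch.max(logits, 1)
print(hard_pred.dtype, hard_pred.requires_grad)

# Heuristic soft relaxation: expected class index under the softmax distribution.
probs = torch.softmax(logits, dim=1)
classes = torch.arange(logits.size(1), dtype=probs.dtype)
soft_pred = (probs * classes).sum(dim=1)

# Gradients flow from the soft prediction back to the logits.
soft_pred.sum().backward()
print(soft_pred.requires_grad, logits.grad is not None)
```

Setting requires_grad = True on hard_pred by hand would only create a new leaf; it would not reconnect it to logits.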