Build your own loss function in PyTorch

  1. Yes, you don’t have to write any Lua code when you’re using PyTorch.
  2. Yes, the gradients will be computed automatically, as long as you use Variables all the time (without any .data unpacking or numpy conversions). It won’t work in your example, because you’re doing calculation on numpy arrays.
  3. Optimizers don’t need to know anything about your loss - they only need you to call .backward() on the loss Variable, so that they can see the gradient. They only need a list of Variables that you want to optimize.
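
To make point 3 concrete, here is a minimal sketch of a hand-written loss driving a stock optimizer. It assumes a recent PyTorch release (0.4 or later), where tensors created with requires_grad=True take the place of Variables. The toy regression problem and the names my_loss, w, and t are made up for illustration.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy problem: find w so that x @ w approximates t.
x = torch.randn(16, 3)
t = x @ torch.tensor([1.0, -2.0, 0.5])
w = torch.zeros(3, requires_grad=True)

def my_loss(pred, target):
    # Any composition of differentiable tensor ops can serve as a loss.
    return ((pred - target) ** 2).mean()

# The optimizer only receives the list of tensors it should update.
optimizer = torch.optim.SGD([w], lr=0.1)

losses = []
for _ in range(100):
    optimizer.zero_grad()
    loss = my_loss(x @ w, t)
    loss.backward()    # fills w.grad
    optimizer.step()   # reads w.grad and updates w in place
    losses.append(loss.item())
```

The optimizer never sees my_loss itself; it only reads the gradient that backward() leaves on w.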

Since the code performs a lot of operations, the graph recorded for the loss function alone would likely be much larger than that of your model. Because of this, I’d recommend writing your own autograd function, or thinking a bit more about how you can compute your similarity matrix. If you’re operating in Euclidean space and you rewrite the formulas, it should be possible to batch some of the computation. As far as I can see, it could be decomposed into a Gramian matrix plus some norms added to the rows and columns.
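
The decomposition mentioned above can be written out as follows, assuming the similarity is the Euclidean distance between rows x_i of an n x d matrix X (the symbols G and D are introduced here only for illustration):

```latex
\|x_i - x_j\|^2 = \|x_i\|^2 + \|x_j\|^2 - 2\, x_i^\top x_j
\qquad\Longrightarrow\qquad
D_{ij}^2 = G_{ii} + G_{jj} - 2\, G_{ij}, \quad G = X X^\top
```

So one matrix product gives the Gramian G, and its diagonal supplies the squared norms that are added along the rows and columns.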


Thanks a lot!

I rewrote everything using Torch, so now, it should work if I use loss.backward(X, y)?

Advice about writing my own autograd function and/or computing the similarity more efficiently is very welcome. It is definitely something that I need to do later, but for now I just need a simple version of this working.

Edit - It looks like it works after rewriting the final function as:

def customized_loss(X, y):
    X_similarity = Variable(similarity_matrix(X), requires_grad = True)
    association = Variable(convert_y(y), requires_grad = True)
    temp = torch.mul(X_similarity, association)
    loss_num = torch.sum(torch.mul(X_similarity, association))
    loss_all = torch.sum(X_similarity)
    loss_denum = loss_all - loss_num
    loss = loss_num/loss_denum
    return loss

All is good for now, thanks again!

No, this will not work. As I said, if you want your computation to be compatible with autograd, it needs to be executed on Variables from the start to the very end. You can’t unpack the tensors and repack them in the middle, because they won’t be connected to the initial graph and will not forward the gradient. In your snippet no grad will be sent to X and y, because the backward pass will end at X_similarity and association (they are graph leaves - their .creator is None).
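
A small sketch of the same effect, assuming a recent PyTorch (0.4 or later) where Variables and tensors are merged. Here detach() stands in for unpacking .data, and requires_grad_() for re-wrapping the result in a fresh Variable:

```python
import torch

x = torch.ones(3, requires_grad=True)

# Connected: every step is a tensor op recorded by autograd.
connected = (x * 2).sum()
connected.backward()
grad_connected = x.grad.clone()

x.grad = None  # reset before the second experiment

# Disconnected: the result is cut out of the graph and re-wrapped as a new leaf.
rewrapped = (x * 2).detach().requires_grad_()
disconnected = rewrapped.sum()
disconnected.backward()

print(grad_connected)     # tensor([2., 2., 2.])
print(x.grad)             # None: the gradient stopped at `rewrapped`
print(rewrapped.is_leaf)  # True
```

The backward pass of the second experiment fills rewrapped.grad and never reaches x.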


If you want to operate on raw tensors, and have them wrapped in Variables in a way that ensures connectivity, you have to write a new Function. Otherwise, you have to pass in Variables to your similarity_matrix, but it might be very slow like that.
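
For reference, a sketch of what such a Function could look like for the distance matrix. It uses the static-method style of torch.autograd.Function found in current PyTorch releases, which differs from the API available when this thread was written. The class name PairwiseDistance and the choice to define the gradient as zero where a distance is exactly zero are assumptions made for this example.

```python
import torch

class PairwiseDistance(torch.autograd.Function):
    """Hypothetical example: Euclidean distance matrix with a hand-written backward."""

    @staticmethod
    def forward(ctx, x):
        # Inside forward, ops run on plain tensors; nothing is recorded.
        diff = x.unsqueeze(1) - x.unsqueeze(0)   # (n, n, d)
        dist = diff.pow(2).sum(-1).sqrt()        # (n, n)
        ctx.save_for_backward(x, dist)
        return dist

    @staticmethod
    def backward(ctx, grad_out):
        x, dist = ctx.saved_tensors
        diff = x.unsqueeze(1) - x.unsqueeze(0)
        # d dist[i, j] / d x[i] = diff[i, j] / dist[i, j]; taken as 0 where dist is 0.
        safe = torch.where(dist > 0, dist, torch.ones_like(dist))
        unit = diff / safe.unsqueeze(-1)
        # x[i] enters row i and, with the opposite sign, column i.
        g = grad_out + grad_out.t()
        return (g.unsqueeze(-1) * unit).sum(dim=1)

torch.manual_seed(0)
x = torch.rand(4, 2, requires_grad=True)
dist = PairwiseDistance.apply(x)
dist.sum().backward()   # x.grad now holds the gradient
```

The whole distance computation becomes a single node in the graph, which is the point of writing a Function when the loss itself does many small operations.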

I think that I got lost now. Rewriting the function one more time (in a probably more readable way):

X = Variable(torch.Tensor([[0.6946, 0.1328], 
                           [0.6563, 0.6873], 
                           [0.8184, 0.8047], 
                           [0.8177, 0.4517], 
                           [0.1673, 0.2775], 
                           [0.6919, 0.0439],
                           [0.4659, 0.3032],
                           [0.3481, 0.1996]]))

y = Variable(torch.Tensor([1.0, 3.0, 2.0, 2.0, 3.0, 1.0, 2.0, 3.0]))

def customized_loss(X, y):
    def similarity_matrix(mat):
        a = mat.size()
        a = a[0]
        simMatrix = Variable(torch.zeros(a,a), requires_grad = True)
        for i in xrange(a):
            for j in xrange(a):
                simMatrix[i][j] = torch.norm(mat[i] - mat[j])         
        return simMatrix 
    def convert_y(y):
        a = y.size()
        a = a[0]
        converted_y = Variable(torch.zeros(a,a), requires_grad = True)
        for i in xrange(a):
            for j in xrange(a):
                if y[i] == y[j]:
                    converted_y[i, j] = 1
        return converted_y

    X_similarity = similarity_matrix(X)
    association = convert_y(y)
    loss_num = torch.sum(torch.mul(X_similarity, association))
    loss_all = torch.sum(X_similarity)
    loss_denum = loss_all - loss_num
    loss = loss_num/loss_denum
    return loss

loss = customized_loss(X, y)

As far as I can see, everything is now done in Variables (from beginning to end). We are giving X and y (which are Variables) to the function, and then everything is done in Variables. The only other Variable that I need to define is simMatrix in the similarity_matrix function, and there I am getting this error:

RuntimeError: in-place operations can be only used on variables that don't share storage with any other variables, but detected that there are 2 objects sharing it.

Of course, the same thing happens in convert_y function when I create the converted_y Variable.

And I have no clue what is going wrong, and googling this error doesn’t show any results.

You have already spent some time here, so thanks for that, but if you could guide me on how to fix this problem (or write the fix, if it is quick), it would be awesome. From the PyTorch tutorial about Variables it is not clear to me what I am doing wrong (I have never used Torch before). I guess the problem is that I am implicitly creating a new Variable in the middle of the graph, but is there any way around it?

  1. So, the problem is: if I define simMatrix as a Variable, we have this problem with shared storage; if we don’t define it as a Variable (which wouldn’t make much sense, because we want its gradients in the backprop), then we get the error ‘can’t assign a Variable to a scalar value of type float’, which makes perfect sense.

  2. The other problem is that it seems I cannot compare y[i] with y[j] in the convert_y function. Because they are Variables, they are not comparable, while if I use y[i].data (which likely causes problems during backprop), strangely enough it raises a RuntimeError saying that ‘bool value of non-empty torch.ByteTensor objects is ambiguous’.

Is there a solution around this?

@Ismail_Elezi As @apaszke said, you can compute the similarity matrix for the L2 distance using only matrix operations.
Here is an implementation for your similarity_matrix using only matrix operations. It can run on the GPU and is going to be significantly faster than your previous implementation.

# (x - y)^2 = x^2 - 2*x*y + y^2
def similarity_matrix(mat):
    # get the product x * y
    # here, y = x.t()
    r = torch.mm(mat, mat.t())
    # get the diagonal elements
    diag = r.diag().unsqueeze(0)
    diag = diag.expand_as(r)
    # compute the distance matrix
    D = diag + diag.t() - 2*r
    return D.sqrt()
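
A quick sanity check that the matrix formulation matches the double loop, written as a sketch for a recent PyTorch (0.4 or later). The function names similarity_loop and similarity_gram are made up here, and the clamp(min=0) is an added safeguard that is not part of the snippet above.

```python
import torch

def similarity_loop(mat):
    # Reference version: build the matrix with torch.stack instead of
    # assigning into a preallocated tensor, so no in-place writes are needed.
    n = mat.size(0)
    rows = [torch.stack([(mat[i] - mat[j]).norm() for j in range(n)])
            for i in range(n)]
    return torch.stack(rows)

def similarity_gram(mat):
    r = torch.mm(mat, mat.t())                 # Gramian: r[i, j] = <x_i, x_j>
    diag = r.diag().unsqueeze(0).expand_as(r)  # squared norms along the columns
    d2 = diag + diag.t() - 2 * r
    # Assumption: rounding can push tiny entries slightly below zero, so clamp first.
    return d2.clamp(min=0).sqrt()

torch.manual_seed(0)
X = torch.rand(8, 2, dtype=torch.double)
max_diff = (similarity_loop(X) - similarity_gram(X)).abs().max().item()
print(max_diff)
```

The two versions should agree up to floating-point rounding; the matrix version replaces n * n small operations with a handful of large ones.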

If you are not backpropagating through y, no need to wrap it all in variables, just wrap the last result.


@fmassa Thanks for your solution. I definitely need to refresh my linear algebra skills. It really solves the first part.

About the “if you are not backpropagating through y” part… I am a bit confused. Essentially, the algorithm is:

  1. Build a CNN that on the final layer has 2 neurons.
  2. Transform the output of those 2 neurons (a tensor of shape n x 2) to similarity matrix. Call it X.
  3. Transform the labels y into a n x n tensor (where ij-th element is 1 if i and j belong to the same cluster, 0 otherwise). Call it Y.
  4. Do an elementwise multiplication of X and Y.
  5. Do the extra stuff: a sum, some subtraction, etc.

While I do not need to backprop through y, Y is multiplied with X (and Y comes from y), so I think that I need to backprop through Y, right?


@Ismail_Elezi because the targets are constants, you don’t need to compute the gradients through Y. You can say that your effective target is actually Y, and not y, and Y does not require gradient (requires_grad=False).
So all you need to do is do the operations converting y in Y without using Variables, and then wrap the resulting Y in a variable.
For reference, here is an implementation of convert_y that does not require a for loop, and can be efficiently performed.

def convert_y2(y):
    s = y.size(0)
    y_expand = y.unsqueeze(0).expand(s, s)
    Y = y_expand.eq(y_expand.t())
    return Y
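
As a small worked example of what this label matrix looks like, here is the same comparison written with broadcasting. This is a sketch for a recent PyTorch, where eq returns a bool tensor rather than a ByteTensor; the three labels are made up.

```python
import torch

y = torch.tensor([1.0, 3.0, 1.0])

# Comparing a (1, n) view against an (n, 1) view yields the n x n matrix
# with Y[i, j] = (y[i] == y[j]).
Y = y.unsqueeze(0).eq(y.unsqueeze(1))

print(Y.float())
# tensor([[1., 0., 1.],
#         [0., 1., 0.],
#         [1., 0., 1.]])
```

Samples 0 and 2 share a label, so entries (0, 2) and (2, 0) are set along with the diagonal.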


You’re absolutely right.

About your function, it returns a ByteTensor which means that Y cannot be multiplied with X (which is a FloatTensor). There should be something that allows casting a Tensor to some other type, right?

Yes, you can cast the ByteTensor to any other type by using the following, which is described in the documentation:

a = torch.ByteTensor([0,1,0])
b = a.float() # converts to float
c = a.type('torch.FloatTensor') # converts to float as well

Possible shortcuts for the conversion are the following:

  • .byte()
  • .short()
  • .char()
  • .int()
  • .long()
  • .float()
  • .double()
  • .half() # for cuda only at the moment
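
Putting the pieces from this thread together, a sketch of the whole loss in a recent PyTorch (0.4 or later, so no Variable wrappers). The data are the X and y posted earlier. The "safe sqrt" masking in similarity_matrix is an assumption added here to keep the gradient finite where a distance is exactly zero; it is not part of the snippets above.

```python
import torch

def similarity_matrix(mat):
    r = torch.mm(mat, mat.t())
    diag = r.diag().unsqueeze(0).expand_as(r)
    d2 = (diag + diag.t() - 2 * r).clamp(min=0)
    zero = d2 == 0
    # Assumption: swap exact zeros for 1 before sqrt, then set them back to 0,
    # so sqrt's infinite slope at 0 never enters the backward pass.
    return torch.where(zero, torch.ones_like(d2), d2).sqrt().masked_fill(zero, 0.0)

def convert_y(y):
    # Constant target: built from labels only, cast to float for the product.
    return y.unsqueeze(0).eq(y.unsqueeze(1)).float()

def customized_loss(X, y):
    X_similarity = similarity_matrix(X)
    association = convert_y(y)
    loss_num = torch.sum(X_similarity * association)
    loss_all = torch.sum(X_similarity)
    return loss_num / (loss_all - loss_num)

X = torch.tensor([[0.6946, 0.1328], [0.6563, 0.6873], [0.8184, 0.8047],
                  [0.8177, 0.4517], [0.1673, 0.2775], [0.6919, 0.0439],
                  [0.4659, 0.3032], [0.3481, 0.1996]], requires_grad=True)
y = torch.tensor([1.0, 3.0, 2.0, 2.0, 3.0, 1.0, 2.0, 3.0])

loss = customized_loss(X, y)
loss.backward()   # X.grad now holds d loss / d X
```

Only X requires a gradient; the label matrix is treated as a constant, as suggested above.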

Excellent! Thanks a lot!

I think that I still need to fully understand how these functions work (read them in detail), but everything is working now. Of course, the ANN isn’t working (a lot of NaNs immediately after the first iteration), but that is something that I need to investigate by looking at the gradients’ values.

@Ismail_Elezi Just to be sure, the first version of the snippet that I sent you had some numerical instabilities that could sometimes lead to NaN, but I fixed it a couple of minutes later.


I am using the version that is written in this thread, and it is giving NaNs after the first iteration. Typically, when I got NaNs in the past it was either an error in differentiation or a large learning rate. Neither can be the cause this time, because this is automatic differentiation and I am using extremely small learning rates (3e-7 in Adam, while typically I use 3e-4).

Anyway, I will have to read the documentation to find out how to print the gradients, which might give me an idea of what is going wrong. And then, if I have problems, I will make another thread.

Thanks for everything!

You can use hooks e.g.:

x = Variable(torch.randn(5, 5), requires_grad=True)
y = Variable(torch.randn(5, 5), requires_grad=True)
z = x + y
# this will work only in Python3
z.register_hook(lambda g: print(g)) 
# if you're using Python2 do this:
# def print_grad(g):
#     print g
# z.register_hook(print_grad)
q = z.sum()

This will print the gradient w.r.t. z at each backward.


Excellent, I will try it and see what is going wrong.

Great discussion about defining your own loss function. I would love to have a simple example of creating one.
I’m quite confused about the idea of creating a loss function in torch. Like the author, I do not really get what we need to do or which functions we can use if we want to define our own loss function. It would be nice to have a complete example for “Similarity-Matrix” or something like “Triplet-Loss”.

About the NaNs in your results: they are related to the loss function. I was implementing something like that in TensorFlow and I got NaNs too. Based on my experiments, it looks like the gradient of the diagonal of the similarity matrix causes the NaNs. I modified it to skip the diagonal (taking just the upper right triangle of the matrix, since the lower left is the same).
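
A sketch of both observations in a recent PyTorch. The first part shows how a square root evaluated at exactly zero can turn a gradient into NaN. The second part computes the distances for the strict upper triangle only, using torch.triu_indices, so the diagonal never enters the graph. The sizes and names are made up for illustration.

```python
import torch

# 1) sqrt has an infinite slope at 0; multiplied by a zero factor on the
#    way back, the gradient becomes inf * 0 = NaN.
a = torch.tensor(1.0, requires_grad=True)
(a * 0).sqrt().backward()
print(a.grad)   # tensor(nan)

# 2) Distances for pairs i < j only, skipping the diagonal entirely.
torch.manual_seed(0)
X = torch.rand(5, 2, requires_grad=True)
i, j = torch.triu_indices(5, 5, offset=1)   # index pairs above the diagonal
dist = (X[i] - X[j]).norm(dim=1)            # 10 pairwise distances
dist.sum().backward()
print(torch.isfinite(X.grad).all())
```

Because the distance matrix is symmetric, the upper triangle carries all the information; a loss summed over the full matrix corresponds to twice the sum over these pairs.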

@melgor You have an example of Triplet loss here

I hope in some days I’ll PR it to the main repository.
Until now still testing and tuning parameters since I’m not getting the same performance as in LuaTorch.



In your example, it just prints a Tensor of shape 5x5 with all ones in it. If I use a working example (for example, the tutorial on the CIFAR-10 dataset), do a single iteration, and write:

loss.register_hook(lambda g: print(g))

I get:

Variable containing:
[torch.FloatTensor of size 1]

Variable containing:
[torch.FloatTensor of size 1]

which isn’t very helpful. Now, on my example, if I want all the gradients which are computed on my loss function, how can I use register_hook to do so?

Having a tensor filled with ones is expected in my example, because that’s the gradient w.r.t. z. You can register the hook on any Variable of which gradient you want to inspect. If you want all gradients of everything you will need to register a hook on every intermediate output.
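
One way to do that without printing from every hook is to collect the gradients in a dictionary. This is a sketch for a recent PyTorch; the helper save_grad and the names h and z are made up for the example.

```python
import torch

grads = {}

def save_grad(name):
    # Returns a hook that stores the incoming gradient under `name`.
    def hook(g):
        grads[name] = g.clone()
    return hook

x = torch.ones(2, 2, requires_grad=True)
h = x * 3        # intermediate result
z = h.pow(2)     # intermediate result
loss = z.sum()

# One hook per intermediate tensor whose gradient should be inspected.
h.register_hook(save_grad("h"))
z.register_hook(save_grad("z"))
loss.backward()

print(grads["z"])  # all ones: d loss / d z
print(grads["h"])  # all 6:    2 * h
print(x.grad)      # all 18:   3 * 2 * h
```

Checking each stored tensor with torch.isnan(...).any() then shows the first point in the backward pass where NaNs appear.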

@melgor it was giving NaNs in @fmassa’s previous implementation, but the solution posted now should be quite stable, even on the diagonal. But to be sure, it is safer to reconstruct it from the TRIL matrix.