# Build your own loss function in PyTorch

Hi all! I started using PyTorch today and it seems more natural to me than TensorFlow. However, I need to write a customized loss function. While it would be nice to be able to write any loss function, mine is a bit specific, so here it is (written in torch):

```
import numpy as np
import torch

X = np.asarray([[0.6946, 0.1328], [0.6563, 0.6873], [0.8184, 0.8047], [0.8177, 0.4517],
                [0.1673, 0.2775], [0.6919, 0.0439], [0.4659, 0.3032], [0.3481, 0.1996]], dtype=np.float32)
X = torch.from_numpy(X)
y = np.asarray((1,3,2,2,3,1,2,3), dtype=np.float32)
y = torch.from_numpy(y)

def similarity(i, j):
    ''' This function defines the (dis)similarity between vectors i and j
    inputs: i, j - vectors of the same length
    output: dist - Euclidean distance between i and j (0 means identical) '''

    dist = torch.norm(i - j)
    return dist

def similarity_matrix(mat):
    ''' This function creates the similarity matrix of a dataset
    input: mat - dataset in matrix format (one sample per row)
    output: simMatrix - the similarity matrix '''

    a = mat.size()
    a = a[0]
    simMatrix = torch.zeros((a,a))
    for i in range(a):
        for j in range(a):
            simMatrix[i][j] = similarity(mat[i], mat[j])
    return simMatrix

def convert_y(y):
    # 0/1 matrix: entry (i, j) is 1 if samples i and j share a label
    n = y.size()
    n = n[0]
    converted_y = torch.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if y[i] == y[j]:
                converted_y[i, j] = 1
    return converted_y

def customized_loss(X, y):
    X_similarity = similarity_matrix(X)
    association = convert_y(y)
    loss_num = torch.sum(torch.mul(X_similarity, association))
    loss_all = torch.sum(X_similarity)
    loss_denum = loss_all - loss_num
    loss = loss_num/loss_denum
    return loss

loss = customized_loss(X, y)
print(loss)
```

Now, of course, since I am going to use it as the final layer of the neural net, I would need to compute its gradients and then use them in backpropagation.

Explaining the function a bit:

I first transform the input data space into a kind of similarity matrix (0 means the data points are the same; the higher the number in the ij-th entry, the higher the dissimilarity). Then, to find the intra-cluster loss, I multiply this matrix elementwise with a 0/1 matrix whose ij-th entry is 1 if elements i and j are in the same cluster and 0 otherwise. The inter-cluster loss is found similarly (in the code, as the total sum minus the intra-cluster sum), and finally we just divide the two losses.

My questions are:

1. Can this be done in PyTorch, without writing Lua code?
2. Can the gradients of this be computed in an automatic way (torch autograd)?
3. Can such a loss function be given as input in optim.SGD? (optim.X in general case where X is the optimization algorithm)

Thanks for any answer, or possible hint.

1. Yes, you don't have to write any Lua code when you're using PyTorch.
2. Yes, the gradients will be computed automatically, as long as you use `Variable`s all the time (without any `.data` unpacking or numpy conversions). It won't work in your example, because you're doing calculation on numpy arrays.
3. Optimizers don't need to know anything about your loss - they only need you to call `.backward()` on the loss `Variable`, so that they can see the gradient. They only need a list of `Variable`s that you want to optimize.

Since the code does a lot of operations, the graph recording just the loss function would likely be much larger than that of your model. Because of this, I'd recommend writing your own autograd function, or thinking a bit more about how you can compute your similarity matrix. If you're operating in Euclidean space and you rewrite the formulas, it should be possible to batch some of the computation. As far as I can see, it could be decomposed into a Gramian matrix plus some norms added to the rows and columns.
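
For reference, the decomposition hinted at here can be written out as follows. The symbols are notation introduced only for this note, not taken from the post: $x_i$ is the i-th row of the data matrix $X$, $G$ is the Gram matrix, and $D$ is the matrix of pairwise Euclidean distances.

```latex
D_{ij}^2 = \lVert x_i - x_j \rVert^2
         = \lVert x_i \rVert^2 + \lVert x_j \rVert^2 - 2\, x_i^\top x_j
         = G_{ii} + G_{jj} - 2\, G_{ij},
\qquad G = X X^\top
```

So the whole matrix of squared distances comes from one matrix product plus the diagonal of $G$ broadcast along rows and columns, with no Python loop over pairs.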

Thanks a lot!

I rewrote everything using Torch, so now, it should work if I use loss.backward(X, y)?

Advice about writing my own autograd function and/or computing the similarity more efficiently is very welcome. It is definitely something that I need to do later, but for now I just need a simple version of this working.

Edit - It looks like it works if I rewrite the final function as:

```
from torch.autograd import Variable

def customized_loss(X, y):
    X_similarity = Variable(similarity_matrix(X), requires_grad = True)
    association = Variable(convert_y(y), requires_grad = True)
    temp = torch.mul(X_similarity, association)
    loss_num = torch.sum(torch.mul(X_similarity, association))
    loss_all = torch.sum(X_similarity)
    loss_denum = loss_all - loss_num
    loss = loss_num/loss_denum
    return loss
```

All is good for now, thanks again!

No, this will not work. As I said, if you want your computation to be compatible with autograd, it needs to be executed on Variables from the start to the very end. You can't unpack the tensors and repack them in the middle, because they won't be connected to the initial graph, and will not forward the gradient. In your snippet no grad will be sent to `X` and `y`, because the backward will end on `X_similarity` and `association` (they are graph leaves - their `.creator` is None).

If you want to operate on raw tensors, and have them wrapped in Variables in a way that ensures connectivity, you have to write a new Function. Otherwise, you have to pass in Variables to your `similarity_matrix`, but it might be very slow like that.
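
To illustrate the point about graph connectivity, here is a minimal sketch. It assumes a recent PyTorch release in which tensors track gradients directly (no `Variable` wrapper), so it is not the API used in the posts above. The names `connected`, `rewrapped`, and `disconnected` are made up for the illustration.

```python
import torch

x = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)

# Connected: every operation is applied to x itself, so backward() reaches x.
connected = (x * 2).sum()
connected.backward()
grad_connected = x.grad.clone()

# Disconnected: taking the raw values and re-wrapping them creates a new
# graph leaf, so backward() stops there and never reaches x.
x.grad = None
rewrapped = (x * 2).detach().requires_grad_(True)
disconnected = rewrapped.sum()
disconnected.backward()
grad_disconnected = x.grad  # nothing was propagated back to x

print(grad_connected)
print(grad_disconnected)
print(rewrapped.grad)
```

In the second half the gradient lands on `rewrapped`, the new leaf, which mirrors what happens when an intermediate result is wrapped in a fresh `Variable`.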

I think that I got lost now. Rewriting the function once more (in a probably more readable way):

```
import torch
from torch.autograd import Variable

X = Variable(torch.Tensor([[0.6946, 0.1328],
                           [0.6563, 0.6873],
                           [0.8184, 0.8047],
                           [0.8177, 0.4517],
                           [0.1673, 0.2775],
                           [0.6919, 0.0439],
                           [0.4659, 0.3032],
                           [0.3481, 0.1996]]))

y = Variable(torch.Tensor([1.0, 3.0, 2.0, 2.0, 3.0, 1.0, 2.0, 3.0]))

def customized_loss(X, y):

    def similarity_matrix(mat):
        a = mat.size()
        a = a[0]
        simMatrix = Variable(torch.zeros(a,a), requires_grad = True)
        for i in range(a):
            for j in range(a):
                # in-place write into a Variable: this is where the error is raised
                simMatrix[i][j] = torch.norm(mat[i] - mat[j])
        return simMatrix

    def convert_y(y):
        a = y.size()
        a = a[0]
        converted_y = Variable(torch.zeros(a,a), requires_grad = True)
        for i in range(a):
            for j in range(a):
                if y[i] == y[j]:
                    converted_y[i, j] = 1
        return converted_y

    X_similarity = similarity_matrix(X)
    association = convert_y(y)
    loss_num = torch.sum(torch.mul(X_similarity, association))
    loss_all = torch.sum(X_similarity)
    loss_denum = loss_all - loss_num
    loss = loss_num/loss_denum
    return loss

loss = customized_loss(X, y)
```

As far as I can see, everything is now done in Variables (from beginning to end). We are giving X and y (which are Variables) to the function, and then everything is done in Variables. The only other Variable that I need to define is simMatrix in the similarity_matrix function, and there I get this error:

```
RuntimeError: in-place operations can be only used on variables that don't share storage with any other variables, but detected that there are 2 objects sharing it.
```

Of course, the same thing happens in convert_y function when I create the converted_y Variable.

And I have no clue what is going wrong, and googling this error doesn't show any results.

…

You already spent some time here, so thanks for that, but if you could guide me on how to fix this problem (or write the fix if it is a quick one) it would be awesome. From the PyTorch tutorial about Variables it is not clear to me what I am doing wrong (I haven't ever used Torch). I guess that the problem is that I am implicitly creating a new Variable in the middle of the graph, but is there any way around it?

1. So, the problem is: if I define simMatrix as a Variable we have this problem with sharing storage; if we don't define it as a Variable (which wouldn't make much sense because we want its gradients in the backprop), then we also get an error, "can't assign a Variable to a scalar value of type float", which makes perfect sense.

2. The other problem is that it seems that I cannot compare y[i] with y[j] in the convert_y function. Because they are Variables they are not comparable, while if I use y[i].data (which likely causes problems during back-prop), strangely enough it raises a runtime error saying that "bool value of non-empty torch.ByteTensor objects is ambiguous".

Is there a solution around this?

@Ismail_Elezi As @apaszke said, you can compute the similarity matrix for the L2 distance using only matrix operations.
Here is an implementation for your `similarity_matrix` using only matrix operations. It can run on the GPU and is going to be significantly faster than your previous implementation.

```
# (x - y)^2 = x^2 - 2*x*y + y^2
def similarity_matrix(mat):
    # get the product x * y
    # here, y = x.t()
    r = torch.mm(mat, mat.t())
    # get the diagonal elements
    diag = r.diag().unsqueeze(0)
    diag = diag.expand_as(r)
    # compute the distance matrix
    D = diag + diag.t() - 2*r
    return D.sqrt()
```

If you are not backpropagating through `y`, no need to wrap it all in variables, just wrap the last result.

@fmassa Thanks for your solution. I definitely need to refresh my linear algebra skills. It really solves the first part.

About the "if you are not backpropagating through y" part… I am a bit confused. Essentially, the algorithm is:

1. Build a CNN that on the final layer has 2 neurons.
2. Transform the output of those 2 neurons (a tensor of shape n x 2) to similarity matrix. Call it X.
3. Transform the labels y into a n x n tensor (where ij-th element is 1 if i and j belong to the same cluster, 0 otherwise). Call it Y.
4. Do an elementwise multiplication of X and Y.
5. Do the extra stuff: sums, some subtraction, etc.

While I do not need to backprop through y, Y is multiplied with X (and Y comes from y), so I think that I need to backprop through Y, right?

Cheers!

@Ismail_Elezi because the targets are constants, you don't need to compute the gradients through Y. You can say that your effective target is actually Y, and not y, and Y does not require gradient (`requires_grad=False`).
So all you need to do is perform the operations converting y into Y without using Variables, and then wrap the resulting Y in a Variable.
For reference, here is an implementation of `convert_y` that does not require a `for` loop and can be performed efficiently.

```
def convert_y2(y):
    s = y.size(0)
    y_expand = y.unsqueeze(0).expand(s, s)
    Y = y_expand.eq(y_expand.t())
    return Y
```

@fmassa

You're absolutely right.

About your function, it returns a ByteTensor which means that Y cannot be multiplied with X (which is a FloatTensor). There should be something that allows casting a Tensor to some other type, right?

Yes, you can cast the `ByteTensor` to any other type by using the following, which is described in the documentation:

```
a = torch.ByteTensor([0,1,0])
b = a.float() # converts to float
c = a.type('torch.FloatTensor') # converts to float as well
```

Possible shortcuts for the conversion are the following:

• `.byte()`
• `.short()`
• `.char()`
• `.int()`
• `.long()`
• `.float()`
• `.double()`
• `.half()` (for CUDA only at the moment)

Excellent! Thanks a lot!

I think that I still need to fully understand how these functions work (read them in detail), but everything is working now. Of course, the ANN isn't working (a lot of NaNs immediately after the first iteration), but that is something that I need to investigate by looking at the gradients' values.
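
For readers following along, the pieces posted in this thread can be combined roughly as below. This is only a sketch of the forward computation under some assumptions: it uses a recent PyTorch where plain tensors replace `Variable`, a boolean comparison plus `.float()` instead of the `ByteTensor` cast, and a `clamp(min=0)` before the square root that is not in the posted code. `loop_loss` is a hypothetical helper added just to check the result against the original double loop. Gradient behaviour is not addressed here.

```python
import torch

X = torch.tensor([[0.6946, 0.1328], [0.6563, 0.6873], [0.8184, 0.8047], [0.8177, 0.4517],
                  [0.1673, 0.2775], [0.6919, 0.0439], [0.4659, 0.3032], [0.3481, 0.1996]])
y = torch.tensor([1.0, 3.0, 2.0, 2.0, 3.0, 1.0, 2.0, 3.0])

def similarity_matrix(mat):
    # Pairwise Euclidean distances from the Gram matrix.
    r = mat @ mat.t()
    sq_norms = r.diag().unsqueeze(0)
    d2 = sq_norms + sq_norms.t() - 2 * r
    # Assumption: clamp tiny negative rounding errors before the square root.
    return d2.clamp(min=0).sqrt()

def convert_y(y):
    # Entry (i, j) is 1.0 when samples i and j share a label, else 0.0.
    return (y.unsqueeze(0) == y.unsqueeze(1)).float()

def customized_loss(X, y):
    sim = similarity_matrix(X)
    same = convert_y(y)
    intra = (sim * same).sum()
    inter = sim.sum() - intra
    return intra / inter

def loop_loss(X, y):
    # Hypothetical reference: the original double loop, kept only as a check.
    n = X.size(0)
    num = torch.zeros(())
    total = torch.zeros(())
    for i in range(n):
        for j in range(n):
            d = torch.norm(X[i] - X[j])
            total = total + d
            if y[i] == y[j]:
                num = num + d
    return num / (total - num)

loss = customized_loss(X, y)
reference = loop_loss(X, y)
print(loss.item(), reference.item())
```

The two printed values should agree up to floating-point rounding, since both compute the same ratio of intra-cluster to inter-cluster distance sums.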

@Ismail_Elezi Just to be sure, the first version of the snippet that I sent you had some numerical instabilities that could sometimes lead to `nan`, but I fixed it a couple of minutes later.

@fmassa

I am using the version that is written in this thread, and it is giving NaNs after the first iteration. Typically, when I got NaNs in the past it was either an error in differentiation or a large learning rate. Neither can be the cause this time, because differentiation is automatic and I am using extremely small learning rates (3e-7 in Adam, while typically I use 3e-4).

Anyway, I will have to read the documentation to find out how to print the gradients, which might give me an idea of what is going wrong. And then, if I have problems, I will make another thread.

Thanks for everything!

You can use hooks e.g.:

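```
import torch
from torch.autograd import Variable

x = Variable(torch.randn(5, 5), requires_grad=True)
y = Variable(torch.randn(5, 5), requires_grad=True)
z = x + y
# this will work only in Python3
z.register_hook(lambda g: print(g))
# if you're using Python2 do this:
# def print_grad(g):
#     print g
# z.register_hook(print_grad)
q = z.sum()
q.backward()
```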

This will print the gradient w.r.t. `z` at each backward.

Excellent, I will try it and see what is going wrong.

Great talk about defining your own loss function. I would love to have a simple example of creating one.
I'm also confused about the idea of creating a loss function in torch. Like the author, I do not really get what we need to do, or which functions we can use, if we want to define our own loss function. It would be nice to have a complete example for a "Similarity-Matrix" or something like "Triplet-Loss".

About the NaNs in your results: they are related to the loss function. I was implementing something like that in TensorFlow and I got NaNs too. Based on experiments, it looks like the gradient of the diagonal of the similarity matrix causes the NaNs. I modified it to skip the diagonal (so I take just the upper-right triangle of the matrix, as the lower-left is just the same).
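
The observation about the diagonal is consistent with the derivative of the square root being unbounded at zero: each sample is at distance 0 from itself, so the backward pass through `sqrt` produces an infinity there. The sketch below reproduces this on three of the sample points and shows one possible workaround, clamping the squared distances to a small `eps` before the square root. It assumes a recent PyTorch with tensor-based autograd. The `eps` clamp and the helper names are illustrative assumptions, not something taken from the thread; masking out the diagonal, as described above, is another option.

```python
import torch

X = torch.tensor([[0.6946, 0.1328], [0.6563, 0.6873], [0.8184, 0.8047]])

def squared_distances(mat):
    # Squared pairwise distances via the Gram matrix; the diagonal is exactly 0.
    r = mat @ mat.t()
    sq_norms = r.diag().unsqueeze(0)
    return sq_norms + sq_norms.t() - 2 * r

def grad_of_distance_sum(mat, eps=None):
    # Gradient of the sum of all pairwise distances with respect to the inputs.
    mat = mat.clone().requires_grad_(True)
    d2 = squared_distances(mat)
    if eps is not None:
        # Assumed workaround: keep sqrt away from 0 so its derivative stays finite.
        d2 = d2.clamp(min=eps)
    d2.sqrt().sum().backward()
    return mat.grad

naive_grad = grad_of_distance_sum(X)              # sqrt evaluated at exactly 0
stable_grad = grad_of_distance_sum(X, eps=1e-12)  # sqrt evaluated at >= eps

print(naive_grad)
print(stable_grad)
```

With the clamp, the diagonal entries contribute no gradient at all, which is the same effect as skipping them.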

@melgor You have an example of Triplet loss here

I hope in some days I'll PR it to the main repository.
For now I'm still testing and tuning parameters, since I'm not getting the same performance as in LuaTorch.

@apaszke

In your example, it just prints a Tensor of shape 5x5 with all ones in it. If I use a working example (for example the tutorial on CIFAR-10 dataset: https://github.com/pytorch/tutorials/blob/master/Deep%20Learning%20with%20PyTorch.ipynb) - doing a single iteration - and I write:

`loss.register_hook(lambda g: print(g))`

I get:

```
Variable containing:
1
[torch.FloatTensor of size 1]

Variable containing:
1
[torch.FloatTensor of size 1]
```

which isn't very helpful. Now, in my example, if I want all the gradients which are computed for my loss function, how can I use register_hook to do so?

Having a tensor filled with ones is expected in my example, because that's the gradient w.r.t. `z`. You can register the hook on any Variable whose gradient you want to inspect. If you want the gradients of everything, you will need to register a hook on every intermediate output.
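
As a rough sketch of what registering a hook on every intermediate output might look like, the snippet below stores each gradient in a dictionary instead of printing it. It assumes a recent PyTorch with tensor-based autograd; `grads` and `save_grad` are hypothetical names introduced for the example, and the small computation stands in for a real loss.

```python
import torch

grads = {}

def save_grad(name):
    # Returns a hook that stores the incoming gradient under `name`.
    def hook(g):
        grads[name] = g.clone()
    return hook

x = torch.ones(2, 3, requires_grad=True)
h = x * 3          # intermediate output
z = h + 1          # another intermediate output
q = (z * z).sum()  # scalar standing in for the loss

# One hook per intermediate tensor whose gradient should be inspected.
h.register_hook(save_grad("h"))
z.register_hook(save_grad("z"))
q.backward()

for name, g in grads.items():
    print(name, g)
```

Collecting the gradients this way makes it easy to scan them afterwards, for example to find the first intermediate tensor whose gradient contains a NaN.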