Build your own loss function in PyTorch

@Ismail_Elezi because the targets are constants, you don’t need to compute the gradients through Y. You can say that your effective target is actually Y, and not y, and Y does not require gradient (requires_grad=False).
So all you need to do is perform the operations that convert y into Y without using Variables, and then wrap the resulting Y in a Variable.
For reference, here is an implementation of convert_y that does not require a for loop and runs efficiently.

def convert_y2(y):
    s = y.size(0)
    y_expand = y.unsqueeze(0).expand(s, s)
    Y = y_expand.eq(y_expand.t())
    return Y

@fmassa

You’re absolutely right.

About your function, it returns a ByteTensor which means that Y cannot be multiplied with X (which is a FloatTensor). There should be something that allows casting a Tensor to some other type, right?

Yes, you can cast the ByteTensor to any other type by using the following, which is described in the documentation

a = torch.ByteTensor([0,1,0])
b = a.float() # converts to float
c = a.type('torch.FloatTensor') # converts to float as well

Possible shortcuts for the conversion are the following:

  • .byte()
  • .short()
  • .char()
  • .int()
  • .long()
  • .float()
  • .double()
  • .half() # for cuda only at the moment
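
For readers on a recent PyTorch release, here is a minimal sketch of the same idea: build the label-equality matrix and cast it to float in one step. It assumes a version with broadcasting, where .eq() returns a boolean tensor rather than a ByteTensor; the name label_similarity is only illustrative and does not come from this thread.

```python
import torch

def label_similarity(y):
    # y: 1-D tensor of integer class labels, shape (s,)
    # Compare every label with every other label -> (s, s) boolean matrix
    same = y.unsqueeze(0).eq(y.unsqueeze(1))
    # Cast to float so it can be multiplied with a float similarity matrix
    return same.float()

y = torch.tensor([0, 1, 0])
Y = label_similarity(y)
print(Y)
```

Because Y is built only from the constant labels, it does not require gradients and can be used directly as the target.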

Excellent! Thanks a lot!

I think that I still need to fully understand how these functions work (read about them in detail), but everything is running now. Of course, the ANN isn’t training properly (a lot of NaNs immediately after the first iteration), but that is something I need to investigate by looking at the gradients’ values.

@Ismail_Elezi Just to be sure: the first version of the snippet that I sent you had some numerical instabilities that could sometimes lead to NaN, but I fixed it a couple of minutes later.

@fmassa

I am using the version that is written in this thread, and it is giving NaNs after the first iteration. Typically, when I got NaNs in the past it was either an error in differentiation or a large learning rate. Neither can be the cause this time, because the differentiation is automatic and I am using an extremely small learning rate (3e-7 in Adam, while I typically use 3e-4).

Anyway, I will have to read the documentation to find out how to print the gradients, which might give me an idea of what is going wrong. And then, if I have problems, I will open another thread.

Thanks for everything!

You can use hooks e.g.:

x = Variable(torch.randn(5, 5), requires_grad=True)
y = Variable(torch.randn(5, 5), requires_grad=True)
z = x + y
# this will work only in Python3
z.register_hook(lambda g: print(g)) 
# if you're using Python2 do this:
# def print_grad(g):
#     print g
# z.register_hook(print_grad)
q = z.sum()
q.backward()

This will print the gradient w.r.t. z at each backward.


Excellent, I will try it and see what is going wrong.

Great discussion about defining your own loss function. I would love to have a simple example of creating a custom loss.
I’m also confused about how to create a loss function in PyTorch. Like the author, I don’t really understand what we need to do or which functions we can use to define our own loss function. It would be nice to have a complete example for a “Similarity-Matrix” loss or something like “Triplet-Loss”.

About the NaNs in your results: they are related to the loss function. I was implementing something like that in TensorFlow and I got NaNs too. Based on my experiments, it looks like the gradient of the diagonal of the similarity matrix causes the NaNs. I modified my code to skip the diagonal (taking just the upper-right triangle of the matrix, since the lower-left one is the same).

@melgor You have an example of Triplet loss here

I hope to PR it to the main repository in a few days.
I’m still testing and tuning parameters, since I’m not yet getting the same performance as in LuaTorch.


@apaszke

In your example, it just prints a Tensor of shape 5x5 with all ones in it. If I use a working example (for example the tutorial on CIFAR-10 dataset: https://github.com/pytorch/tutorials/blob/master/Deep%20Learning%20with%20PyTorch.ipynb) - doing a single iteration - and I write:

loss.register_hook(lambda g: print(g))

I get:

Variable containing:
1
[torch.FloatTensor of size 1]

Variable containing:
1
[torch.FloatTensor of size 1]

which isn’t very helpful. Now, on my example, if I want all the gradients which are computed on my loss function, how can I use register_hook to do so?

Having a tensor filled with ones is expected in my example, because that’s the gradient w.r.t. z. You can register the hook on any Variable whose gradient you want to inspect. If you want the gradients of everything, you will need to register a hook on every intermediate output.
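
As a rough sketch of registering a hook on every intermediate output, the snippet below stores each gradient in a dictionary and reports whether it contains NaNs. It assumes a recent PyTorch release where hooks are registered directly on tensors; watch and grads are hypothetical names used only for illustration, and the tiny computation stands in for a real model.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 3)
w = torch.randn(3, 2, requires_grad=True)

grads = {}  # name -> gradient captured during backward

def watch(name, tensor):
    # Hypothetical helper: store the gradient w.r.t. `tensor` under `name`.
    # The hook returns None, so the gradient itself is left unchanged.
    tensor.register_hook(lambda g: grads.__setitem__(name, g.detach().clone()))
    return tensor

h = watch("h", x @ w)                  # first intermediate output
a = watch("a", torch.tanh(h))          # second intermediate output
loss = watch("loss", a.pow(2).mean())  # scalar loss
loss.backward()

for name, g in grads.items():
    print(name, tuple(g.shape), "has NaN:", bool(torch.isnan(g).any()))
```

The first name whose gradient contains NaN (in backward order) points at the operation worth inspecting.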

@melgor it was giving NaNs in @fmassa’s previous implementation, but the solution posted now should be quite stable, even on the diagonal. To be sure, though, it is safer to reconstruct it from the lower-triangular (tril) matrix.

@apaszke

Thanks, that makes sense and now I found the error.

@apaszke and @melgor

@fmassa’s solution is not stable on the diagonal. The error happens in:
D = diag + diag.t() - 2*r

The diagonal here becomes zero (which is correct), but the gradient for whatever reason becomes NaN when we run the following command:

D = D.sqrt()

Could this mean that the diagonal entries are slightly smaller than 0 and then when we find the square root, they become NaNs? A cheap solution (which seems to work, for now) is to modify that line to:

D = diag + diag.t() - 2*r + 1e-7

though, I am not sure if that doesn’t break anything else (I mean, the loss is decreasing, but not sure that all the computations are correct).

On a side note, if I want to normalize the X_similarity matrix, this doesn’t seem to work:

X_similarity = (X_similarity - X_similarity.mean())/X_similarity.std()

When I tried it in an experiment with plain tensors, it worked, but here, where X_similarity is a Variable, it does not work.

@Ismail_Elezi yes, it could lead to nan because of numerical instability.
I’d say that you don’t really need the .sqrt(), I added it to make the function comparable to yours. Also, adding a small epsilon shouldn’t be a problem.
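
A small sketch of why the diagonal causes trouble, assuming a recent PyTorch release: the derivative of sqrt at exactly zero is infinite, and when that infinity is combined with the other terms during backpropagation the result is NaN. The function below mirrors the D = diag + diag.t() - 2*r line from this thread; the name pairwise_distance, the eps parameter, and the integer-valued inputs (chosen so the diagonal is exactly zero) are assumptions made for illustration.

```python
import torch

def pairwise_distance(x, eps=0.0):
    r = x @ x.t()                              # Gram matrix
    diag = r.diag().unsqueeze(0).expand_as(r)  # squared norms, repeated per row
    d2 = diag + diag.t() - 2 * r               # squared distances, zero on the diagonal
    return (d2 + eps).sqrt()                   # eps keeps sqrt away from exactly 0

def grad_has_nan(eps):
    x = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
    pairwise_distance(x, eps).sum().backward()
    return bool(torch.isnan(x.grad).any())

nan_without_eps = grad_has_nan(0.0)
nan_with_eps = grad_has_nan(1e-7)
print("NaN without eps:", nan_without_eps)
print("NaN with eps:", nan_with_eps)
```

Dropping the .sqrt() and working with squared distances avoids the issue as well, at the cost of no longer being a Euclidean distance.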

About your second issue, it doesn’t work because the mean of a Tensor is a number, but in a Variable it’s a 1D Variable.
We don’t yet have broadcasting in PyTorch, so you need to expand it by hand, as discussed in “Adding a scalar?”.
If you need to insert dimensions into your tensor for that, use the unsqueeze function.
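
A minimal sketch of expanding by hand, written against a recent PyTorch release: the scalar mean and standard deviation are reshaped to 1x1 and expanded to the shape of the matrix before the arithmetic. The input values are made up for illustration. Releases with broadcasting also accept the original one-line expression directly.

```python
import torch

X_similarity = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)

# Reshape the scalar statistics to (1, 1), then expand to the full matrix shape
mean = X_similarity.mean().reshape(1, 1).expand_as(X_similarity)
std = X_similarity.std().reshape(1, 1).expand_as(X_similarity)

X_normalized = (X_similarity - mean) / std
print(X_normalized)
```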


@fmassa

.sqrt is needed just to make it a Euclidean distance, but anyway, adding the small epsilon seems to solve the problem (alternatively, @apaszke’s idea seems worth investigating).

For the second question, good to know it. I am not sure that I even need it right now, but I’ll look into it if needed.

Thanks for everything man, you’re a lifesaver!

@apaszke BTW, what can I do in order to write a custom backward routine?


@edgarriba what’s a custom backward routine? If something is supported by autograd, you have backward for free. If not, you have to add a function as described in the notes. Keep in mind that autograd may do some optimizations that assume that backward method computes a correct gradient - if you want to mess it up in any way, use hooks.
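
For illustration, here is a sketch of what adding a function with its own backward can look like. It assumes the static-method torch.autograd.Function API of recent PyTorch releases, which may differ from the API that was current when this thread was written. Exp is only a toy example whose backward returns the true gradient; it is checked numerically with torch.autograd.gradcheck.

```python
import torch

class Exp(torch.autograd.Function):
    @staticmethod
    def forward(ctx, inp):
        result = inp.exp()
        ctx.save_for_backward(result)  # keep the output for the backward pass
        return result

    @staticmethod
    def backward(ctx, grad_output):
        (result,) = ctx.saved_tensors
        # d/dx exp(x) = exp(x), chained with the incoming gradient
        return grad_output * result

x = torch.randn(3, dtype=torch.double, requires_grad=True)
y = Exp.apply(x)
y.sum().backward()

# Compare the analytical backward against finite differences
grad_ok = torch.autograd.gradcheck(Exp.apply, (x,))
print(x.grad, grad_ok)
```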

@apaszke For example I would like to write a custom forward/backward function for generating triplets. The flow is:

  1. Forward pass the N images through net to get features
  2. Based on labels information, generate M random/hard triplets (we use features from last layer of network).
  3. These triplets go to Loss function (ex. Triplet Loss)
  4. Loss.backward() returns the gradient.
  5. Since I generated the triplets myself and the network works on single images, I need to map the gradient from the Loss (of size [M x Feature_Size x 3]) to the gradient input of the network (size [N x Feature_Size]).
  6. Model.backward()
  7. Run Optimization step

I implemented it using Lua-Torch and there was no problem with it, because I have access to the gradient of the Loss function and I can manipulate it before feeding it to the model.

Do you think that autograd will handle it, or do I need to write a custom backward pass?
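
One possible way to express this flow, sketched under the assumption of a recent PyTorch release: if the triplets are built by indexing into the feature matrix, autograd routes the loss gradient back to the [N x Feature_Size] features on its own, since indexing is a differentiable operation. The linear layer, the random inputs, and the hard-coded triplet indices are stand-ins for a real network, real images, and a real triplet-mining step.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, feature_size = 6, 4

net = torch.nn.Linear(8, feature_size)  # stand-in for the real network
images = torch.randn(N, 8)              # stand-in for N input images
features = net(images)                  # step 1: (N, feature_size)
features.retain_grad()                  # keep this gradient for inspection

# Step 2: hypothetical triplet indices into the batch (M = 5 triplets);
# a real implementation would pick them from the labels
anchor_idx = torch.tensor([0, 1, 2, 3, 4])
positive_idx = torch.tensor([1, 0, 3, 2, 5])
negative_idx = torch.tensor([2, 3, 4, 5, 0])

# Step 3: gather the triplets by indexing and compute the loss
loss = F.triplet_margin_loss(
    features[anchor_idx], features[positive_idx], features[negative_idx], margin=1.0
)

# Steps 4-6: one backward call accumulates per-triplet gradients per image
loss.backward()
print(features.grad.shape)  # gradient w.r.t. the N feature vectors
```

An optimizer step on net.parameters() would then complete step 7.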

@apaszke just to check that things are calculated correctly, since so far I haven’t managed to get the triplet network to converge.