# Build your own loss function in PyTorch

@Ismail_Elezi Just to be sure: the first version of the snippet that I sent you had some numerical instabilities that could sometimes lead to `nan`, but I fixed it a couple of minutes later.

@fmassa

I am using the version that is written in this thread, and it is giving NaNs after the first iteration. Typically, when I got NaNs in the past, it was either an error in the differentiation or a learning rate that was too large. Neither can be the cause this time, because the differentiation is automatic and I am using extremely small learning rates (3e-7 in Adam, while I typically use 3e-4).

Anyway, I will have to read the documentation to find out how to print the gradients, which might give me an idea of what is going wrong. If I still have problems, I will open another thread.

Thanks for everything!

You can use hooks, e.g.:

```python
import torch
from torch.autograd import Variable

x = Variable(torch.randn(5, 5), requires_grad=True)
y = Variable(torch.randn(5, 5), requires_grad=True)
z = x + y
# this will work only in Python 3
z.register_hook(lambda g: print(g))
# in Python 2, print is a statement, so use a named function instead:
# def print_grad(g):
#     print g
# z.register_hook(print_grad)
q = z.sum()
q.backward()
```

This will print the gradient w.r.t. `z` at each backward.


Excellent, I will try it and see what is going wrong.

Great discussion about defining one's own loss function. I would love to see a simple example of creating a custom loss.
I'm still confused about the idea of creating a loss function in Torch. As it stands, I do not really get what we need to do, or which functions we can use, if we want to define our own loss function. It would be nice to have a complete example for a similarity matrix or something like triplet loss.

About the NaNs in your results: they are related to the loss function. I was implementing something like this in TensorFlow and got NaNs too. Based on my experiments, it looks like the gradient of the diagonal of the similarity matrix causes the NaNs. I modified it to skip the diagonal (taking just the upper-right triangle of the matrix, since the lower-left is the same).
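The diagonal-skipping idea can be sketched in NumPy (the variable names are made up for illustration; the thread's actual TensorFlow/Torch code is not shown):

```python
import numpy as np

def pairwise_sq_dists(X):
    """Squared Euclidean distances between all rows of X."""
    sq = (X * X).sum(axis=1)
    return sq[:, None] + sq[None, :] - 2.0 * X @ X.T

X = np.random.randn(6, 3)
D2 = pairwise_sq_dists(X)

# take only the strictly upper triangle, skipping the zero diagonal
iu = np.triu_indices(len(X), k=1)
dists = np.sqrt(np.maximum(D2[iu], 0.0))  # clamp tiny round-off negatives
```

Skipping the diagonal avoids taking the square root of the exact zeros where the gradient blows up.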

@melgor You have an example of Triplet loss here

I hope to PR it to the main repository in a few days.
For now I'm still testing and tuning parameters, since I'm not getting the same performance as in LuaTorch.


@apaszke

In your example, it just prints a 5x5 Tensor with all ones in it. If I use a working example (for instance the CIFAR-10 tutorial: https://github.com/pytorch/tutorials/blob/master/Deep%20Learning%20with%20PyTorch.ipynb), do a single iteration, and write:

`loss.register_hook(lambda g: print(g))`

I get:

```text
Variable containing:
1
[torch.FloatTensor of size 1]

Variable containing:
1
[torch.FloatTensor of size 1]
```

which isn’t very helpful. Now, in my example, if I want all the gradients that are computed in my loss function, how can I use register_hook to do so?

Having a tensor filled with ones is expected in my example, because that’s the gradient w.r.t. `z`. You can register a hook on any Variable whose gradient you want to inspect. If you want all the gradients of everything, you will need to register a hook on every intermediate output.
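A sketch of that approach (using the modern tensor API, where `register_hook` works on any tensor that requires grad; the variable names are made up):

```python
import torch

grads = {}

def save_grad(name):
    """Return a hook that stores the incoming gradient under `name`."""
    def hook(g):
        grads[name] = g
    return hook

x = torch.randn(5, 5, requires_grad=True)
y = torch.randn(5, 5, requires_grad=True)

z = x * y      # intermediate output
w = z.tanh()   # another intermediate output

# one hook per intermediate whose gradient we want to see
z.register_hook(save_grad('z'))
w.register_hook(save_grad('w'))

loss = w.sum()
loss.backward()

# grads now holds d(loss)/dz and d(loss)/dw
```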

@melgor it was giving NaNs with @fmassa’s previous implementation, but the solution posted now should be quite stable, even on the diagonal. To be safe, though, it is better to reconstruct it from the lower-triangular (TRIL) matrix.

@apaszke

Thanks, that makes sense and now I found the error.

@fmassa’s solution is not stable on the diagonal. The error happens in:
`D = diag + diag.t() - 2*r`

The diagonal here becomes zero (which is correct), but the gradient, for whatever reason, becomes NaN when we execute the following command:

`D = D.sqrt()`

Could this mean that the diagonal entries are slightly smaller than 0, so that when we take the square root they become NaNs? A cheap solution (which seems to work, for now) is to modify that line to:

`D = diag + diag.t() - 2*r + 1e-7`

though I am not sure whether that breaks anything else (I mean, the loss is decreasing, but I am not sure that all the computations are correct).
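The math supports the epsilon fix: even when a diagonal entry is exactly 0 (not negative), the derivative of sqrt blows up there, since d/dx sqrt(x) = 1/(2*sqrt(x)). A toy illustration (not the thread's actual code):

```python
import math

def sqrt_grad(d):
    """Derivative of sqrt at d: 1 / (2 * sqrt(d))."""
    return float('inf') if d == 0.0 else 0.5 / math.sqrt(d)

eps = 1e-7

# an exact zero on the diagonal -> infinite gradient, NaN in backprop
print(sqrt_grad(0.0))        # inf
# the epsilon keeps the gradient finite
print(sqrt_grad(0.0 + eps))  # ~1581, large but finite
```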

On a side note, if I want to normalize the X_similarity matrix, this doesn’t seem to work:

`X_similarity = (X_similarity - X_similarity.mean())/X_similarity.std()`

When I tried it in an experiment with Tensors it worked, but here, since X_similarity is a Variable, it does not.

@Ismail_Elezi yes, it could lead to NaN because of numerical instability.
I’d say that you don’t really need the `.sqrt()`; I added it to make the function comparable to yours. Also, adding a small epsilon shouldn’t be a problem.

About your second issue: it doesn’t work because the `mean` of a Tensor is a number, but for a Variable it’s a 1D Variable.
We don’t have broadcasting in pytorch yet, so you need to expand it by hand, as discussed in the Adding a scalar? thread.
If you need to insert dimensions into your tensor for that, use the `unsqueeze` function.
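For reference, the by-hand expansion looks roughly like this (a sketch; current PyTorch versions broadcast scalars, so the direct expression also works now):

```python
import torch

X = torch.randn(4, 4)
mu, sd = X.mean(), X.std()  # 0-dim tensors (1-element Variables at the time)

# expand the scalars by hand to X's shape, as was needed before broadcasting
Xn_manual = (X - mu.expand(X.size())) / sd.expand(X.size())

# the direct form, which broadcasts in current PyTorch
Xn = (X - mu) / sd
```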


@fmassa

`.sqrt` is needed just to make it a Euclidean distance, but anyway, adding the small epsilon seems to solve the problem (alternatively, @apaszke’s idea seems worth investigating).

For the second question, good to know. I am not sure that I even need it right now, but I’ll look into it if needed.

Thanks for everything man, you’re a lifesaver!

@apaszke BTW, what can I do in order to write a custom backward routine?


@edgarriba what’s a custom backward routine? If something is supported by autograd, you get backward for free. If not, you have to add a Function as described in the notes. Keep in mind that autograd may perform optimizations that assume the `backward` method computes a correct gradient - if you want to alter it in any way, use hooks.
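For completeness, "adding a Function as described in the notes" means subclassing `torch.autograd.Function`. A minimal sketch with the modern static-method API (a hand-written backward for `x**2`, purely illustrative):

```python
import torch

class Square(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # chain rule: d(x^2)/dx = 2x
        return grad_output * 2 * x

x = torch.randn(3, requires_grad=True)
y = Square.apply(x)
y.sum().backward()
# x.grad is now 2 * x
```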

@apaszke For example, I would like to write a custom forward/backward function for generating triplets. The flow is:

1. Forward pass the N images through the net to get features.
2. Based on the label information, generate M random/hard triplets (using features from the last layer of the network).
3. Feed these triplets to the loss function (e.g. triplet loss).
4. Since I generated the triplets and the network works on a single image at a time, I need to map the gradient from the loss (of size [M x Feature_Size x 3]) to the gradient at the network's output (of size [N x Feature_Size]).
5. Model.backward()
6. Run the optimization step.

I implemented this in Lua-Torch and there was no problem with it, because I had access to the gradient of the loss function and could manipulate it before feeding it to the model.

Do you think autograd will handle this, or do I need to write a custom backward pass?

@apaszke just to check that things are computed correctly, since so far I haven’t managed to get the triplet network to converge.

I think it should be possible to implement that using autograd.
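In particular, the [M x ...] to [N x Feature_Size] gradient mapping described above is exactly what autograd does when you index the feature matrix with the triplet indices: `backward()` accumulates the loss gradients back into the per-image features. A sketch (made-up shapes and indices, using `triplet_margin_loss` as a stand-in for the exact loss):

```python
import torch
import torch.nn.functional as F

N, D = 10, 8
features = torch.randn(N, D, requires_grad=True)  # stand-in for the net's output

# M = 4 triplets of (anchor, positive, negative) row indices into features
a = torch.tensor([0, 1, 2, 3])
p = torch.tensor([4, 5, 6, 7])
n = torch.tensor([8, 9, 0, 1])

loss = F.triplet_margin_loss(features[a], features[p], features[n])
loss.backward()

# the gradient lands back on the per-image features, shape [N x D]
```

Rows that appear in several triplets simply have their gradients accumulated, which is the same mapping that had to be done by hand in Lua-Torch.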