Build your own loss function in PyTorch

@melgor it had been giving NaNs in @fmassa's previous implementation, but the solution posted now should be quite stable, even on the diagonal. But to be sure, it is safer to reconstruct it from the TRIL matrix.


Thanks, that makes sense and now I found the error.

@apaszke and @melgor

@fmassa's solution is not stable on the diagonal. The error happens in:
D = diag + diag.t() - 2*r

the diagonal here becomes zero (which is correct), but the gradient for whatever reason becomes NaN when we do the following command:

D = D.sqrt()

Could this mean that the diagonal entries are slightly smaller than 0 and then when we find the square root, they become NaNs? A cheap solution (which seems to work, for now) is to modify that line to:

D = diag + diag.t() - 2*r + 1e-7

though I am not sure whether that breaks anything else (I mean, the loss is decreasing, but I am not certain that all the computations are correct).
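A minimal sketch of the fix, assuming `X` is an `N x D` feature matrix (the variable names here are illustrative, not taken from the original code):

```python
import torch

# Sketch of the pairwise-distance computation discussed above;
# X, r and sq are illustrative names, not the original code.
X = torch.randn(4, 3)

r = X @ X.t()                  # Gram matrix, [N x N]
sq = r.diag().unsqueeze(0)     # squared norms, [1 x N]
D2 = sq + sq.t() - 2 * r       # squared distances; diagonal ~ 0

# Clamping at zero and adding a small epsilon before sqrt keeps
# both the values and the gradient finite on the diagonal.
D = (D2.clamp(min=0) + 1e-7).sqrt()
```

The clamp guards against tiny negative values from floating-point round-off, and the epsilon keeps the gradient of `sqrt` (which blows up at 0) finite.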

On a side note, if I want to normalize the X_similarity matrix, this doesn’t seem to work:

X_similarity = (X_similarity - X_similarity.mean())/X_similarity.std()

When I tried it in an experiment with plain tensors it worked, but here X_similarity is a Variable, and it does not work.

@Ismail_Elezi yes, it could lead to nan because of numerical instability.
I’d say that you don’t really need the .sqrt(), I added it to make the function comparable to yours. Also, adding a small epsilon shouldn’t be a problem.

About your second issue, it doesn’t work because the mean of a Tensor is a number, but in a Variable it’s a 1D Variable.
We don’t yet have broadcasting in pytorch, so you need to expand it by hand, as discussed in the Adding a scalar? thread.
If you need to insert dimensions to your tensor for that, use the unsqueeze function.
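For reference, the expand-by-hand version looks roughly like this (current PyTorch versions do broadcast, so the one-liner from the question also works there):

```python
import torch

X_similarity = torch.randn(5, 5)

# Pre-broadcasting style: expand the scalar statistics by hand.
mean = X_similarity.mean().expand_as(X_similarity)
std = X_similarity.std().expand_as(X_similarity)
normalized = (X_similarity - mean) / std

# With broadcasting (modern PyTorch), this is equivalent:
normalized2 = (X_similarity - X_similarity.mean()) / X_similarity.std()
```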



.sqrt is needed just to make it a Euclidean distance, but anyway, adding the small epsilon seems to solve the problem (alternately, @apaszke’s idea seems worth investigating).

For the second question, good to know it. I am not sure that I even need it right now, but I’ll look into it if needed.

Thanks for everything man, you’re a lifesaver!

@apaszke BTW, what can I do in order to write a custom backward routine?


@edgarriba what’s a custom backward routine? If something is supported by autograd, you have backward for free. If not, you have to add a function as described in the notes. Keep in mind that autograd may do some optimizations that assume that backward method computes a correct gradient - if you want to mess it up in any way, use hooks.

@apaszke For example I would like to write a custom forward/backward function for generating triplets. The flow is:

  1. Forward pass the N images through the net to get features.
  2. Based on the label information, generate M random/hard triplets (using features from the last layer of the network).
  3. Feed these triplets to a loss function (e.g. triplet loss).
  4. Loss.backward() returns the gradient.
  5. Since I generate the triplets myself and the network works on single images, I need to map the gradient from the loss (of size [M x Feature_Size x 3]) back to the gradient fed into the network (of size [N x Feature_Size]).
  6. Model.backward()
  7. Run the optimization step.

I implemented this in Lua-Torch and there was no problem with it, because I had access to the gradient of the loss function and could manipulate it before feeding it to the model.

Do you think autograd will handle this, or do I need to write a custom backward pass?

@apaszke just to check that things are calculated correctly, since so far I have not managed to get the triplet network to converge

I think it should be possible to implement that using autograd.
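One way to sketch this with autograd (the indices below are hypothetical, standing in for triplets generated from labels): indexing the feature matrix keeps the graph intact, so `loss.backward()` accumulates the per-triplet gradients back into the `[N x Feature_Size]` features automatically, with no manual mapping step.

```python
import torch
import torch.nn.functional as F

N, D = 6, 4
features = torch.randn(N, D, requires_grad=True)  # stand-in for the net output

# Hypothetical triplet indices (anchor, positive, negative),
# normally generated from the label information.
a_idx = torch.tensor([0, 1])
p_idx = torch.tensor([2, 3])
n_idx = torch.tensor([4, 5])

loss = F.triplet_margin_loss(features[a_idx], features[p_idx],
                             features[n_idx], margin=1.0)
loss.backward()  # features.grad has shape [N x D]
```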

Hi Adam,

If we were to define our own loss function, and during the middle there’s a sort and index_copy_ action that requires Tensor instead of Variable. Seems like unpacking data is unavoidable. Do you have any suggestions? Thanks.

Hi, can you give an example?

For examples, see the posts above.

Hi Adam,
You mentioned that autograd will work if I only use Variables without any .data unpacking to retrieve the underlying Tensor. Is this also the case if I call .data on a Variable to get the tensor in order to perform some ancillary computation (such as calculating a random value conditioned on the feature vector) but the final output computation is done on the Variable itself? Will autograd work then or will the call to .data mess with the computation graph?

It’s all fine, as long as you don’t re-wrap the .data in a new Variable, and expect it to propagate properly to the input/weight (it won’t - the graph will be broken). If you just want to take the tensor, and compute stats, etc. but not differentiate it, then you’re good.
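A small sketch of the distinction (using `.detach()`, the modern equivalent of reading `.data`):

```python
import torch

x = torch.randn(3, requires_grad=True)
y = x * 2

# Safe: ancillary computation on detached values; autograd ignores it.
scale = y.detach().abs().mean()

# The final output is still computed on the tracked tensor, so
# gradients flow back to x (with scale treated as a constant).
(y * scale).sum().backward()
```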


I also tried to write my own loss function, but it fails. The error is “float division by zero”… So sad, I don’t know why.

@Gwan-Siu maybe you are dividing by zero somewhere. Look at your division operations in your loss function and add a small value to the denominator, like 1e-8
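For example, something along these lines (the loss itself is made up for illustration):

```python
import torch

pred = torch.zeros(3, requires_grad=True)  # worst case: everything zero
eps = 1e-8

# Without eps, this denominator would be exactly zero.
loss = pred.sum() / (pred.abs().sum() + eps)
```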


Hey there… I have a quick question regarding this.If I pass in a variable x that has two columns

def forward(self, x):
    a = torch.exp(x[:, 0])
    b = torch.nn.Softplus()(x[:, 1])
    activation =, 1), b.view(-1, 1)), 1)
    return activation

Since I am slicing the tensor, won’t the gradients fail to be calculated with respect to the input?


They will be, w.r.t. the rows that propagate gradients back. The other rows will just have zero gradients.


Hello! The similarity loss you described is interesting. Which paper proposed the loss? Or it’s your own idea?
Thank you!

For learning purposes, can we just have a simple example with a thorough explanation of how to write our own loss function? :slight_smile:
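A minimal end-to-end sketch (a mean-squared-error loss written from scratch; any differentiable tensor expression works the same way):

```python
import torch

def my_mse_loss(pred, target):
    # Plain tensor ops: autograd derives the backward pass for free.
    return ((pred - target) ** 2).mean()

pred = torch.randn(5, requires_grad=True)
target = torch.randn(5)

loss = my_mse_loss(pred, target)
loss.backward()  # pred.grad now holds 2 * (pred - target) / 5
```

As long as the loss is built from tensor operations (no `.data`/`.detach()` unpacking on the path to the output), no custom backward is needed.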