@Ismail_Elezi because the targets are constants, you don’t need to compute the gradients through Y. You can say that your effective target is actually Y, and not y, and Y does not require gradient (requires_grad=False).
So all you need to do is perform the operations that convert y into Y without using Variables, and then wrap the resulting Y in a Variable.
For reference, here is an implementation of convert_y that does not require a for loop and can be computed efficiently.
s = y.size(0)
y_expand = y.unsqueeze(0).expand(s, s)
Y = y_expand.eq(y_expand.t())
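As a quick sanity check of the snippet above, here is a minimal sketch on a toy label vector. It assumes a recent PyTorch release, where plain tensors replace the Variable wrapper used in this thread; the label values are made up for illustration.

```python
import torch

# Toy labels (hypothetical values, only for illustration).
y = torch.tensor([0, 1, 0])

s = y.size(0)
# Repeat the label vector along a new first dimension: shape (s, s).
y_expand = y.unsqueeze(0).expand(s, s)
# Y[i, j] is true exactly when samples i and j share a label.
Y = y_expand.eq(y_expand.t())

print(Y.long())
# The target is a constant, so no gradient is tracked for it.
print(Y.requires_grad)
```

Samples 0 and 2 share a label here, so the corresponding off-diagonal entries of Y are set, along with the whole diagonal.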
I think I still need to fully understand how these functions work (read about them in detail), but the code runs now. Of course, the ANN isn’t training properly (a lot of NaNs immediately after the first iteration), but that is something I need to investigate by looking at the gradients’ values.
I am using the version written in this thread, and it gives NaNs after the first iteration. Typically, when I got NaNs in the past, it was either an error in differentiation or a learning rate that was too large. Neither can be the cause this time, because differentiation is automatic and I am using an extremely small learning rate (3e-7 in Adam, while I typically use 3e-4).
Anyway, I will have to read the documentation to find out how to print the gradients, which might give me an idea of what is going wrong. And then, if I have problems, I will start another thread.
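import torch
from torch.autograd import Variable

x = Variable(torch.randn(5, 5), requires_grad=True)
y = Variable(torch.randn(5, 5), requires_grad=True)
z = x + y
# this will work only in Python3
z.register_hook(lambda g: print(g))
# if you're using Python2 do this:
# def print_grad(g):
#     print g
# z.register_hook(print_grad)
q = z.sum()
q.backward()  # triggers the hook, which prints the gradient w.r.t. z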
This will print the gradient w.r.t. z at each backward.
Great discussion about defining your own loss function. I would love to see a simple example of creating a custom loss.
I’m also confused about how to create a loss function in torch. Like the original poster, I do not really understand what we need to do or which functions we can use if we want to define our own loss function. It would be nice to have a complete example for a “Similarity-Matrix” loss or something like a “Triplet-Loss”.
About the NaNs in your results: they are related to the loss function. I implemented something similar in TensorFlow and I got NaNs too. Based on my experiments, it looks like the gradient on the diagonal of the similarity matrix causes the NaNs. I modified the loss to skip the diagonal (taking just the upper-right triangle of the matrix, since the lower-left one is the same).
Having a tensor filled with ones is expected in my example, because that’s the gradient w.r.t. z. You can register the hook on any Variable whose gradient you want to inspect. If you want the gradients of everything, you will need to register a hook on every intermediate output.
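To illustrate registering hooks on several intermediate outputs, here is a minimal sketch that stores each gradient in a dictionary instead of printing it. It assumes a recent PyTorch release (tensors with requires_grad rather than Variable); the names grads, save_grad and w are hypothetical and not part of the example above.

```python
import torch

grads = {}  # hypothetical container: name -> gradient seen by the hook


def save_grad(name):
    """Build a hook that records the gradient under the given name."""
    def hook(g):
        grads[name] = g.clone()
    return hook


x = torch.randn(5, 5, requires_grad=True)
y = torch.randn(5, 5, requires_grad=True)
z = x + y   # first intermediate output
w = z * 2   # second intermediate output

# One hook per intermediate whose gradient should be inspected.
z.register_hook(save_grad("z"))
w.register_hook(save_grad("w"))

w.sum().backward()
print(grads["w"])  # gradient of the sum w.r.t. w: all ones
print(grads["z"])  # gradient w.r.t. z: all twos, because w = 2 * z
```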
@melgor it was giving NaNs with @fmassa’s previous implementation, but the solution posted now should be quite stable, even on the diagonal. But to be sure, it is safer to reconstruct it from the TRIL matrix.
@fmassa’s solution is not stable on the diagonal. The error happens in: D = diag + diag.t() - 2*r
The diagonal here becomes zero (which is correct), but the gradient for whatever reason becomes NaN when we run the following command:
D = D.sqrt()
Could this mean that the diagonal entries are slightly smaller than 0 and then when we find the square root, they become NaNs? A cheap solution (which seems to work, for now) is to modify that line to:
D = diag + diag.t() - 2*r + 1e-7
though I am not sure whether that breaks anything else (the loss is decreasing, but I am not sure that all the computations are correct).
On a side note, if I want to normalize the X_similarity matrix, this doesn’t seem to work:
@Ismail_Elezi yes, it could lead to NaN because of numerical instability.
I’d say that you don’t really need the .sqrt(), I added it to make the function comparable to yours. Also, adding a small epsilon shouldn’t be a problem.
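One plausible way to see the instability, sketched here under the assumption of a recent PyTorch release: the derivative of sqrt(d) is 1 / (2 * sqrt(d)), which is infinite at d = 0, and an infinite gradient multiplied by a zero local gradient further down the chain becomes NaN. The toy tensors below are illustrative stand-ins, not the actual distance matrix D from this thread.

```python
import torch

# The gradient of sqrt is infinite exactly at zero.
d = torch.tensor([0.0, 4.0], requires_grad=True)
d.sqrt().sum().backward()
print(d.grad)  # first entry is inf, second is 0.25

# Chained with a zero local gradient, inf * 0 turns into NaN.
x = torch.tensor([0.0], requires_grad=True)
(x * x).sqrt().sum().backward()
print(x.grad)

# With a small epsilon (1e-7, as suggested in the thread) it stays finite.
x_eps = torch.tensor([0.0], requires_grad=True)
(x_eps * x_eps + 1e-7).sqrt().sum().backward()
print(x_eps.grad)
```

The epsilon shifts the zero entries slightly away from the point where the gradient blows up, which is consistent with the loss decreasing again after that change.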
About your second issue, it doesn’t work because the mean of a Tensor is a number, but the mean of a Variable is a 1D Variable.
We don’t yet have broadcasting in PyTorch, so you need to expand it by hand, as discussed in “Adding a scalar?”.
If you need to insert dimensions to your tensor for that, use the unsqueeze function.
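Expanding by hand could look like the following minimal sketch. It assumes a recent PyTorch release, where mean() returns a zero-dimensional tensor and mean(1) drops the reduced dimension; the matrix X is a hypothetical stand-in, not the X_similarity from this thread.

```python
import torch

# Stand-in matrix (hypothetical values, only for illustration).
X = torch.tensor([[1.0, 2.0],
                  [3.0, 5.0]])

# Global mean: reshape to (1, 1), then expand to X's shape before dividing.
m = X.mean().view(1, 1).expand_as(X)
X_global = X / m

# Per-row mean: unsqueeze inserts the missing dimension, giving shape (2, 1),
# which expand_as then stretches to (2, 2).
row_mean = X.mean(1).unsqueeze(1).expand_as(X)
X_rows = X / row_mean

print(X_global)
print(X_rows)
```

Recent releases broadcast automatically, so the explicit expand is only needed on versions like the one discussed here.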
@edgarriba what’s a custom backward routine? If something is supported by autograd, you have backward for free. If not, you have to add a function as described in the notes. Keep in mind that autograd may do some optimizations that assume that the backward method computes a correct gradient - if you want to mess it up in any way, use hooks.
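For the case where an operation is not covered by autograd, a minimal sketch of a custom function might look like this. It uses the static-method form of torch.autograd.Function found in recent PyTorch releases, which may differ from the interface described in the notes at the time of this thread; the Square class is a hypothetical example.

```python
import torch


class Square(torch.autograd.Function):
    """Hypothetical example: y = x * x with a hand-written backward."""

    @staticmethod
    def forward(ctx, x):
        # Keep the input around, since the gradient depends on it.
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Chain rule: dL/dx = dL/dy * dy/dx, with dy/dx = 2 * x.
        return grad_output * 2 * x


x = torch.tensor([3.0], requires_grad=True)
y = Square.apply(x)
y.sum().backward()
print(x.grad)  # 2 * 3 = 6
```

The backward here returns the true gradient, in line with the warning above that autograd may rely on it being correct.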