When is autograd available and is it always accurate?

I’ve been trying to wrap my head around autograd for a while now. From what I’ve learned about automatic differentiation in class, it is possible to compute the gradient of functions written with code even when doing overwrites, for loops, etc… Additionally, the autograd tutorial states:

The autograd package provides automatic differentiation for all operations on Tensors.

However, I understand that not everything is differentiable (e.g. a black-box function).

I’m currently reimplementing in PyTorch a neural network with a layer that performs a lookup operation in a table with an index derived from a prediction. That is, some floating-point input is rounded and cast as an integer in order to be used as an array index. Here is a mockup of what could be the forward pass of this layer:

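    import torch
    from torch.nn import Module

    class LookupLayer(Module):
        def __init__(self):
            super().__init__()
            self.lookup_array = torch.arange(10)

        def forward(self, x):
            # Round to the nearest integer and cast so it can be used as an index
            index = torch.round(x).long()
            # Clamp the index into the valid range of the table
            index = torch.clamp(index, 0, len(self.lookup_array) - 1)
            return self.lookup_array[index]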

The grad attribute of the parameters of the model for all layers above this one is None, which I assume is due to autograd being unable to compute the derivative of this layer. That makes sense to me, as I don’t think that int() has a derivative.

The author of the paper I am reproducing provides the backward pass implementation of the layer and so I believe that I should now put my layer as a subclass of torch.autograd.Function rather than torch.nn.Module and reimplement the backward pass there myself. Please correct me if I am wrong.
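
For reference, here is a minimal sketch of what such a torch.autograd.Function subclass might look like. The names LookupFunction and table are made up for illustration, and the backward pass is only a straight-through placeholder (it passes the incoming gradient through unchanged); it is an assumption and not the backward pass from the paper.

```python
import torch


class LookupFunction(torch.autograd.Function):
    """Table lookup with a hand-written backward pass (illustrative only)."""

    @staticmethod
    def forward(ctx, x, table):
        # Round to the nearest valid index; autograd alone would see a
        # zero gradient almost everywhere for this step.
        index = torch.round(x).long().clamp(0, table.numel() - 1)
        return table[index]

    @staticmethod
    def backward(ctx, grad_output):
        # ASSUMPTION: straight-through placeholder, NOT the paper's backward.
        # One return value per forward input; the table receives no gradient.
        return grad_output, None


x = torch.tensor([0.2, 3.7, 12.0], requires_grad=True)
table = torch.arange(10, dtype=torch.float32)

y = LookupFunction.apply(x, table)
y.sum().backward()

print(y)       # looked-up values
print(x.grad)  # whatever backward() above returns for x
```

With this structure, the function would be called through LookupFunction.apply(...) inside the forward of a regular torch.nn.Module, so the module and the custom function would coexist rather than one replacing the other.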

My questions are:

  • Is it trivial to tell if autograd is going to be able to derive the function or not?

  • If the forward pass is correct and if autograd is able to compute a gradient, is this gradient always correct?

I’ve watched this lecture to try to find answers to these questions, but I could not quite understand everything. If you have more resources I would be happy to have them.

That is, some floating-point input is rounded and cast as an integer in order to be used as an array index

I think the resulting gradient is 0 in that case. I think that even something less “exotic”, such as rounding the logits to get a prediction and then computing the MSE against a target label, would give you a gradient of 0.
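
A quick way to check this is a small sketch like the one below. The tensor values are arbitrary and chosen only for illustration.

```python
import torch

logits = torch.tensor([0.3, 1.6, 2.2], requires_grad=True)
target = torch.tensor([0.0, 1.0, 2.0])

# Rounding is piecewise constant, so its derivative is zero wherever it exists.
pred = torch.round(logits)
loss = torch.nn.functional.mse_loss(pred, target)
loss.backward()

print(loss.item())   # non-zero loss
print(logits.grad)   # all zeros: no learning signal reaches the logits
```

The loss itself is non-zero, but nothing useful flows back to the logits.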

From what I’ve learned about automatic differentiation in class, it is possible to compute the gradient of functions written with code even when doing overwrites

I am by no means an expert in how autodiff is implemented, but I think I heard somewhere that everything in a computer should be differentiable because you can decompose everything into atomic operations (basically additions). But I am not sure that it’s implemented like that in the autodiff submodule. I have actually no idea how it’s implemented in PyTorch’s submodule, but if I had to speculate, based on what I’ve seen, it’s really manually defined for certain operations.

Hi,

For your first question: no, there is no way at the moment. But it is usually quite “obvious” which functions won’t give you useful gradients. In particular, functions that are piecewise constant (rounding, indexing, etc.) will give a gradient that is zero almost everywhere and so won’t be useful.

For your second question: the autograd engine is based on the chain rule:

If f and g are two differentiable functions, and you compute y = f(x) and z = g(y), then dz/dx = dz/dy * dy/dx.

You can apply this multiple times until each derivative is only computed for an “elementary” function.
Then you simply need to implement the derivatives of these elementary functions and you get the derivative for the whole succession of functions.
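
As a small illustration of the idea, the sketch below compares what autograd returns with the chain rule applied by hand. The choice of f = sin and g = squaring is arbitrary.

```python
import torch

x = torch.tensor(0.5, requires_grad=True)

y = torch.sin(x)   # y = f(x), elementary derivative dy/dx = cos(x)
z = y ** 2         # z = g(y), elementary derivative dz/dy = 2 * y
z.backward()

# Chain rule by hand: dz/dx = dz/dy * dy/dx
x0 = x.detach()
manual = 2 * torch.sin(x0) * torch.cos(x0)

print(x.grad, manual)  # the two values should agree
```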

So the gradients computed by PyTorch will always be correct as long as the condition above (“f and g are differentiable”) holds. If it does not, things can break (or not break) in many unexpected ways and hell ensues :smiling_imp:
In practice it should behave properly in these cases as well, but you can find single points where the computed gradient will be wrong (and we cannot detect that automatically…)
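
One commonly used illustration of such a single point is sketched below. The function f is made up for this example: mathematically it is just the identity, so its derivative is 1 everywhere, but the code takes a different branch at exactly x = 0.

```python
import torch


def f(x):
    # Mathematically f(x) = x for every x, including x = 0.
    return torch.where(x == 0, torch.zeros_like(x), x)


grads = {}
for value in (0.0, 1.0):
    x = torch.tensor(value, requires_grad=True)
    f(x).backward()
    grads[value] = x.grad.item()

print(grads)  # gradient at 1.0 is 1.0, but at 0.0 autograd reports 0.0
```

At x = 0 the output comes from the constant branch, so autograd differentiates that branch and reports 0 even though the true derivative is 1.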

Do I understand correctly that even if I didn’t specify a backward function, autograd would be able to provide derivatives w.r.t. the inputs of that function?