Autograd Question

Let’s say I have a model with two fully connected layers. In the forward method, I am arbitrarily changing the output tensor of the first layer by squaring it.

I want to understand how autograd will work in this case. Will this transformation affect the learned weight vectors? Ideally it should. And what if I don’t want this transformation to have an effect on the gradients?

I am trying to reason along the following lines:

  • The out tensor itself is not a parameter, so it is not going to get updated. But autograd will compute the gradient of out*out w.r.t. out to propagate the gradients further back, right? (See the quick check after the code below.)

  • The transformation out = out*out has no parameters of its own.

  • What if I want to apply a transformation but don’t want it to affect the gradients of my previous layer’s parameters? How can I achieve that?

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()

        self.fc1 = nn.Linear(10, 20)
        self.fc2 = nn.Linear(20, 10)

    def forward(self, inp):
        out = self.fc1(inp)

        # transform the out vector
        out = out * out

        out = self.fc2(out)

        return out
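For reference, a quick way to inspect what autograd records for the squaring step (a minimal sketch using the sizes from the model above):

import torch

model = NeuralNetwork()          # the model defined above
inp = torch.randn(1, 10)

out = model.fc1(inp)
print(out.requires_grad)         # True: out depends on fc1's weight and bias
print((out * out).grad_fn)       # <MulBackward0 ...>: the squaring is recorded in the graph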

I’m not sure this will work, but could you please try:

with torch.no_grad():
    out = out * out
out = self.fc2(out)

Let me know if this solves your problem. In principle, operations performed under torch.no_grad() are not recorded by autograd, so no gradients are computed for them.
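To illustrate (a standalone sketch, not tied to your exact model):

import torch

x = torch.randn(3, requires_grad=True)
with torch.no_grad():
    y = x * x
print(y.requires_grad)   # False: the multiplication was not recorded
print(y.grad_fn)         # None: nothing can flow back to x through y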

It sounds like what you want to do is .detach() one of the arguments to the squaring operation, so that the gradient only propagates to out once. Try:

out = out * out.detach()
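A tiny numeric check of what the detach changes (a scalar sketch instead of your layer output):

import torch

x = torch.tensor([3.0], requires_grad=True)

y = x * x              # both factors tracked: dy/dx = 2*x = 6
y.backward()
print(x.grad)          # tensor([6.])

x.grad = None
y = x * x.detach()     # second factor treated as a constant: dy/dx = x = 3
y.backward()
print(x.grad)          # tensor([3.])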

I am unable to wrap my head around how backpropagation will work after doing this. I mean, the gradient w.r.t. the parameters in FC layer 1 will depend on the value of out, but the value of out has changed.

Basically, W (a parameter in FC layer 1) has gradient dLoss/dW = dLoss/dOut * dOut/dW. Now, which out value will be used for this: the original out or out = out*out?

If I understand correctly, if you add the with torch.no_grad() line, the original out value is the one that will be used to compute the gradients, since the operation out * out will not be added to the computational graph. Again, I am not 100% sure about this. Maybe @ptrblck can offer his always-valuable insight. Best of luck.

As I understand it, you would like to change the intermediate output in some way (I’ll call it changed_out) and you do not want the gradient to account for that operation. Also, changed_out becomes the input for the other layers down the line.

I think it will lead to disconnected networks (network1 --> intermediate operation --> network2). Unless network1 receives a gradient with respect to its output, it is not going to be trained. I am not sure what use case requires such an operation; if you could explain the use case a bit more, you would get more valuable answers.
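To make the disconnection concrete, here is a sketch with the model from the first post and a made-up scalar loss; only fc2 ends up with gradients:

import torch

model = NeuralNetwork()              # model from the first post
inp = torch.randn(1, 10)

out = model.fc1(inp)
with torch.no_grad():
    out = out * out                  # the graph is cut here
out = model.fc2(out)

out.sum().backward()                 # made-up scalar loss
print(model.fc1.weight.grad is None) # True: fc1 receives no gradient
print(model.fc2.weight.grad is None) # False: fc2 is still trained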

I want to measure the accuracy of the network after transforming the previous layer’s outputs. So I am trying to create a new layer, but for that I need to write a backward function.

If you do not want the previous layers to be learned again under the transformation, and you are willing to freeze the previous layers’ weights, then you do not need to think about a backward function.
If not, I do not see a way of getting around a backward function as of now.
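If freezing is acceptable, a minimal sketch of how that could look (assuming model is an instance of the NeuralNetwork above):

import torch

# freeze the first layer so the transformation cannot influence its weights
for param in model.fc1.parameters():
    param.requires_grad = False

# only pass the still-trainable parameters to the optimizer
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=0.01)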

I actually do not want to do this while training; I just want to try some transformation during inference. I have a function Transformation which takes a numpy.ndarray and returns one. I just want to update the data of the previous_layer_out tensor using this Transformation function. Should I just do this:

out = Transformation(previous_layer_out.numpy())
previous_layer_out = torch.from_numpy(out)

OK. If you only need it for inference, yes, that should be enough. What error are you facing?
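One caveat: calling .numpy() on a tensor that still requires grad raises an error. Something like this (a sketch, assuming model and inp as in the earlier snippets, plus your Transformation function) keeps the whole step safely outside autograd during inference:

import torch

with torch.no_grad():
    prev = model.fc1(inp)
    arr = Transformation(prev.cpu().numpy())                  # your numpy-in, numpy-out function
    prev = torch.from_numpy(arr).to(prev.device, prev.dtype)  # back to a tensor of the same device/dtype
    out = model.fc2(prev)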