I need to have trainable variables in the backward hooks of a feed-forward net. More specifically, I am manipulating the gradients coming from multiple branches (tasks) of a single net. My apologies for not being able to give more information. In a simple example, I tried creating a variable with its own optimizer (opt2), so that I can update it independently of the net parameters tied to opt1, but the loss gives None gradients for my variable (the one in opt2). I searched for help but couldn't find any; please point me to any useful discussions.
I see that the variable does not participate in the graph (the forward operation). If this is why the gradient is not computed, how can I formulate my objective as a direct function of these special variables?
Thanks in advance.
If the variable is not used during the forward pass, then it is expected that its gradient will be 0. Can you write down how this variable affects your loss?
Thanks for replying. Yes, I see what you mean. Since my variable only affects the backward operation, it is not used in the forward pass. Say the following is my forward operation:
self.c1 = self.pool(F.relu(self.conv1(x)))
self.c2 = self.pool(F.relu(self.conv2(self.c1)))
self.c2 = self.c2.view(-1, 16 * 5 * 5)
self.f1 = F.relu(self.fc1(self.c2))
self.f2 = F.relu(self.fc2(self.f1))
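# Integer division (//) keeps the slice index an int under Python 3.
half = self.f2.size(0) // 2
# First half of the batch goes to fc3; its incoming gradient is scaled by alpha.
self.c_out = self.fc3(self.f2[:half, :])
h_c = self.c_out.register_hook(lambda grad: grad * alpha)
# Second half goes to fc4; its incoming gradient is scaled by (1 - alpha).
self.f_out = self.fc4(self.f2[half:, :])
h_f = self.f_out.register_hook(lambda grad: grad * (1.0 - alpha))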
How can I formulate an objective (classification loss) as a function of alpha? Please let me know if you need more info. Thanks!
If your classification loss is some cross entropy, then it’s only based on the output of the network and the ground truth labels. It is not a function of the gradients and so independent of your alpha.
Your alpha only changes the gradients, and your classification loss is only changed by the output. So the two look quite independent to me.
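A small check of this point, as a sketch rather than the original poster's code: the weight `w`, the input `x`, and the squared-error loss are made-up stand-ins, and only the hook pattern matches the thread. Because `alpha` appears only inside the backward hook, autograd never records it in the graph, so it should end up with no gradient at all.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-ins for the thread's setup (not from the original post).
w = torch.randn(3, requires_grad=True)          # a "net parameter"
alpha = torch.tensor(0.3, requires_grad=True)   # the hook-only variable
x = torch.randn(3)

out = (w * x).sum()
# alpha is used only here, in the backward hook, never in the forward pass.
out.register_hook(lambda grad: grad * alpha)

loss = (out - 1.0) ** 2
loss.backward()

print(w.grad)      # the usual gradient, scaled by alpha
print(alpha.grad)  # no gradient is accumulated for alpha
```

The hook runs during `backward()`, where graph recording is off by default, so the multiplication by `alpha` rescales `w.grad` but leaves nothing to differentiate with respect to `alpha`.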
True. However, the resulting gradients determine the net parameters. So, if I want to find the optimal combination of the two gradients (of c_out and f_out), as opposed to the plain accumulation at f2, I should be able to pose it as minimizing the classification loss achieved by the resulting parameters (after the scaled backprop operation). In other words, I want to compute the gradient for alpha during the backward operation. I admit it's still a bit loopy in my mind as of now.
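One possible way to make this idea concrete, offered as a sketch under assumptions and not as an established answer from this thread: move alpha out of the hook and into an explicit "virtual" SGD step, so that the loss after the step becomes a differentiable function of alpha. The toy linear model, the two losses `loss_c` and `loss_f`, the step size `lr`, and the choice of summing both task losses as the lookahead objective are all hypothetical.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy problem: one shared weight vector feeding two task losses.
w = torch.randn(4, requires_grad=True)
alpha = torch.tensor(0.5, requires_grad=True)
x_c, y_c = torch.randn(8, 4), torch.randn(8)
x_f, y_f = torch.randn(8, 4), torch.randn(8)
lr = 0.1  # assumed step size of the virtual update

def loss_c(weights):
    return ((x_c @ weights - y_c) ** 2).mean()

def loss_f(weights):
    return ((x_f @ weights - y_f) ** 2).mean()

opt2 = torch.optim.SGD([alpha], lr=0.01)  # separate optimizer for alpha

# Per-task gradients with respect to the shared weights (constants w.r.t. alpha).
g_c, = torch.autograd.grad(loss_c(w), w)
g_f, = torch.autograd.grad(loss_f(w), w)

# Virtual update: alpha now takes part in a forward computation.
w_next = w.detach() - lr * (alpha * g_c + (1.0 - alpha) * g_f)

# Loss reached by the updated weights, differentiated with respect to alpha.
lookahead = loss_c(w_next) + loss_f(w_next)
opt2.zero_grad()
lookahead.backward()
alpha_grad = alpha.grad.clone()  # keep a copy before the optimizer step
opt2.step()

print(alpha_grad)
```

In this sketch the real weight update would still be done separately (for example by opt1), and in practice the lookahead loss would more plausibly be evaluated on held-out data. Nothing here constrains alpha to stay in [0, 1]; that would need an extra step such as clamping.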