Neat way of temporarily disabling grads for a model?

I’m implementing a version of DDPG, and trying to calculate the policy loss.

# TODO: Should find grad for states, policy_actions but treat q_model params as constant
policy_loss = -self.q_model(states, policy_actions).mean()

I can do this by looping through the q_model params, storing their current requires_grad values, setting requires_grad to False, calculating policy_loss, then restoring them afterwards.

orig_requires_grads = [p.requires_grad for p in self.q_model.parameters()]
for p in self.q_model.parameters(): p.requires_grad = False
policy_loss = -self.q_model(states, policy_actions).mean()
for p, rg in zip(self.q_model.parameters(), orig_requires_grads): p.requires_grad = rg

This feels clumsy. Is there a better way to temporarily disable a model like this? I looked for something like self.q_model.detach()(states, policy_actions).mean() but was surprised to find it doesn’t exist. Obviously I could make a helper.

I plan on moving this model over to TorchScript at some point, and I’m not sure whether this approach will work there.

There are some old threads that indicate that manual looping is the best way, but I want to check in case any developments have happened here.

Hi,

I don’t think there is any update. The for loop is simple and is the most efficient thing that can be done here. In particular, a generic utility would struggle with your extra handling of parameters that already don’t require gradients.

Note that you can add a method to your q_model module yourself that does this, which makes it a bit cleaner.
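
For example, a small context manager does the same job as such a method (a rough, untested sketch; frozen is just a name I picked):

from contextlib import contextmanager

@contextmanager
def frozen(module):
    # Remember the original flags, freeze every parameter, restore on exit
    orig = [p.requires_grad for p in module.parameters()]
    for p in module.parameters():
        p.requires_grad = False
    try:
        yield module
    finally:
        for p, rg in zip(module.parameters(), orig):
            p.requires_grad = rg

# Gradients still flow into policy_actions, but not into q_model's parameters
with frozen(self.q_model):
    policy_loss = -self.q_model(states, policy_actions).mean()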

Thanks for letting me know. Yeah, what I’ve got for now is a with_grad param in the forward method of my q_model that does what I showed. I wasn’t sure if there was a way to prevent the grad from propagating through the graph without setting state like that, but it seems like there isn’t, so this will have to do. I would still prefer a declarative rather than procedural API here, but it’s obviously not important.

Well, if you don’t want any gradients flowing at all, you can use torch.no_grad(). But that means that things above won’t get gradients either.
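
For example, with the snippet from your first post (illustrative only):

with torch.no_grad():
    q_value = self.q_model(states, policy_actions)
# q_value.requires_grad is False and it has no grad_fn, so nothing can be
# backpropagated through it, not even to policy_actions.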


In this case, the arguments given to the model should have their grads updated, but the model’s own parameters shouldn’t. None of the no_grad mechanisms allow for that level of specificity, do they?

To explain the context: in DDPG we have a Q model, which learns to predict rewards for state/action pairs, and a policy model, which learns to choose an action for each state. The Q model is trained from observed rewards. The policy model is trained by feeding its output into the Q model and doing gradient ascent to tweak the policy in the direction of a larger predicted reward. Both updates happen side by side in the same training loop.

The tricky bit is feeding the policy actions into the Q model to do the gradient ascent without also training the Q model. The requires_grad approach works, but doing it in a more declarative way would need some way to tell autograd to only compute grads back through a specific tensor, something like with torch.no_grad(except=policy_actions):
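
For concreteness, here is a stripped-down sketch of the two updates side by side (the optimizers, F.mse_loss, and targets are simplified stand-ins for my real code):

import torch.nn.functional as F

# Q model (critic) update: fit predicted values to observed targets
q_loss = F.mse_loss(self.q_model(states, actions), targets)
self.q_optimizer.zero_grad()
q_loss.backward()
self.q_optimizer.step()

# Policy (actor) update: gradient ascent on the predicted reward, with the
# Q model's parameters treated as constants
policy_actions = self.policy_model(states)
for p in self.q_model.parameters():
    p.requires_grad = False
policy_loss = -self.q_model(states, policy_actions).mean()
self.policy_optimizer.zero_grad()
policy_loss.backward()  # grads reach policy_model through policy_actions
self.policy_optimizer.step()
for p in self.q_model.parameters():
    p.requires_grad = True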


Oops. It doesn’t particularly matter for the discussion, but there’s a little bug in the comment in my first post: it should only pass the grads back through policy_actions, not through states too. That’s a mistake that slipped in when simplifying the code to post here, and apparently the forum won’t let me edit my first post. My real code does states.detach() there.
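
In other words, the line effectively becomes:

policy_loss = -self.q_model(states.detach(), policy_actions).mean()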

No, they don’t. I just wanted to mention it in case future readers are in a different scenario.


I am new to PyTorch. Could anyone please clarify whether the code above would pause the computation graph at line 2 (for p in self.q_model.parameters(): p.requires_grad = False) and resume it after line 4 (for p, rg in zip(self.q_model.parameters(), orig_requires_grads): p.requires_grad = rg)?

I am trying to implement a custom loss function where I need to pre-process the output of the network, which involves converting a tensor to numpy and then converting it back to a tensor, before finally feeding it to a log loss function. Will using the code above around the pre-processing help compute gradients correctly after the pre-processing step?
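
Roughly, the flow I have in mind looks like this (an illustrative sketch only, with placeholder names, not my actual code):

out = model(inputs)
# Convert to numpy for the pre-processing step (.detach() and .cpu() so
# .numpy() can be called), then wrap the result back into a tensor
processed_np = my_preprocessing(out.detach().cpu().numpy())
processed = torch.from_numpy(processed_np)
loss = my_log_loss(processed, labels)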