Detach, no_grad and requires_grad

michaelzet · April 25, 2018, 5:29am

Hello
It’s a general question, but currently I’m looking at this tutorial:
http://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html

Why does the optimize_model function use the
detach() method,
while select_action uses
with torch.no_grad():

On the other hand, the doc:
http://pytorch.org/docs/stable/notes/autograd.html
mentions only requires_grad.

Of course I understand we don’t want to compute gradients here, but I don’t fully understand the difference between these 3 methods…
Also, if I’m not mistaken, in previous versions of PyTorch we used volatile=True, which was considered more memory efficient (please correct me if I’m wrong) and which is now replaced by with torch.no_grad():

So if we used with torch.no_grad(): in optimize_model, would that also be OK?

Could anyone please explain it to me?

thank you

best,
Michael

albanD · April 25, 2018, 11:04am

Hi,

Does this post answer your question?

michaelzet · April 25, 2018, 11:31am

Hello albanD

thank you for your response.

The answer you mentioned helped me understand the general idea, but I still don’t fully understand the difference between torch.no_grad and detach.
Can I use them interchangeably, or are there any limitations?

thx
Michael

albanD · April 25, 2018, 12:28pm

detach() detaches the output from the computational graph, so no gradient will be backpropagated along this variable.
torch.no_grad says that no operation should build the graph.

The difference is that one refers to only a given variable on which it’s called. The other affects all operations taking place within the with statement.
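
To make the distinction concrete, here is a minimal sketch (the tensor names are made up for illustration). With detach() the graph is still built for the original tensor and only the returned tensor is cut off from it; inside torch.no_grad() nothing is tracked at all.

```python
import torch

w = torch.ones(3, requires_grad=True)

# detach(): y is tracked as usual, z is a view of the same data without history.
y = w * 2
z = y.detach()

# torch.no_grad(): every operation inside the block is left untracked.
with torch.no_grad():
    u = w * 2
    v = u + 1

print(y.requires_grad, z.requires_grad)  # True False
print(u.requires_grad, v.requires_grad)  # False False
```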

Irfan_Bulu · April 25, 2018, 12:33pm

good explanation. Can you give some use cases for both? I am guessing you’d use torch.no_grad during eval phase? But what would you use detach() for? Any specific use cases would help understand them better

albanD · April 25, 2018, 12:36pm

Yes, you can use torch.no_grad in the eval phase in general.

detach(), on the other hand, should not be needed if you’re doing classic CNN-like architectures. It is usually used for more tricky operations.
detach() is useful when you want to compute something that you can’t / don’t want to differentiate. For example, if you’re computing some indices from the output of the network and then want to use them to index a tensor: the indexing operation is not differentiable wrt the indices, so you should detach() the indices before providing them.
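
As a rough sketch of the "don’t want to differentiate" case, loosely modeled on the Q-learning tutorial from the first post: the bootstrapped target is treated as a constant, so it is detached before entering the loss. The single linear layer, the shapes and gamma below are stand-ins chosen for illustration, not the tutorial’s actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
net = nn.Linear(4, 2)            # hypothetical stand-in for a Q-network
state = torch.randn(5, 4)
next_state = torch.randn(5, 4)
reward = torch.randn(5)
gamma = 0.9                      # assumed discount factor

q = net(state).max(1)[0]                      # gradients flow through this
next_q = net(next_state).max(1)[0].detach()   # treated as a constant
target = reward + gamma * next_q

loss = F.mse_loss(q, target)
loss.backward()                  # only the q path contributes to net's gradients

print(q.requires_grad, target.requires_grad)  # True False
```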

michaelzet · April 26, 2018, 2:30pm

Thank you albanD

definitely it helps!

So if I understand this part properly

(please, anyone, correct me if I’m wrong), we can use both methods interchangeably and they should give the same result.

so

        with torch.no_grad():
            return policy_net(state).max(1)[1].view(1, 1)

should have similar effect as

       return policy_net(state).detach().max(1)[1].view(1, 1)

thanx again for your help!

albanD · April 26, 2018, 2:54pm

The returned result will be the same, but the version with torch.no_grad will use less memory because it knows from the beginning that no gradients are needed, so it doesn’t need to keep intermediate results.
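
A small sketch of that difference, assuming a toy policy_net (the layer sizes are made up). Both variants select the same action, but only the detach() variant builds a graph during the forward pass before discarding it.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical stand-in for the tutorial's policy network.
policy_net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
state = torch.randn(1, 4)

with torch.no_grad():
    out_a = policy_net(state)        # no graph is ever built
action_a = out_a.max(1)[1].view(1, 1)

out_full = policy_net(state)         # graph built, intermediates kept for backward
out_b = out_full.detach()            # history is only dropped afterwards
action_b = out_b.max(1)[1].view(1, 1)

print(out_a.grad_fn, out_b.grad_fn)      # None None
print(out_full.grad_fn is not None)      # True
print(torch.equal(action_a, action_b))   # True
```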

michaelzet · April 26, 2018, 5:07pm

You’re right. And this is a huge difference.

Thank you

all clear now

whoab · March 19, 2019, 10:01am

@albanD Hi, does detach erase the existing weights?

So if we do:

policy_loss = …
self.policy_optimizer.zero_grad()
policy_loss.backward()
self.policy_optimizer.step()
policy_loss.detach()

For some reason when I was doing this in a loop the weights of the policy.parameters() [connected to the policy optimizer] stopped changing. I don’t understand the mechanism for why.

albanD · March 19, 2019, 10:27am

The only thing detach does is to return a new Tensor (it does not change the current one) that does not share the history of the original Tensor.
If the only thing you added to your code is policy_loss.detach() then it does not do anything to your code as you don’t use the result and detach does not change anything else.
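
A quick way to check this, as a sketch with made-up names: calling detach() and discarding the result leaves the original tensor and its history untouched, so backward() still works as before.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
loss = (w ** 2).sum()

loss.detach()          # result discarded: no effect on loss or w
d = loss.detach()      # a new tensor that does not carry the history

print(loss.requires_grad, d.requires_grad)  # True False

loss.backward()        # the original graph is still intact
print(w.grad)          # tensor([2., 4.])
```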

whoab · March 19, 2019, 10:52am

@albanD I see, any pointers for why the weights aren’t changing? I notice first the biases stop changing, then more and more hidden layers stop changing incrementally.

The gradients were on the order of E-2 so not super small, but perhaps combined with the 3E-4 learning rate the update becomes numerically ~0 and ignored?

albanD · March 19, 2019, 10:55am

Yes, that can happen. It would give you an update around 1e-6, which is where float numbers start to lose precision.
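
As a rough illustration of the precision argument, assuming a plain SGD-style update of learning rate times gradient (3e-4 * 1e-2 = 3e-6; other optimizers scale the step differently): whether such a step survives in float32 depends on the magnitude of the weight it is applied to. The weight values below are picked for illustration only.

```python
import torch

update = torch.tensor(3e-6)        # assumed: lr 3e-4 times a gradient of 1e-2

small_w = torch.tensor(1.0)        # float32 by default
large_w = torch.tensor(100.0)

# Near 1.0 the spacing between float32 values is about 1e-7, so the step registers.
changed_small = (small_w - update != small_w).item()

# Near 100.0 the spacing is about 7.6e-6, so the step rounds away entirely.
changed_large = (large_w - update != large_w).item()

print(changed_small, changed_large)  # True False
```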

Xonobo_Xonobo · March 27, 2019, 11:44am

Assume I have a network in two parts like:

def forward(self, x):
  x = self.part_1(x)
  return self.part_2(x)

If I want to update only the second part in training phase, will this modification do the desired thing?

def forward(self, x):
  with torch.no_grad():
    x = self.part_1(x)
  return self.part_2(x)

ptrblck · March 29, 2019, 12:11am

Yes, your code will work.
Alternatively, you could call .detach() on x before passing it to part_2:

def forward(self, x):
    x = self.part_1(x)
    return self.part_2(x.detach())
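
Both variants can be checked with a small sketch; the TwoPart class and its layer sizes are made up for illustration. After backward(), the part_1 parameters have no gradient in either case, while part_2 does.

```python
import torch
import torch.nn as nn

class TwoPart(nn.Module):
    """Hypothetical two-part network used only for this check."""

    def __init__(self, use_no_grad):
        super().__init__()
        self.part_1 = nn.Linear(4, 3)
        self.part_2 = nn.Linear(3, 1)
        self.use_no_grad = use_no_grad

    def forward(self, x):
        if self.use_no_grad:
            with torch.no_grad():          # part_1 is never recorded
                x = self.part_1(x)
            return self.part_2(x)
        x = self.part_1(x)                 # recorded, then cut by detach()
        return self.part_2(x.detach())

grads = {}
for use_no_grad in (True, False):
    model = TwoPart(use_no_grad)
    model(torch.randn(2, 4)).sum().backward()
    grads[use_no_grad] = (model.part_1.weight.grad, model.part_2.weight.grad)

for use_no_grad, (g1, g2) in grads.items():
    print(use_no_grad, g1 is None, g2 is not None)  # True True in both rows
```

The no_grad variant additionally avoids storing part_1’s intermediate results, as discussed earlier in the thread.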

StrausMG · April 7, 2019, 2:54pm

In torch >= 1.0.0, what is the correct way to copy the data of a tensor so that the copy “is just a copy” (i.e. a standalone tensor that doesn’t know anything about graphs, gradients etc.)?

x = my_tensor.detach().clone()
x = my_tensor.clone().detach()

# >>> Now, the two lines above, but augmented with .data in all possible places:
x = my_tensor.data.detach().clone()
x = my_tensor.detach().data.clone()
# ...
x = my_tensor.clone().detach().data
# <<<

It looks like all of these options solve the problem, don’t they? If that’s true, then the options with .data are all redundant and the question is .detach().clone() vs .clone().detach().
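
For what it’s worth, a minimal sketch comparing the two orderings (my_tensor here is a made-up non-leaf tensor): both give a standalone copy with no history and its own memory. The practical difference is that .detach().clone() clones a tensor that is already untracked, so the clone itself is never recorded in the graph.

```python
import torch

my_tensor = torch.ones(3, requires_grad=True) * 2   # non-leaf tensor with history

a = my_tensor.detach().clone()   # clone is not recorded by autograd
b = my_tensor.clone().detach()   # clone is recorded, then cut off

for x in (a, b):
    # No gradient tracking, no history, separate storage from the original.
    print(x.requires_grad, x.grad_fn, x.data_ptr() != my_tensor.data_ptr())

a[0] = 100.0                     # modifying the copy leaves the original alone
print(my_tensor[0].item())       # 2.0
```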

Xonobo_Xonobo · July 22, 2019, 4:26am

One more case:

def forward(self,x):
  x = self.root(x)
  out1 = self.branch_1(x)
  out2 = self.branch_2(x.detach())
  return out1, out2

loss = F.mse_loss(out1, target1) + F.mse_loss(out2, target2)
loss.backward()

I want the gradients from branch_1 to update the parameters of the root and branch_1, but the gradients of the second branch should only update the branch_2 parameters. I guess this use of detach will make that happen, but I want to be sure.

albanD · July 22, 2019, 9:05am

Hi,

Yes this will do what you want !
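
One way to convince yourself, as a sketch with made-up layer sizes and F.mse_loss standing in for the L2 loss: compute the root’s gradient from the first loss alone, then compare it with what the combined loss produces.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
# Hypothetical stand-ins for root, branch_1 and branch_2.
root, branch_1, branch_2 = nn.Linear(4, 3), nn.Linear(3, 1), nn.Linear(3, 1)
x = torch.randn(5, 4)
target1, target2 = torch.randn(5, 1), torch.randn(5, 1)

h = root(x)
loss1 = F.mse_loss(branch_1(h), target1)
loss2 = F.mse_loss(branch_2(h.detach()), target2)   # cut off from the root

# Reference: gradient of the root weights from loss1 only.
(ref,) = torch.autograd.grad(loss1, root.weight, retain_graph=True)

(loss1 + loss2).backward()

print(torch.allclose(root.weight.grad, ref))   # True: loss2 adds nothing to the root
print(branch_2.weight.grad is not None)        # True: branch_2 is still trained
```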

GorkemP · September 22, 2020, 11:02pm

For example in Faster R-CNN gradients do not backprop through proposals; therefore, once we have computed the proposals in RPN, we can detach them and use the detached tensors (ROIs) for the rest of the networks (https://github.com/pytorch/vision/blob/8dfcff745a5bd2d4886716bf0deff8dcc8e75fed/torchvision/models/detection/rpn.py#L491).

Another use case: if there are auxiliary losses in the network and we do not detach at the proper points (just after these auxiliary outputs), the gradients of some parameters will be calculated multiple times, which we may not want.

Tejan_Mehndiratta · April 21, 2021, 4:08pm

So, if I call .detach() on a variable, then that variable will never be in the computation graph. Right?