Detach, no_grad and requires_grad

Hello
It’s a general question, but currently I’m looking at this tutorial:
http://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html

Why does the optimize_model function use the
detach() method
while select_action uses
with torch.no_grad():

On the other hand, the docs:
http://pytorch.org/docs/stable/notes/autograd.html
mentions only requires_grad

Of course I understand that we don’t want to compute gradients here, but I don’t fully understand the difference between these 3 mechanisms…
Also, if I’m not mistaken, in previous versions of PyTorch we used volatile=True, which was considered more memory efficient (please correct me if I’m wrong) and which has now been replaced by with torch.no_grad():

So if we used with torch.no_grad(): in optimize_model, would that also be OK?

could anyone please explain it to me?

thank you

best,
Michael


Hi,

Does this post answer your question?


Hello albanD

thank you for your response.

The answer you mentioned helped me understand the general idea, but I still don’t fully understand the difference between torch.no_grad and detach.
Can I use them interchangeably, or are there any limitations?

thx
Michael


detach() detaches the output from the computational graph, so no gradient will be backpropagated through that tensor.
torch.no_grad says that no operation should build the graph.

The difference is that detach() affects only the tensor it is called on, while torch.no_grad affects all operations taking place within the with statement.
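
To make the scope difference concrete, here is a minimal sketch (the tensor names are made up for illustration). It only inspects requires_grad, which is one simple way to see whether history was recorded:

```python
import torch

w = torch.ones(3, requires_grad=True)

# detach(): the graph is still built for `a`; only `a_detached` is cut off from it.
a = w * 2
a_detached = a.detach()
print(a.requires_grad, a_detached.requires_grad)  # True False

# torch.no_grad(): no operation inside the block records history.
with torch.no_grad():
    b = w * 2
    c = b + 1
print(b.requires_grad, c.requires_grad)  # False False

# Outside the block, tracking resumes as usual.
d = w * 2
print(d.requires_grad)  # True
```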


good explanation. Can you give some use cases for both? I am guessing you’d use torch.no_grad during eval phase? But what would you use detach() for? Any specific use cases would help understand them better


Yes, in general you can use torch.no_grad in the eval phase.

detach(), on the other hand, is usually not needed for classic CNN-like architectures. It is mostly used for more tricky operations.
detach() is useful when you want to compute something that you can’t or don’t want to differentiate. For example, if you’re computing some indices from the output of the network and then want to use them to index a tensor: the indexing operation is not differentiable with respect to the indices, so you should detach() the indices before providing them.
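
As a sketch of the "don’t want to differentiate" case, the toy example below treats a value computed from the same parameter as a fixed target, similar in spirit to what the tutorial’s optimize_model does. The scalar parameter and the factors 3 and 5 are invented purely for illustration:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

pred = w * 3                 # gradient should flow through this
target = (w * 5).detach()    # treated as a constant during backward

loss = (pred - target) ** 2
loss.backward()

# d/dw (3w - const)^2 = 2 * (3w - const) * 3 = 2 * (6 - 10) * 3
print(w.grad.item())
```

Without the detach() call the loss would be (3w - 5w)^2 and the gradient would also flow through the target term, giving a different value.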


Thank you albanD

definitely it helps!

so if i understand properly this part

(anyone, please correct me if I’m wrong) we can use both methods interchangeably and they should give the same result

so

        with torch.no_grad():
            return policy_net(state).max(1)[1].view(1, 1)

should have similar effect as

        return policy_net(state).detach().max(1)[1].view(1, 1)

thanx again for your help!


The returned result will be the same, but the version with torch.no_grad will use less memory: it knows from the beginning that no gradients are needed, so it doesn’t need to keep intermediate results.
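
A small sketch of that comparison, assuming a tiny nn.Linear as a hypothetical stand-in for the tutorial’s policy_net. It does not measure memory; it only shows that the detach() version still records a graph for the network output, while the no_grad version does not:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
policy_net = nn.Linear(4, 2)   # hypothetical stand-in for the tutorial's network
state = torch.randn(1, 4)

with torch.no_grad():
    out_ng = policy_net(state)                   # no graph recorded
    action_ng = out_ng.max(1)[1].view(1, 1)

out = policy_net(state)                          # graph recorded here
action_dt = out.detach().max(1)[1].view(1, 1)    # history dropped only afterwards

print(torch.equal(action_ng, action_dt))  # same action either way
print(out_ng.grad_fn is None)             # nothing was tracked under no_grad
print(out.grad_fn is None)                # the pre-detach output still has history
```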


You’re right. And this is a huge difference.

Thank you

all clear now


@albanD Hi, does detach erase the existing weights?

So if we do:

policy_loss = …
self.policy_optimizer.zero_grad()
policy_loss.backward()
self.policy_optimizer.step()
policy_loss.detach()

For some reason, when I was doing this in a loop, the weights in policy.parameters() (the ones connected to the policy optimizer) stopped changing. I don’t understand the mechanism behind this.

The only thing detach does is to return a new Tensor (it does not change the current one) that does not share the history of the original Tensor.
If the only thing you added to your code is policy_loss.detach(), then it has no effect: you don’t use the result, and detach does not change anything else.
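
A minimal sketch of this point, using a made-up scalar loss. Calling detach() and ignoring or merely keeping its result leaves the original tensor and its backward pass untouched:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
loss = (w * 3) ** 2

detached = loss.detach()        # a new tensor without history
print(loss.requires_grad)       # True: the original is unchanged
print(detached.requires_grad)   # False

loss.backward()                 # still works on the original
print(w.grad)                   # tensor(36.), since d/dw 9w^2 = 18w
```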


@albanD I see, any pointers for why the weights aren’t changing? I notice first the biases stop changing, then more and more hidden layers stop changing incrementally.

The gradients were on the order of E-2 so not super small, but perhaps combined with the 3E-4 learning rate the update becomes numerically ~0 and ignored?

That looks plausible: it would give you an update of around 3e-6, which is near the scale where float32 rounding starts to matter (float32 carries roughly 7 significant decimal digits).
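
As a rough illustration of that rounding effect (not a diagnosis of the problem above), the sketch below applies an update of about 3e-6 to two float32 values. The weight magnitudes 1.0 and 100.0 are arbitrary choices for the example:

```python
import torch

update = torch.tensor(3e-4 * 1e-2)   # learning rate times gradient, about 3e-6

small_w = torch.tensor(1.0)
large_w = torch.tensor(100.0)

# Near 1.0, float32 values are spaced finely enough to register the update.
print(bool((small_w - update) != small_w))

# Near 100.0, the spacing between float32 values is larger than twice the
# update, so the result rounds back to the same number.
print(bool((large_w - update) == large_w))

print(torch.finfo(torch.float32).eps)   # relative precision of float32
```

Whether an update is lost therefore depends on the size of the weight relative to the update, not on the update alone.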

Assume I have a network in two parts like:

def forward(self, x):
  x = self.part_1(x)
  return self.part_2(x)

If I want to update only the second part in training phase, will this modification do the desired thing?

def forward(self, x):
  with torch.no_grad():
    x = self.part_1(x)
  return self.part_2(x)

Yes, your code will work.
Alternatively, you could call .detach() on x before passing it to part_2:

def forward(self, x):
    x = self.part_1(x)
    return self.part_2(x.detach())
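
Both variants can be checked with a small sketch like the following. The layer sizes and the use_no_grad flag are assumptions made for the example; part_1 and part_2 are plain nn.Linear layers here:

```python
import torch
import torch.nn as nn

class TwoPart(nn.Module):
    def __init__(self, use_no_grad):
        super().__init__()
        self.part_1 = nn.Linear(4, 3)
        self.part_2 = nn.Linear(3, 1)
        self.use_no_grad = use_no_grad   # hypothetical switch between the two variants

    def forward(self, x):
        if self.use_no_grad:
            with torch.no_grad():
                x = self.part_1(x)
        else:
            x = self.part_1(x).detach()
        return self.part_2(x)

def grads_after_backward(use_no_grad):
    model = TwoPart(use_no_grad)
    model(torch.randn(5, 4)).sum().backward()
    return model.part_1.weight.grad, model.part_2.weight.grad

for flag in (True, False):
    g1, g2 = grads_after_backward(flag)
    # part_1 receives no gradient; part_2 does.
    print(flag, g1 is None, g2 is not None)
```

In both cases an optimizer would only have gradients for part_2. The no_grad variant additionally avoids recording the graph for part_1.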

In torch >= 1.0.0, what is the correct way to copy the data of a tensor so that the copy “is just a copy” (i.e. a standalone tensor that doesn’t know anything about graphs, gradients etc.)?

x = my_tensor.detach().clone()
x = my_tensor.clone().detach()

# >>> Now, the two lines above, but augmented with .data in all possible places:
x = my_tensor.data.detach().clone()
x = my_tensor.detach().data.clone()
# ...
x = my_tensor.clone().detach().data
# <<<

It looks like all of these options solve the problem, don’t they? If that’s true, then the options with .data are all redundant and the question is .detach().clone() vs .clone().detach().
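
A quick way to compare the two orders is sketched below. It leaves the .data variants out and only checks that each result carries no history and owns its own memory; my_tensor here is an invented non-leaf tensor:

```python
import torch

# A tensor that has history, so there is something to detach from.
my_tensor = torch.tensor([1.0, 2.0], requires_grad=True) * 2

a = my_tensor.detach().clone()   # clone happens outside the graph
b = my_tensor.clone().detach()   # clone is recorded first, then history is dropped

for copy in (a, b):
    assert not copy.requires_grad and copy.grad_fn is None
    assert copy.data_ptr() != my_tensor.data_ptr()   # separate memory
    assert torch.equal(copy, my_tensor)              # same values

# detach() on its own is not a copy: it shares storage with the original.
alias = my_tensor.detach()
print(alias.data_ptr() == my_tensor.data_ptr())  # True
```

As far as this check goes, both orders produce a standalone copy. With .detach().clone() the clone operation itself is not tracked by autograd, which is a reason one might prefer that order.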

One more case:

def forward(self,x):
  x = self.root(x)
  out1 = self.branch_1(x)
  out2 = self.branch_2(x.detach())
  return out1, out2

loss = F.mse_loss(out1, target1) + F.mse_loss(out2, target2)
loss.backward()

I want the gradients from branch_1 to update the parameters of root and branch_1, but the gradients from the second branch should update only the branch_2 parameters. I guess this use of detach will make that happen, but I want to be sure.


Hi,

Yes, this will do what you want!
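
One way to convince yourself is the sketch below. The layer sizes, the random data, and the use of F.mse_loss as the L2-style loss are assumptions for the example. It compares the root gradient from the combined loss with the root gradient from the first loss alone:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Branched(nn.Module):
    def __init__(self):
        super().__init__()
        self.root = nn.Linear(4, 3)
        self.branch_1 = nn.Linear(3, 2)
        self.branch_2 = nn.Linear(3, 2)

    def forward(self, x):
        x = self.root(x)
        out1 = self.branch_1(x)
        out2 = self.branch_2(x.detach())   # cut the path from branch_2 back to root
        return out1, out2

torch.manual_seed(0)
model = Branched()
x = torch.randn(5, 4)
target1, target2 = torch.randn(5, 2), torch.randn(5, 2)

# Root gradient produced by the first loss alone.
out1, _ = model(x)
F.mse_loss(out1, target1).backward()
root_grad_branch1_only = model.root.weight.grad.clone()
model.zero_grad()

# Root gradient produced by the combined loss.
out1, out2 = model(x)
(F.mse_loss(out1, target1) + F.mse_loss(out2, target2)).backward()

print(torch.allclose(model.root.weight.grad, root_grad_branch1_only))
print(model.branch_2.weight.grad is not None)
```

If the two root gradients match, the second loss contributed nothing to root, while branch_2 still received its own gradient.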


For example, in Faster R-CNN gradients do not backprop through the proposals; therefore, once we have computed the proposals in the RPN, we can detach them and use the detached tensors (ROIs) for the rest of the network (https://github.com/pytorch/vision/blob/8dfcff745a5bd2d4886716bf0deff8dcc8e75fed/torchvision/models/detection/rpn.py#L491).

Another use case is a network with auxiliary losses: if we do not detach at the proper points (just after these auxiliary outputs), the gradients of some parameters will receive contributions from several losses, which we may not want.