Greetings!
Suppose I want to write my own optimizer, e.g. by modifying torch.optim.SGD.
In the step function of SGD, the parameters p are updated in-place, e.g.:
for p in group['params']:
## SOME CODE
p.add_(d_p, alpha=-group['lr'])
My first question is why it does not use p.data.add_(d_p, alpha=-group['lr']),
since the actual parameter tensor is the data object, is it not?
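To make the question concrete, here is a minimal sketch of a custom SGD-style optimizer in this spirit. The class name MySGD is illustrative, not the actual torch.optim.SGD source; it only shows the in-place p.add_ update under @torch.no_grad().

```python
import torch

# Minimal sketch of a custom SGD-style optimizer (illustrative only,
# not the real torch.optim.SGD implementation).
class MySGD(torch.optim.Optimizer):
    def __init__(self, params, lr=0.1):
        super().__init__(params, dict(lr=lr))

    @torch.no_grad()  # the update must not be recorded in the autograd graph
    def step(self):
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                # In-place update on the parameter itself; no .data needed,
                # because @torch.no_grad() already disables graph recording.
                p.add_(p.grad, alpha=-group['lr'])

# Tiny usage example
w = torch.nn.Parameter(torch.tensor([1.0]))
opt = MySGD([w], lr=0.5)
loss = (w ** 2).sum()
loss.backward()   # w.grad = 2*w = [2.0]
opt.step()        # w <- 1.0 - 0.5 * 2.0 = 0.0
```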
Also suppose I want to change my parameter p with a computed value x. Is the following correct?
#### compute x
p.data = x
I would still be in the step() function, so gradient tracking should be turned off by the @torch.no_grad() decorator
placed before the step() definition.
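For reference, this is a small standalone check of what the question proposes: assigning a freshly computed tensor to a parameter via .data inside a no_grad block (the values and shapes are made up for illustration).

```python
import torch

# Illustrative check: assigning a computed tensor to a parameter via .data
# inside a no_grad block, as the question proposes.
p = torch.nn.Parameter(torch.zeros(3))

with torch.no_grad():
    x = torch.ones(3) * 2.0  # some computed value
    p.data = x               # rebinds p's storage to x; p remains a Parameter
```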
Best,
PiF
Hi,
p.data = x
should work, but using the .data attribute is generally not recommended. Since you already have @torch.no_grad() before the function, p.add_ does what you want in a simpler and more efficient way.
You can think of .data as something similar to detach(): when you "get" the .data attribute, you actually create a new tensor that happens to share storage with the old tensor but does not share the computational graph. When you "set" the .data attribute, it is handled underneath so you do modify the original Tensor as you'd expect, but it is less efficient because you create extra Tensor objects unnecessarily.
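The "get" behavior described above can be verified directly: the tensor returned by .data shares memory with the original but is detached from the graph (the variable names here are just for illustration).

```python
import torch

# .data behaves like detach(): same storage, no graph connection.
t = torch.ones(3, requires_grad=True)
d = t.data                            # new Tensor object, shared storage
assert d.data_ptr() == t.data_ptr()   # same underlying memory
assert not d.requires_grad            # detached from the graph
d[0] = 5.0                            # mutating d mutates t's values too
```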
Also, a subtle difference to note: @torch.no_grad() doesn't exactly turn off "gradient storage". It just means that when you perform an op, no new nodes get added to the backward graph. For example, you can still call .backward() inside a @torch.no_grad() block, and .grad is still populated.
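A quick sketch of that last point: a graph built outside the no_grad block can still be backpropagated through from inside it, and .grad gets filled.

```python
import torch

# no_grad() stops new graph nodes from being recorded, but it does not
# erase an existing graph: backward() still works and .grad is populated.
w = torch.ones(2, requires_grad=True)
loss = (w * 3.0).sum()     # graph built outside no_grad

with torch.no_grad():
    y = w * 10.0           # no graph recorded for this op
    loss.backward()        # existing graph still usable; fills w.grad
```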
Thank you soulitzer for the helpful response!
So if I understood you correctly, p.data = x
returns a new tensor object p.data
that holds the old tensor of p and is then modified to be x. Torch then embeds this new tensor back into the parameter p. And this process is inefficient because it is not an in-place modification.
What would you recommend instead of p.data = x
? I cannot use the add_
function as the new value canât be trivially written as a sum of the old value and something new.
edit: The computation of x is a bit lengthy, but I guess I can use a chain of in-place operations on my parameters, like p.add_().cos_()....
Yep, the inefficiency in p.data = x
isn't the setting part. It's that you allocate a new tensor x
first, as opposed to just operating on p in place, which avoids allocating any new tensors.
p.copy_(x)
is a safer alternative to p.data = x
, so I'd recommend it over that, but since you're still allocating a new tensor in both cases, it won't be any more efficient.
The only way to avoid allocating new tensors is to use a chain of in-place ops, like you said.
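The two options can be sketched side by side; the arithmetic here is invented purely to show the mechanics of copy_ versus a chain of in-place ops.

```python
import torch

p = torch.nn.Parameter(torch.tensor([2.0]))

with torch.no_grad():
    # Option 1: allocate x, then copy it into p's existing storage (safe,
    # but x is still a newly allocated tensor).
    x = p * 3.0 + 1.0              # allocates a new tensor: [7.0]
    ptr_before = p.data_ptr()
    p.copy_(x)                     # in-place write; p's storage is unchanged
    assert p.data_ptr() == ptr_before

    # Option 2: chain of in-place ops, no new tensor allocated at all.
    p.mul_(2.0).add_(1.0)          # p <- p*2 + 1 = [15.0]
```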