Greetings!
Suppose I want to write my own optimizer, e.g. by modifying torch.optim.SGD.
In the step function of SGD, the parameters p are updated in-place, e.g.:
for p in group['params']:
## SOME CODE
p.add_(d_p, alpha=-group['lr'])
My first question is why it does not use p.data.add_(d_p, alpha=-group['lr']),
since the actual parameter tensor is the data object, is it not?
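To make the question concrete, here is a minimal sketch of a custom SGD-style optimizer in this spirit. The class name MySGD is illustrative, not the actual torch.optim.SGD source; it only shows the in-place p.add_ update under @torch.no_grad().

```python
import torch

# Minimal sketch of a custom SGD-style optimizer (illustrative only,
# not the real torch.optim.SGD implementation).
class MySGD(torch.optim.Optimizer):
    def __init__(self, params, lr=0.1):
        super().__init__(params, dict(lr=lr))

    @torch.no_grad()  # the update must not be recorded in the autograd graph
    def step(self):
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                # In-place update on the parameter itself; no .data needed,
                # because @torch.no_grad() already disables graph recording.
                p.add_(p.grad, alpha=-group['lr'])

# Tiny usage example
w = torch.nn.Parameter(torch.tensor([1.0]))
opt = MySGD([w], lr=0.5)
loss = (w ** 2).sum()
loss.backward()   # w.grad = 2*w = [2.0]
opt.step()        # w <- 1.0 - 0.5 * 2.0 = 0.0
```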
Also suppose I want to change my parameter p with a computed value x. Is the following correct?
#### compute x
p.data = x
I would still be in the step() function, so gradient tracking should be turned off by the @torch.no_grad() decorator
placed before the step() definition.
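For reference, this is a small standalone check of what the question proposes: assigning a freshly computed tensor to a parameter via .data inside a no_grad block (the values and shapes are made up for illustration).

```python
import torch

# Illustrative check: assigning a computed tensor to a parameter via .data
# inside a no_grad block, as the question proposes.
p = torch.nn.Parameter(torch.zeros(3))

with torch.no_grad():
    x = torch.ones(3) * 2.0  # some computed value
    p.data = x               # rebinds p's storage to x; p remains a Parameter
```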
Best,
PiF
Hi,
p.data = x
should work, but using the .data attribute is generally not recommended. Since you already have @torch.no_grad() before the function, p.add_ does what you want in a simpler and more efficient way.
You can think of .data as something similar to detach(): when you "get" the .data attribute, you actually create a new tensor that happens to share storage with the old tensor but does not share the computational graph. When you "set" the .data attribute, it is handled underneath so you do modify the original Tensor as you'd expect, but it is less efficient because you create extra Tensor objects unnecessarily.
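The "get" behavior described above can be verified directly: the tensor returned by .data shares memory with the original but is detached from the graph (the variable names here are just for illustration).

```python
import torch

# .data behaves like detach(): same storage, no graph connection.
t = torch.ones(3, requires_grad=True)
d = t.data                            # new Tensor object, shared storage
assert d.data_ptr() == t.data_ptr()   # same underlying memory
assert not d.requires_grad            # detached from the graph
d[0] = 5.0                            # mutating d mutates t's values too
```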
Also, a subtle difference to note: @torch.no_grad() doesn't exactly turn off "gradient storage". It just means that when you perform an op, no new nodes get added to the backward graph. For example, you can still call .backward() inside a @torch.no_grad() block, and .grad is still populated.
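A quick sketch of that last point: a graph built outside the no_grad block can still be backpropagated through from inside it, and .grad gets filled.

```python
import torch

# no_grad() stops new graph nodes from being recorded, but it does not
# erase an existing graph: backward() still works and .grad is populated.
w = torch.ones(2, requires_grad=True)
loss = (w * 3.0).sum()     # graph built outside no_grad

with torch.no_grad():
    y = w * 10.0           # no graph recorded for this op
    loss.backward()        # existing graph still usable; fills w.grad
```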
Thank you soulitzer for the helpful response!
So if I understood you correctly, p.data = x
returns a new tensor object p.data
that holds the old tensor of p and is then modified to be x. Torch then embeds this new tensor back into the parameter p. And this process is inefficient because it is not an in-place modification.
What would you recommend instead of p.data = x
? I cannot use the add_
function as the new value canât be trivially written as a sum of the old value and something new.
edit: The computation of x is a bit lengthy, but I guess I can use a chain of in-place operations on my parameters, like p.add_().cos_()....
Yep, the inefficiency in p.data = x
isn't the setting part. It's that you allocate a new tensor x
first, as opposed to just operating on p in place, which avoids allocating any new tensors.
p.copy_(x)
is a safer alternative to p.data = x
, so I'd recommend it over that, but since you're still allocating a new tensor in both cases, it won't be any more efficient.
The only way to avoid allocating new tensors is to use a chain of in-place ops, like you said.
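The two options can be sketched side by side; the arithmetic here is invented purely to show the mechanics of copy_ versus a chain of in-place ops.

```python
import torch

p = torch.nn.Parameter(torch.tensor([2.0]))

with torch.no_grad():
    # Option 1: allocate x, then copy it into p's existing storage (safe,
    # but x is still a newly allocated tensor).
    x = p * 3.0 + 1.0              # allocates a new tensor: [7.0]
    ptr_before = p.data_ptr()
    p.copy_(x)                     # in-place write; p's storage is unchanged
    assert p.data_ptr() == ptr_before

    # Option 2: chain of in-place ops, no new tensor allocated at all.
    p.mul_(2.0).add_(1.0)          # p <- p*2 + 1 = [15.0]
```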