How does one have the parameters of a model NOT BE LEAFS?

I want the parameters of the model to NOT be leafs. I also want the parameters to be part of the computation graph, since I am dynamically changing them in code. So I want:

  • Parameters to NOT be leafs (but still be held inside an nn.Module so that I can use the dynamically changing parameters as part of the computation graph).
  • The “updating model” to be part of the computation graph at all times, but it seems that making the parameters leafs stops the flow of gradients.

How can I do that?

(I plan to post a working, self-contained, simplified Jupyter demo later, which should make it easier to help me.)


SO: https://stackoverflow.com/questions/60271131/how-does-one-have-parameters-in-a-pytorch-model-not-be-leafs-and-be-in-the-compu


Hi,

The general trick to make sure they are not leafs anymore is to first delete the field containing the nn.Parameter. Then you can set it to a Tensor that is not a leaf.

I tried what you suggested but it didn’t work…

        del loss_net.fc0.weight
        #setattr(loss_net.fc0,'weight', wt)
        loss_net.fc0.weight = wt

what did you have in mind?


Error msg:

TypeError: cannot assign 'torch.FloatTensor' as parameter 'weight' (torch.nn.Parameter or None expected)

variations that don’t work:

        #del loss_net.fc0.weight
        #setattr(loss_net.fc0, 'weight', nn.Parameter( wt ))
        #setattr(loss_net.fc0, 'weight', wt)
        #loss_net.fc0.weight = wt
        #loss_net.fc0.weight = nn.Parameter( wt )

Hi,

This does work:

m = nn.Linear(10, 10)
del m.weight
m.weight = torch.rand(10)

If you try to print the weight after the del, you should get that your module has no attribute ‘weight’. Is that what you get?

yes, I get that error msg:

AttributeError: 'Linear' object has no attribute 'weight'

But I get another error now:

RuntimeError: OrderedDict mutated during iteration

I am looping over all the parameters of my (potentially arbitrary) model, so I guess Python gets unhappy that I’ve mutated the parameter dict while iterating over it…
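(For reference, here is the Python-level issue in isolation, as a minimal sketch with a throwaway net rather than my full example: iterating over list(...) takes a snapshot of the parameters, so mutating the module inside the loop no longer touches the OrderedDict being iterated over.)

import torch
import torch.nn as nn

net = nn.Linear(1, 1)

# list(...) snapshots the (name, parameter) pairs up front,
# so the del/set below cannot trigger "mutated during iteration".
for name, w in list(net.named_parameters()):
    delattr(net, name)            # remove the registered nn.Parameter
    setattr(net, name, w + 0.0)   # re-set it as a plain, non-leaf Tensor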

@albanD

This is the self contained dummy example/prototype I built to debug this issue:

import torch
import torch.nn as nn

import copy

from collections import OrderedDict

# img = torch.randn([8,3,32,32])
# targets = torch.LongTensor([1, 2, 0, 6, 2, 9, 4, 9])
# img = torch.randn([1,3,32,32])
# targets = torch.LongTensor([1])
x = torch.randn(1)
target = 12.0*x**2

criterion = nn.CrossEntropyLoss()

#loss_net = nn.Sequential(OrderedDict([('conv0',nn.Conv2d(in_channels=3,out_channels=10,kernel_size=32))]))
loss_net = nn.Sequential(OrderedDict([('fc0', nn.Linear(in_features=1,out_features=1))]))

hidden = torch.randn(size=(1,1),requires_grad=True)
updater_net = nn.Sequential(OrderedDict([('fc0',nn.Linear(in_features=1,out_features=1))]))
print(f'updater_net.fc0.weight.is_leaf = {updater_net.fc0.weight.is_leaf}')
#
nb_updates = 2
for i in range(nb_updates):
    print(f'i = {i}')
    new_params = copy.deepcopy( loss_net.state_dict() )
    ## w^<t> := f(w^<t-1>,delta^<t-1>)
    for (name, w) in loss_net.named_parameters():
        print(f'name = {name}')
        print(w.size())
        hidden = updater_net(hidden).view(1)
        print(hidden.size())
        #delta = ((hidden**2)*w/2)
        delta = w + hidden
        wt = w + delta
        print(wt.size())
        new_params[name] = wt
        #del loss_net.fc0.weight
        #print(loss_net.fc0.weight)
        #setattr(loss_net.fc0, 'weight', nn.Parameter( wt ))
        #setattr(loss_net.fc0, 'weight', wt)
        #loss_net.fc0.weight = torch.randn(1)
        #loss_net.fc0.weight = nn.Parameter( wt )
    ##
    loss_net.load_state_dict(new_params)
#
print()
print(f'updater_net.fc0.weight.is_leaf = {updater_net.fc0.weight.is_leaf}')
outputs = loss_net(x)
loss_val = 0.5*(target - outputs)**2
loss_val.backward()
print()
print(f'-- params that dont matter if they have gradients --')
print(f'loss_net.grad = {loss_net.fc0.weight.grad}')
print('-- params we want to have gradients --')
print(f'hidden.grad = {hidden.grad}')
print(f'updater_net.fc0.weight.grad = {updater_net.fc0.weight.grad}')
print(f'updater_net.fc0.bias.grad = {updater_net.fc0.bias.grad}')

Hope it helps us…

its current (incorrect) output:

updater_net.fc0.weight.is_leaf = True
i = 0
name = fc0.weight
torch.Size([1, 1])
torch.Size([1])
torch.Size([1, 1])
name = fc0.bias
torch.Size([1])
torch.Size([1])
torch.Size([1])
i = 1
name = fc0.weight
torch.Size([1, 1])
torch.Size([1])
torch.Size([1, 1])
name = fc0.bias
torch.Size([1])
torch.Size([1])
torch.Size([1])

updater_net.fc0.weight.is_leaf = True

-- params that dont matter if they have gradients --
loss_net.grad = tensor([[9.6197]])
-- params we want to have gradients --
hidden.grad = None
updater_net.fc0.weight.grad = None
updater_net.fc0.bias.grad = None

Hi,

load_state_dict() loads the weights in a non-differentiable manner, so you end up with leafs, as expected.
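For example, a tiny self-contained check (with a throwaway net):

import torch
import torch.nn as nn

net = nn.Linear(1, 1)
wt = net.weight * 2           # non-leaf: it has a grad_fn
print(wt.is_leaf)             # False

sd = net.state_dict()
sd['weight'] = wt
net.load_state_dict(sd)       # values are copied in under no_grad
print(net.weight.is_leaf)     # True: the link to wt's history is gone
print(net.weight.grad_fn)     # None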

Here is a version that works as expected:


def del_attr(obj, names):
    if len(names) == 1:
        delattr(obj, names[0])
    else:
        del_attr(getattr(obj, names[0]), names[1:])
def set_attr(obj, names, val):
    if len(names) == 1:
        setattr(obj, names[0], val)
    else:
        set_attr(getattr(obj, names[0]), names[1:], val)

nb_updates = 2
for i in range(nb_updates):
    print(f'i = {i}')
    new_params = copy.deepcopy( loss_net.state_dict() )
    ## w^<t> := f(w^<t-1>,delta^<t-1>)
    for (name, w) in list(loss_net.named_parameters()):
        hidden = updater_net(hidden).view(1)
        #delta = ((hidden**2)*w/2)
        delta = w + hidden
        wt = w + delta
        del_attr(loss_net, name.split("."))
        set_attr(loss_net, name.split("."), wt)
    ##
#
print()
print(f'updater_net.fc0.weight.is_leaf = {updater_net.fc0.weight.is_leaf}')
print(f'loss_net.fc0.weight.is_leaf = {loss_net.fc0.weight.is_leaf}')
outputs = loss_net(x)
loss_val = 0.5*(target - outputs)**2
loss_val.backward()
print()
print(f'-- params that dont matter if they have gradients --')
print(f'loss_net.grad = {loss_net.fc0.weight.grad}')
print('-- params we want to have gradients --')
print(f'hidden.grad = {hidden.grad}') # None because this is not a leaf, it is overridden in the for loop above.
print(f'updater_net.fc0.weight.grad = {updater_net.fc0.weight.grad}')
print(f'updater_net.fc0.bias.grad = {updater_net.fc0.bias.grad}')

@albanD oh wow! The grads are being printed for sure now! Who else but the albanD boss would come to the rescue…

Sorry for being demanding, but would you mind commenting on or explaining what you did that made things work? Especially this code:

def del_attr(obj, names):
    if len(names) == 1:
        delattr(obj, names[0])
    else:
        del_attr(getattr(obj, names[0]), names[1:])
def set_attr(obj, names, val):
    if len(names) == 1:
        setattr(obj, names[0], val)
    else:
        set_attr(getattr(obj, names[0]), names[1:], val)

Thanks so much again… you’re a saviour… :muscle: :slight_smile:

Self-contained script that seems to work:

import torch
import torch.nn as nn

from torchviz import make_dot

import copy

from collections import OrderedDict

# img = torch.randn([8,3,32,32])
# targets = torch.LongTensor([1, 2, 0, 6, 2, 9, 4, 9])
# img = torch.randn([1,3,32,32])
# targets = torch.LongTensor([1])
x = torch.randn(1)
target = 12.0*x**2

criterion = nn.CrossEntropyLoss()

#loss_net = nn.Sequential(OrderedDict([('conv0',nn.Conv2d(in_channels=3,out_channels=10,kernel_size=32))]))
loss_net = nn.Sequential(OrderedDict([('fc0', nn.Linear(in_features=1,out_features=1))]))

hidden = torch.randn(size=(1,1),requires_grad=True)
updater_net = nn.Sequential(OrderedDict([('fc0',nn.Linear(in_features=1,out_features=1))]))
print(f'updater_net.fc0.weight.is_leaf = {updater_net.fc0.weight.is_leaf}')
#
def del_attr(obj, names):
    if len(names) == 1:
        delattr(obj, names[0])
    else:
        del_attr(getattr(obj, names[0]), names[1:])
def set_attr(obj, names, val):
    if len(names) == 1:
        setattr(obj, names[0], val)
    else:
        set_attr(getattr(obj, names[0]), names[1:], val)

nb_updates = 2
for i in range(nb_updates):
    print(f'i = {i}')
    new_params = copy.deepcopy( loss_net.state_dict() )
    ## w^<t> := f(w^<t-1>,delta^<t-1>)
    for (name, w) in list(loss_net.named_parameters()):
        hidden = updater_net(hidden).view(1)
        #delta = ((hidden**2)*w/2)
        delta = w + hidden
        wt = w + delta
        del_attr(loss_net, name.split("."))
        set_attr(loss_net, name.split("."), wt)
    ##
#
print()
print(f'updater_net.fc0.weight.is_leaf = {updater_net.fc0.weight.is_leaf}')
print(f'loss_net.fc0.weight.is_leaf = {loss_net.fc0.weight.is_leaf}')
outputs = loss_net(x)
loss_val = 0.5*(target - outputs)**2
loss_val.backward()
print()
print(f'-- params that dont matter if they have gradients --')
print(f'loss_net.grad = {loss_net.fc0.weight.grad}')
print('-- params we want to have gradients --')
print(f'hidden.grad = {hidden.grad}') # None because this is not a leaf, it is overridden in the for loop above.
print(f'updater_net.fc0.weight.grad = {updater_net.fc0.weight.grad}')
print(f'updater_net.fc0.bias.grad = {updater_net.fc0.bias.grad}')
make_dot(loss_val)

output:

updater_net.fc0.weight.is_leaf = True
i = 0
i = 1

updater_net.fc0.weight.is_leaf = True
loss_net.fc0.weight.is_leaf = False

-- params that dont matter if they have gradients --
loss_net.grad = None
-- params we want to have gradients --
hidden.grad = None
updater_net.fc0.weight.grad = tensor([[0.7152]])
updater_net.fc0.bias.grad = tensor([-7.4249])

Since you have arbitrary names, you need to be able to delete and set attributes at any depth (to delete the nn.Parameter and set a plain Tensor in its place). So these are recursive functions that, given the name split into a list like [“foo”, “bar”, “weight”], will set or delete obj.foo.bar.weight.
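For example, with the loss_net and the del_attr / set_attr defined above (a small illustration, not part of your script):

name = 'fc0.weight'
w = loss_net.fc0.weight
wt = w * 2.0                               # non-leaf Tensor built from the old weight

del_attr(loss_net, name.split("."))        # ends up calling delattr(loss_net.fc0, 'weight')
set_attr(loss_net, name.split("."), wt)    # ends up calling setattr(loss_net.fc0, 'weight', wt)

print(loss_net.fc0.weight.is_leaf)         # False: the weight is now part of the graph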

@albanD, this doesn’t quite work. I wanted to make sure it worked by printing the computation graph: increasing nb_updates should grow the computation graph, but it doesn’t. I believe it has something to do with the fact that we are changing/mutating the OrderedDict of params while looping through it at the same time. I tried to fix it by collecting the parameters before and after running the loop, but somehow it thinks the list is empty on the second iteration (i.e. t=1 doesn’t print anything when it should):

import torch
import torch.nn as nn

from torchviz import make_dot

import copy

from collections import OrderedDict

def del_attr(obj, names):
    if len(names) == 1:
        delattr(obj, names[0])
    else:
        del_attr(getattr(obj, names[0]), names[1:])
        
def set_attr(obj, names, val):
    if len(names) == 1:
        setattr(obj, names[0], val)
    else:
        set_attr(getattr(obj, names[0]), names[1:], val)

        
x = torch.randn(1, requires_grad=True)
y = torch.randn(1, requires_grad=True)

criterion = nn.CrossEntropyLoss()

loss_net = nn.Sequential(OrderedDict([('l_fc0', nn.Linear(in_features=1,out_features=1, bias=True))]))
loss_net.l_fc0.weight.requires_grad=False
loss_net.l_fc0.bias.requires_grad=False

hidden = torch.randn(size=(1,1),requires_grad=True)
updater_net = nn.Sequential(OrderedDict([('u_fc0',nn.Linear(in_features=1,out_features=1))]))
updater_net.u_fc0.bias.requires_grad = False
print(f'updater_net.u_fc0.weight.is_leaf = {updater_net.u_fc0.weight.is_leaf}')
#
outputs_virgin = loss_net(x)
params = dict(dict(loss_net.named_parameters()),**{'x':x})
make_dot(outputs_virgin, params=params).render('loss_net_x', format='png')
#
nb_updates = 2
params = list(loss_net.named_parameters())
for t in range(nb_updates):
    print(f't = {t}')
    ## w^<t> := f(w^<t-1>,delta^<t-1>)
    for (name, w) in params:
        delta = updater_net(hidden).view(1)
        wt = w + delta
        print(f'w^<{t}> = {wt}')
        del_attr(loss_net, name.split("."))
        set_attr(loss_net, name.split("."), wt)
    params = list(loss_net.named_parameters())

print()
print(f'updater_net.u_fc0.weight.is_leaf = {updater_net.u_fc0.weight.is_leaf}')
print(f'loss_net.l_fc0.weight.is_leaf = {loss_net.l_fc0.weight.is_leaf}')

outputs = loss_net(x)
loss_val = (outputs - y)**2
loss_val.backward()
print()
print(f'-- params that dont matter if they have gradients --')
print(f'loss_net.grad = {loss_net.l_fc0.weight.grad}')
print('-- params we want to have gradients --')
print(f'hidden.grad = {hidden.grad}') # None because this is not a leaf, it is overridden in the for loop above.
print(f'updater_net.u_fc0.weight.grad = {updater_net.u_fc0.weight.grad}')
#print(f'updater_net.u_fc0.bias.grad = {updater_net.u_fc0.bias.grad}')
params = dict(dict(updater_net.named_parameters()),**{'x':x,'y':y,'hidden':hidden},**dict(loss_net.named_parameters()))
make_dot(loss_val, params=params).render('loss_val', format='png')

Well, you should not use loss_net.named_parameters() anymore: we override them. So you should “save” them before doing the first override, e.g. params = list(loss_net.named_parameters()) before the for-loop.
Then you can set them back (as leafs) when you want to start a new iteration. So maybe extract them as a state_dict and restore them with load_state_dict.
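Something along these lines (just a sketch, reusing loss_net / updater_net / hidden and the del_attr / set_attr helpers from above; initial_params and nb_outer_iters are placeholder names):

import torch.nn as nn

# Save the original leaf values once, before any override.
initial_params = [(name, w.detach().clone()) for name, w in loss_net.named_parameters()]

nb_outer_iters = 2                                         # placeholder
for outer in range(nb_outer_iters):
    # Put the saved values back as fresh nn.Parameters (leafs again),
    # replacing whatever non-leaf Tensors the previous iteration left behind.
    for name, w0 in initial_params:
        del_attr(loss_net, name.split("."))
        set_attr(loss_net, name.split("."), nn.Parameter(w0.clone()))

    # Now do the differentiable override exactly as before.
    for name, w in list(loss_net.named_parameters()):
        delta = updater_net(hidden).view(1)
        wt = w + delta
        del_attr(loss_net, name.split("."))
        set_attr(loss_net, name.split("."), wt)            # non-leaf, part of the graph

Note this re-registers nn.Parameters directly via set_attr rather than going through load_state_dict, because once the attributes hold plain Tensors they no longer show up in the module’s state_dict.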

I don’t think we should do that: if we do, the next iteration would start from the restored parameters instead of the non-leaf ones, and the earlier updates won’t appear in the computation graph properly. It should be a chain of updates caused by the iterative use of wt.

Let me think about it…

I also thought we could include some special string, like the word “param”, in the field name when we set it, so that we could loop through all the fields of the object whose names contain that special string.

Note that if the goal is to do learning through your optimizer step, the higher library already implements this, and has a nice API for all that :slight_smile:
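For example, a rough sketch of the kind of thing higher (https://github.com/facebookresearch/higher) lets you write (throwaway model/data names, not your code):

import torch
import torch.nn as nn
import higher   # pip install higher

model = nn.Linear(1, 1)
inner_opt = torch.optim.SGD(model.parameters(), lr=0.1)
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x, y = torch.randn(4, 1), torch.randn(4, 1)

meta_opt.zero_grad()
# innerloop_ctx hands back a functional copy of the model whose parameters stay
# non-leaf, plus a differentiable optimizer, so the inner updates are recorded
# in the computation graph. copy_initial_weights=False lets the meta-gradients
# reach the original model.parameters().
with higher.innerloop_ctx(model, inner_opt, copy_initial_weights=False) as (fmodel, diffopt):
    for _ in range(2):                                # a couple of inner steps
        inner_loss = (fmodel(x) - y).pow(2).mean()
        diffopt.step(inner_loss)                      # differentiable update
    meta_loss = (fmodel(x) - y).pow(2).mean()
    meta_loss.backward()                              # flows back through the inner steps
meta_opt.step()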


thanks for that!

I’ve been playing around with the library, but I was wondering: is it possible to have a trainable step size with it?

I tried:

#
child_model = nn.Sequential(OrderedDict([
        ('conv1', nn.Conv2d(in_channels=3,out_channels=2,kernel_size=5)),
        ('relu1', nn.ReLU()),
        ('Flatten', Flatten()),
        ('fc', nn.Linear(in_features=28*28*2,out_features=10) )
    ]))
eta = nn.Sequential(OrderedDict([
    ('fc', nn.Linear(1,1)),
    ('sigmoid', nn.Sigmoid())
]))
inner_opt = torch.optim.Adam(child_model.parameters(), lr=eta)
meta_params = itertools.chain(child_model.parameters(),eta.parameters())
meta_opt = torch.optim.Adam(meta_params, lr=1e-3)

but it failed with error:

Exception has occurred: TypeError
'<=' not supported between instances of 'float' and 'Sequential'

Or, even better, have the update rule be some sort of NN…

I’m starting to think that going back to your setattr approach might be better X’D…

You can simply replace the optimizer by an nn?
Or fill in the .grad fields with your nn’s result and then use plain SGD to do the step.
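A rough sketch of the second option (grad_net here is a stand-in for your nn; note that this plain version does not backprop into grad_net, for that you would need the setattr approach or higher):

import torch
import torch.nn as nn

model = nn.Linear(1, 1)
grad_net = nn.Linear(1, 1)                      # maps a gradient entry to an update entry
opt = torch.optim.SGD(model.parameters(), lr=1.0)

x, y = torch.randn(1), torch.randn(1)
loss = (model(x) - y).pow(2).mean()
loss.backward()                                  # fills p.grad for model's parameters

with torch.no_grad():
    for p in model.parameters():
        # Replace each gradient with the output of the little network, reshaped back.
        proposed = grad_net(p.grad.view(-1, 1)).view_as(p.grad)
        p.grad.copy_(proposed)

opt.step()                                       # plain SGD applies the proposed "gradients"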