If we combine a trainable parameter with a non-trainable parameter, does the original trainable param stay trainable?

Yes, that will work.
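For example, a minimal sketch (the tensor names are just made up for illustration):

import torch
from torch import nn

w_train = nn.Parameter(torch.randn(3, 3))   # trainable: requires_grad=True by default
w_frozen = torch.randn(3, 3)                # plain tensor: requires_grad=False

w_combined = w_train + w_frozen             # the combination
print(w_combined.requires_grad)             # True, so gradients can flow back to w_train

w_combined.sum().backward()
print(w_train.grad is not None)             # True
print(w_frozen.grad is None)                # True: the frozen tensor gets no gradient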

Hmm, but I think I would need to pass the weights of the trainable net to the optimizer, right? Otherwise who knows what would happen… Does it matter whether the params of the placeholder net are trainable or not? I assume not, since they get substituted by a trainable param (i.e. the combination of one trainable and one non-trainable param is trainable).
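Something like this rough sketch is what I have in mind (the net names are just stand-ins for whatever the real modules are):

from torch import nn
import torch.optim as optim

net_train = nn.Linear(4, 2)          # stand-in for the trainable net
net_place_holder = nn.Linear(4, 2)   # stand-in for the placeholder net

# only the trainable net's parameters are handed to the optimizer; the
# placeholder's own params are never stepped, they just get overwritten by
# the combination before each forward pass
optimizer = optim.SGD(net_train.parameters(), lr=0.001, momentum=0.9)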

Do we really need the del command? This seems rather inefficient… or is it? What do you recommend to make this code efficient?

I guess I was worried that if the original weights were non-trainable, then W_trainable + W_non_trainable would become non-trainable… but the del actually deletes the old instance/object, so requires_grad ends up set to True on the new combined tensor.

How do you implement the code you suggested if what I have is the string name of the layer, conv0?

perhaps:

setattr(self, f'bn2D_conv{i}', bn)

Use delattr instead of del then
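For example (a rough sketch with a made-up module):

from torch import nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv0 = nn.Conv2d(1, 20, 5)

net = Net()
name = 'conv0'                      # only the string name of the layer is known
old = getattr(net, name)            # look the attribute up by its string name
delattr(net, name)                  # `del net.conv0` needs the literal name; delattr takes a string
setattr(net, name, old)             # attach the (new) object under the same name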

do I really need to delete it before using setattr?

Oh no, now I can't because the attribute is 'conv0.weight' XD, so it can't find it…

Do we actually need to delete the attribute though? What goes wrong if I don't?

Isn't what we really need to make sure that the right variables are inserted into the computation graph, so that the backward computation is done correctly? Can't this be achieved by setting the placeholder net (and likewise the non-trainable net) to eval?

The placeholder net is only needed so that the forward computation is done right, because it holds the combination.
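Here's a small check on the eval() part (just my own sketch): eval() only changes layer behavior like dropout/batchnorm, it doesn't stop gradient computation, so what matters is which tensors end up in the graph.

import torch
from torch import nn

layer = nn.Linear(4, 2)
layer.eval()                             # switches layer behavior, not autograd

out = layer(torch.randn(1, 4)).sum()
out.backward()
print(layer.weight.grad is not None)     # True: eval() by itself doesn't block gradients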

OK, I filed a bug with reproducible code:

code:

import torch
from torch import nn
import torch.optim as optim

import torchvision
import torchvision.transforms as transforms

from collections import OrderedDict

import copy

def dont_train(net):
    '''
    Set requires_grad to False for all of the net's parameters.
    '''
    for param in net.parameters():
        param.requires_grad = False
    return net

def get_cifar10():
    transform = transforms.Compose(
        [transforms.ToTensor(),
         transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
    trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=4, shuffle=True, num_workers=2)
    classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
    return trainloader, classes

def combine_nets(net_train,net_no_train,net_place_holder):
    '''
        Combine nets in a way that the train net stays trainable.
    '''
    params_train = net_no_train.named_parameters()
    dict_params_place_holder = dict( net_place_holder.named_parameters() )
    dict_params_no_train = dict(net_train.named_parameters())
    for name,param_train in params_train:
        if name in dict_params_place_holder:
            param_no_train = dict_params_no_train[name]
            delattr(net_place_holder, name)
            W_new = param_train + param_no_train # notice addition is just chosen for the sake of an example
            setattr(net_place_holder, name, W_new)
    return net_place_holder

def combining_nets_lead_to_error():
    '''
    Intention is to only train the net with trainable params.
    Placeholder net is a dummy net, it doesn't actually do anything except hold the combination of params and it's the
    net that does the forward pass on the data.
    '''
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    ''' create three musketeers '''
    net_train = nn.Sequential(OrderedDict([
          ('conv1', nn.Conv2d(1,20,5)),
          ('relu1', nn.ReLU()),
          ('conv2', nn.Conv2d(20,64,5)),
          ('relu2', nn.ReLU())
        ])).to(device)
    net_no_train = copy.deepcopy(net_train).to(device)
    net_place_holder = copy.deepcopy(net_train).to(device)
    ''' prepare train, hyperparams '''
    trainloader,classes = get_cifar10()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net_train.parameters(), lr=0.001, momentum=0.9)
    ''' train '''
    net_train.train()
    net_no_train.eval()
    net_place_holder.eval()
    for epoch in range(2):  # loop over the dataset multiple times
        running_loss = 0.0
        for i, (inputs, labels) in enumerate(trainloader, 0):
            optimizer.zero_grad() # zero the parameter gradients
            inputs, labels = inputs.to(device), labels.to(device)
            # combine nets
            net_place_holder = combine_nets(net_train,net_no_train,net_place_holder)
            #
            outputs = net_place_holder(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            # print statistics
            running_loss += loss.item()
            if i % 2000 == 1999:  # print every 2000 mini-batches
                print('[%d, %5d] loss: %.3f' %
                      (epoch + 1, i + 1, running_loss / 2000))
                running_loss = 0.0
    ''' DONE '''
    print('Done \a')

if __name__ == '__main__':
    combining_nets_lead_to_error()

@ptrblck I know it's sort of invasive to just tag you here… but I was wondering if you knew how to answer this question? I've not been able to get it to work for some time and was hoping some PyTorch expert had some insight into what's going wrong… :frowning: No pressure though… :slight_smile:

Nested attributes are sometimes hard to handle.
You could change the inner code of combine_nets like this as a workaround:

if name in dict_params_place_holder:
    param_no_train = dict_params_no_train[name]
    parent, child = name.split('.')
    delattr(getattr(net_place_holder, parent), child)
    W_new = param_train + param_no_train # notice addition is just chosen for the sake of an example
    setattr(getattr(net_place_holder, parent), child, W_new)

This should solve the error.
However, it seems your input has 3 channels, while your first conv layer only takes 1.
You should probably change this as well.
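Note that the split assumes single-level names like 'conv1.weight' (a tiny sketch of what it does):

name = 'conv1.weight'
parent, child = name.split('.')   # parent='conv1', child='weight'
# getattr(net_place_holder, parent) then returns the Conv2d module and
# delattr/setattr act on its 'weight' attribute. Deeper names such as
# 'block.0.conv.weight' would need a different lookup (e.g. walking the
# dotted path with repeated getattr calls).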


Actually @ptrblck, I had already fixed that part… my apologies for not updating the code.

Regardless, it seems the training is still not working. What I did to debug it was set the trainable net to a standard initialization and the non-trainable one to zero. Then the placeholder just computes f(W_nontrain + W_train) = f(0 + W_train) = f(W_train), which should train like standard SGD, but I don't get any change in the loss, which must mean something is wrong. Did you manage to get it to work?
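Roughly what I did for that zeroing check, as a sketch (net_no_train here is just a stand-in module):

import torch
from torch import nn

net_no_train = nn.Conv2d(1, 20, 5)   # stand-in for the frozen net
with torch.no_grad():                # zero the frozen weights so W_nontrain + W_train == W_train
    for p in net_no_train.parameters():
        p.zero_()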

Thanks so much for giving this a check. I appreciate it :slight_smile:

I’ve debugged your code a bit and it seems that some names are a bit confusing.
In combine_nets it seems you are mixing up the train/no_train naming scheme:

params_train = net_no_train.named_parameters()
dict_params_no_train = dict(net_train.named_parameters())

However, the other issue is that your dict_params_place_holder will be empty once we set the new attributes.
Most likely my suggestion to use setattr(getattr(...)) was wrong.
The parameters are not registered anymore, and thus the function doesn't do anything at all after the first run.
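A small sketch of the effect (with a made-up module):

from torch import nn

conv = nn.Conv2d(1, 20, 5)
print(len(dict(conv.named_parameters())))   # 2: 'weight' and 'bias'

w_new = conv.weight.detach() + 1.0          # a plain tensor, not an nn.Parameter
delattr(conv, 'weight')
setattr(conv, 'weight', w_new)              # stored as a normal attribute, not registered as a parameter

print(len(dict(conv.named_parameters())))   # 1: 'weight' is gone, so the next combine pass finds nothing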


Oh damn, that's embarrassing. I forgot that I had already fixed that issue too (of course, only after so many days). I feel really bad about that. It definitely requires an apology…

Hmm… does that mean that what I am trying to do is not possible? Or is there a way to make it work?

Brando, Brando… now you owe me a beer… or two. :wink:

Could you post the most recent code then?