Same weight parameters given twice to optimizer

I have the following network:

import torch.nn as nn
from torchvision import models

resnet18 = models.resnet18(pretrained=True)
fc_ftrs = resnet18.fc.in_features
resnet18.fc = nn.Linear(fc_ftrs, self.numClasses)  # self.numClasses: number of output classes

I want to use a small learning rate for the base of my network (fine-tuning) and a different one for the fully connected layer.

If my optimizer is defined as follows:

RMSprop([{'params': resnet18.parameters(), 'lr': 1e-6}, {'params': resnet18.fc.parameters(), 'lr': 5e-4}])

will both learning rates be added up for the fully connected layer, or will the second one override the first?

[Edit: The original suggestion is broken, my apologies, see below!]
Did you actually try?
Any recent version of PyTorch should give you an error.
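For illustration, a minimal sketch of what happens when both overlapping groups are passed in (the exact error wording may differ between PyTorch versions):

import torch.optim as optim
from torchvision import models

resnet18 = models.resnet18(pretrained=True)

# resnet18.parameters() already contains the fc parameters,
# so the fc weights are handed to the optimizer twice here
try:
    optimizer = optim.RMSprop([
        {'params': resnet18.parameters(), 'lr': 1e-6},
        {'params': resnet18.fc.parameters(), 'lr': 5e-4},
    ])
except ValueError as err:
    print(err)  # e.g. "some parameters appear in more than one parameter group"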

You can get rid of that error by using

fc_params = list(resnet18.fc.parameters())
other_params = [p for p in resnet18.parameters() if p not in fc_params]

or so.

Best regards

Thomas


I was actually going to try it, your method is pretty simple!
Thanks, I will use this!

Hi, I tried your approach
and I get the following error:

RuntimeError: The size of tensor a (7) must match the size of tensor b (2048) at non-singleton dimension 3

The size-7 tensor is probably due to the 7x7 convolution in ResNet!

For now I have worked around it with list(resnet18.parameters())[:-2] for the base parameters.
Is there any other suggested approach for this?
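For reference, that slice works because torchvision's resnet18 registers the fc layer last, so fc.weight and fc.bias are the final two entries of parameters(); a small sketch, assuming the resnet18 from the question:

all_params = list(resnet18.parameters())
base_params = all_params[:-2]   # everything before the fully connected layer
fc_params = all_params[-2:]     # fc.weight, fc.bias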

Sorry for posting a wrong solution before.
The reason it does not work as expected is that Python's in operator uses ==, and for tensors that is an elementwise comparison rather than an identity check, which is what produces the size-mismatch error above.
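A minimal sketch of that failure mode (the shapes are only illustrative, roughly a conv weight versus an fc weight):

import torch

a = torch.randn(64, 3, 7, 7)   # e.g. a 7x7 conv weight
b = torch.randn(10, 2048)      # e.g. an fc weight

# the membership test evaluates b == a, which for tensors is an elementwise
# comparison and fails when the shapes cannot be broadcast together
try:
    print(a in [b])
except RuntimeError as err:
    print(err)  # a size-mismatch error like the one reported above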

Using the parameter names will work (you could also hack around it by keeping a set of p.data_ptr() values and filtering by that, but that is ugly…):

fc_params = [p for n, p in resnet18.named_parameters() if n.startswith('fc.')]
other_params = [p for n, p in resnet18.named_parameters() if not n.startswith('fc.')]
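For completeness, a sketch of how the two lists could then be passed to the optimizer (learning rates taken from the original question):

import torch.optim as optim

optimizer = optim.RMSprop([
    {'params': other_params, 'lr': 1e-6},  # pretrained base, small lr
    {'params': fc_params, 'lr': 5e-4},     # newly initialized head
])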

Best regards

Thomas

Do you think it would be simpler if, instead of raising an error when parameters appear in more than one parameter group, the optimizer supported overriding the learning rate?
For my use case it would have been much simpler. Of course it creates the possibility of mistakes on the user's side, but in my opinion the positives outweigh the negatives!
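In the meantime, something close to that override behaviour can be approximated with the data_ptr() filter mentioned above; a sketch, assuming the resnet18 from the question:

import torch.optim as optim

# drop the fc parameters from the base group by comparing storage pointers
fc_ptrs = {p.data_ptr() for p in resnet18.fc.parameters()}
base_params = [p for p in resnet18.parameters() if p.data_ptr() not in fc_ptrs]

optimizer = optim.RMSprop([
    {'params': base_params, 'lr': 1e-6},
    {'params': resnet18.fc.parameters(), 'lr': 5e-4},
])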

To be honest, I think it is a very special application where you need this and don't have it conveniently available.
For example, the fast.ai library (Jeremy Howard advocates a graded learning rate for fine-tuning) sticks the various modules into a Sequential module and then, I think, gets the parameter groups by iterating over the submodules.
The other option is to use the parameter names; there are probably more elegant solutions than the above if you need this in a systematic way.
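As an illustration of that submodule-based idea (a sketch only, not fast.ai's actual API; the split and the learning rates are made up):

import torch.optim as optim
from torch import nn
from torchvision import models

resnet18 = models.resnet18(pretrained=True)
body = nn.Sequential(*list(resnet18.children())[:-1])  # everything up to avgpool
head = resnet18.fc

# one parameter group per piece, with the learning rate growing towards the head
lrs = [1e-6, 5e-4]
param_groups = [{'params': module.parameters(), 'lr': lr}
                for module, lr in zip([body, head], lrs)]
optimizer = optim.RMSprop(param_groups)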