Reasons for different learning rates in pytorch tutorial?

charlielam0615 · November 15, 2017, 9:40am

Just curious: why do these two scripts in the PyTorch tutorial use very different learning rates (1e-6 in autograd vs. 1e-4 in nn-module) when they’re essentially doing the same thing?
http://pytorch.org/tutorials/beginner/pytorch_with_examples.html#autograd
http://pytorch.org/tutorials/beginner/pytorch_with_examples.html#nn-module
I changed torch.nn.Linear(D_in, H) to torch.nn.Linear(D_in, H, bias=False) in the nn-module example to match the no-bias setting in the autograd example.

If I swap the learning rates between the autograd example and the nn-module example, the loss either decreases very slowly or blows up.

I noticed the initial losses are very different (about 3e7 in autograd vs. 600 in nn-module). I guess this could be the reason for the different learning rate settings, but I wonder why it happens.

I guess the answer is fairly obvious, but I really couldn’t figure it out.

ptrblck · November 15, 2017, 2:03pm

You can get approximately the same result by initializing the weights of the nn.Module's linear layers from a Gaussian distribution.

Try adding this init method to your code and changing the learning rate to 1e-6.

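from torch import nn

def init_weights(m):
    # Called once per submodule by model.apply(); only linear layers are changed.
    print(m)
    if isinstance(m, nn.Linear):
        # Overwrite the weights in place with samples from N(0, 1).
        m.weight.data.normal_()
        print(m.weight)

# `model` is the nn.Sequential defined in the nn-module example.
model.apply(init_weights)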

Additionally, keep bias=False in your linear layers.
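
To see how much the initialization alone changes the starting loss, here is a minimal sketch that builds the two-layer network twice, once with the default initialization and once with N(0, 1) weights, and evaluates the summed squared error before any training. The sizes are the ones used in the tutorial; `make_model` and `init_normal` are names made up for this sketch, and it assumes a recent PyTorch release where `nn.MSELoss(reduction="sum")` and `nn.init.normal_` are available.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Sizes from the tutorial: batch size, input, hidden and output dimensions.
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
loss_fn = nn.MSELoss(reduction="sum")


def make_model():
    # Hypothetical helper: same architecture as the nn-module example, no biases.
    return nn.Sequential(
        nn.Linear(D_in, H, bias=False),
        nn.ReLU(),
        nn.Linear(H, D_out, bias=False),
    )


def init_normal(m):
    # Hypothetical helper: redraw linear weights from N(0, 1),
    # matching the torch.randn weights in the autograd example.
    if isinstance(m, nn.Linear):
        nn.init.normal_(m.weight)


default_model = make_model()                    # default nn.Linear initialization
normal_model = make_model().apply(init_normal)  # Gaussian initialization

# Evaluate the loss once, before any training step.
with torch.no_grad():
    default_loss = loss_fn(default_model(x), y).item()
    normal_loss = loss_fn(normal_model(x), y).item()

print(f"initial loss, default init: {default_loss:.3g}")
print(f"initial loss, normal init:  {normal_loss:.3g}")
```

The two numbers should differ by several orders of magnitude, in line with the roughly 600 vs. 3e7 reported above. The gradients scale with the activations and the error, so the Gaussian-initialized network needs a much smaller step size to stay stable.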

charlielam0615 · November 16, 2017, 2:20am

Thanks.
Then what weight initialization method does the nn.Module’s linear layer use by default?
And how do you know there is a normal_() method on m.weight.data? I checked the methods of Variable and nn.Linear and searched the docs, but couldn’t find this normal_() method.

SimonW · November 16, 2017, 2:38am

variable_or_parameter.data is just a tensor. The normal_ method is documented in the tensor docs. It makes sense for Variable and Parameter not to have such a method, as they shouldn’t be modified in-place. If you do modify them in-place, it is (often) impossible to track the gradient.
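
A small sketch of the point about `.data`, assuming a recent PyTorch release (from 0.4 onward Variable and Tensor are merged, so the types printed here may differ from what the 2017 version showed). The layer sizes are arbitrary, and the `torch.no_grad()` variant at the end is an alternative often used today, not something suggested in this thread.

```python
import torch
from torch import nn

layer = nn.Linear(4, 3, bias=False)  # arbitrary small sizes for illustration

# The parameter wraps a tensor; .data exposes that tensor without autograd tracking.
print(type(layer.weight))       # the Parameter class
print(type(layer.weight.data))  # a plain tensor

before = layer.weight.detach().clone()

# In-place methods end with an underscore; normal_() refills the tensor
# with samples from N(0, 1) and thereby changes the layer's weights.
layer.weight.data.normal_()
changed = not torch.equal(before, layer.weight.detach())
print("weights changed:", changed)

# Alternative: modify the parameter itself inside no_grad(),
# so autograd does not record the in-place write.
with torch.no_grad():
    layer.weight.normal_(mean=0.0, std=0.01)
```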

charlielam0615 · November 16, 2017, 6:35am

OK, Thanks. Then what is the default weight initialization method in nn Modules?

SimonW · November 17, 2017, 2:23am

It varies among modules. For example, you can see the one for nn.Linear here https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/linear.py#L48-L52
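
For nn.Linear specifically, the weights appear to be drawn from a uniform distribution on [-1/sqrt(in_features), 1/sqrt(in_features)]. The sketch below checks that empirically rather than quoting the library source, so treat the bound as an observation about the PyTorch version you run it on, not a guarantee. The layer sizes are the ones from the tutorial's first layer.

```python
import math

import torch
from torch import nn

torch.manual_seed(0)

in_features, out_features = 1000, 100
layer = nn.Linear(in_features, out_features, bias=False)
w = layer.weight.detach()

# Expected range if the weights are uniform on [-bound, bound].
bound = 1.0 / math.sqrt(in_features)
# A uniform distribution on [-b, b] has standard deviation b / sqrt(3).
expected_std = bound / math.sqrt(3)

print(f"min weight:   {w.min().item():.5f}")
print(f"max weight:   {w.max().item():.5f}")
print(f"bound:        {bound:.5f}")
print(f"observed std: {w.std().item():.5f}")
print(f"expected std: {expected_std:.5f}")
```

With 1000 inputs the default weights are therefore around 50 times smaller in scale than N(0, 1) samples, which is consistent with the much smaller initial loss in the nn-module example.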