Reasons for different learning rates in pytorch tutorial?

Just curious… why do these two scripts in the PyTorch tutorial use very different learning rates (1e-6 in the autograd example vs. 1e-4 in the nn-module example), when they're essentially doing the same thing?
I changed torch.nn.Linear(D_in, H) to torch.nn.Linear(D_in, H, bias=False) in the nn-module example, to match the no-bias setting in the autograd example.

If I swap the learning rates between the autograd example and the nn-module example, the loss either decreases very slowly or blows up.

I noticed the initial losses are very different (about 3e7 in the autograd example vs. 600 in the nn-module example). I guess this could be the reason for the different learning rate settings, but why does this happen?
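The scale gap follows from the weight initialization: torch.randn (used in the autograd example) draws standard-normal weights, so a wide layer's outputs have variance on the order of D_in, while nn.Linear's default init is far smaller. A minimal sketch of the effect (a single layer; the shapes are illustrative, not necessarily the tutorial's exact values):

```python
import torch

torch.manual_seed(0)
N, D_in, D_out = 64, 1000, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Standard-normal weights, as torch.randn produces in the autograd example:
# each output has variance ~ D_in, so the summed squared error is on the
# order of N * D_out * D_in.
w = torch.randn(D_in, D_out)
loss_randn = (x @ w - y).pow(2).sum()

# nn.Linear's default (much smaller-scale) init: outputs have variance
# well below 1, so the initial loss is orders of magnitude smaller.
lin = torch.nn.Linear(D_in, D_out, bias=False)
loss_default = (lin(x) - y).pow(2).sum()

print(f"randn init loss:   {loss_randn.item():.3e}")
print(f"default init loss: {loss_default.item():.3e}")
```

With a much larger initial loss, the gradients are correspondingly larger, which is why the autograd example needs the much smaller 1e-6 learning rate.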

I guess the answer is somewhat obvious, but I really couldn't figure it out.

You can get approximately the same result after initializing the nn.Module's linear layers' weights from a Gaussian distribution.

Try adding this init function to your code and changing the learning rate to 1e-6:

def init_weights(m):
    if type(m) == nn.Linear:
        # draw weights from N(0, 1), matching torch.randn in the autograd example
        m.weight.data.normal_(0, 1)

Additionally, keep the bias=False in your linear layers.
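Putting it together, a sketch of how such an init function is usually applied (the dimensions mirror the tutorial's two-layer setup; the N(0, 1) draw imitates the torch.randn initialization from the autograd example):

```python
import torch.nn as nn

def init_weights(m):
    # reinitialize only the Linear layers' weights from N(0, 1)
    if type(m) == nn.Linear:
        m.weight.data.normal_(0, 1)

D_in, H, D_out = 1000, 100, 10
model = nn.Sequential(
    nn.Linear(D_in, H, bias=False),
    nn.ReLU(),
    nn.Linear(H, D_out, bias=False),
)
model.apply(init_weights)  # .apply() visits every submodule recursively
```

nn.Module.apply is the idiomatic way to run a per-module initializer over a whole model, which is why the function above dispatches on the module type.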


Then what weight initialization method does the nn.Module's linear layer use by default?
And how do you know there is a normal_() method on m.weight.data? I tried checking the methods of Variable and nn.Linear and searched the docs, but couldn't find this normal_() method.

The .data attribute is just a tensor, and normal_() is documented under torch.Tensor. It makes sense for Variable and Parameter not to have such a method, since they shouldn't be modified in-place; if you do, it is (often) impossible to track gradients.
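For reference, normal_() is one of the in-place methods on torch.Tensor (the trailing underscore is PyTorch's convention for in-place operations). A quick check:

```python
import torch

t = torch.empty(100_000)
t.normal_(mean=0.0, std=1.0)  # fill t in-place with draws from N(0, 1)
print(t.mean().item(), t.std().item())  # both close to the requested moments
```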

OK, thanks. Then what is the default weight initialization method in nn modules?

It varies among modules. For example, you can see the one for nn.Linear here.
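For the curious: in recent PyTorch versions (this is version-dependent), nn.Linear.reset_parameters() draws the weights with kaiming_uniform_(a=sqrt(5)), which for a plain Linear layer works out to a uniform distribution on [-1/sqrt(in_features), 1/sqrt(in_features)]. A sketch to verify that numerically:

```python
import math
import torch.nn as nn

lin = nn.Linear(1000, 100, bias=False)
bound = 1 / math.sqrt(lin.in_features)  # expected uniform bound, ~0.0316 here
w = lin.weight

inside = w.abs().max().item() <= bound  # all weights should lie within the bound
print(f"all weights within [-{bound:.4f}, {bound:.4f}]: {inside}")
```

Note how small this bound is compared to a standard normal, which is exactly the initial-loss gap discussed above.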
