Just curious… why do these two scripts in the PyTorch tutorial have very different learning rates (1e-6 in the autograd example vs. 1e-4 in the nn-module example), when they're essentially doing the same thing?
I changed torch.nn.Linear(D_in, H) to torch.nn.Linear(D_in, H, bias=False) in the nn-module example, to keep it consistent with the no-bias setting in the autograd example.
If I interchange the learning rates between the autograd example and the nn-module example, the loss either decreases very slowly or blows up.
I noticed the initial losses are very different (about 3e7 in the autograd example vs. 600 in the nn-module example). I guess this could be the reason for the different learning rate settings, but I wonder why this happens.
I guess the answer is somewhat obvious, but I really couldn't figure it out.
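To make the comparison concrete, here is a minimal sketch of the two initializations side by side, using the tutorial's dimensions (written against current PyTorch, without the Variable wrappers the tutorial uses):

```python
import torch
import torch.nn as nn

N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# autograd example: weights drawn from a standard normal
w1 = torch.randn(D_in, H)
w2 = torch.randn(H, D_out)
loss_autograd = (x.mm(w1).clamp(min=0).mm(w2) - y).pow(2).sum()
print(loss_autograd)  # on the order of 3e7

# nn-module example: nn.Linear's default initialization (small values)
model = nn.Sequential(
    nn.Linear(D_in, H, bias=False),
    nn.ReLU(),
    nn.Linear(H, D_out, bias=False),
)
loss_nn = nn.MSELoss(reduction='sum')(model(x), y)
print(loss_nn)  # on the order of 600
```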
You can get approximately the same result after initializing the nn.Module's linear layers' weights using a Gaussian distribution. Try adding an init method like the one sketched below to your code and change the learning rate to 1e-6. Additionally, keep bias=False in your linear layers.
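A minimal sketch of such an init method (the name init_weights is illustrative, and the model mirrors the tutorial's two-layer network):

```python
import torch.nn as nn

def init_weights(m):
    # Draw each linear layer's weights from a standard normal, matching
    # the torch.randn(...) initialization used in the autograd example.
    if type(m) == nn.Linear:
        m.weight.data.normal_(0, 1)

D_in, H, D_out = 1000, 100, 10
model = nn.Sequential(
    nn.Linear(D_in, H, bias=False),
    nn.ReLU(),
    nn.Linear(H, D_out, bias=False),
)
model.apply(init_weights)  # applies init_weights recursively to every submodule
```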
Then what weight initialization method does the nn.Module's linear layer use by default?
And how do you know there is a normal_() method on m.weight.data? I tried checking the methods of Variable and nn.Linear and searched the docs, but I couldn't find this normal_() method.
variable_or_parameter.data is just a tensor. The normal_ method is documented under torch.Tensor. It makes sense for Variable and Parameter not to have such a method, as they shouldn't be modified in-place; if you do, it is (often) impossible to track gradients.
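For example, a quick check (the layer sizes here are arbitrary):

```python
import torch.nn as nn

layer = nn.Linear(4, 3, bias=False)
print(type(layer.weight))       # <class 'torch.nn.parameter.Parameter'>
print(type(layer.weight.data))  # <class 'torch.Tensor'>

# In-place tensor methods end with an underscore and live on torch.Tensor;
# calling normal_ on .data modifies the weights without autograd tracking.
layer.weight.data.normal_(0, 1)
```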
OK, thanks. Then what is the default weight initialization method in nn.Linear?