Reasons for different learning rates in pytorch tutorial?

Just curious… why do these two scripts in the PyTorch tutorial use very different learning rates (1e-6 in the autograd example vs. 1e-4 in the nn-module example), when they're essentially doing the same thing?
I changed torch.nn.Linear(D_in, H) to torch.nn.Linear(D_in, H, bias=False) in the nn-module example, to match the no-bias setting in the autograd example.

If I swap the learning rates between the autograd example and the nn-module example, the loss either decreases very slowly or blows up.

I noticed the initial losses are very different (about 3e7 in the autograd example vs. 600 in the nn-module example). I guess this could be the reason for the different learning rate settings, but why does this happen?
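The scale gap follows from the weight initialization: torch.randn (used in the autograd example) draws standard-normal weights, so a wide layer's outputs have variance on the order of D_in, while nn.Linear's default init is far smaller. A minimal sketch of the effect (a single layer; the shapes are illustrative, not necessarily the tutorial's exact values):

```python
import torch

torch.manual_seed(0)
N, D_in, D_out = 64, 1000, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Standard-normal weights, as torch.randn produces in the autograd example:
# each output has variance ~ D_in, so the summed squared error is on the
# order of N * D_out * D_in.
w = torch.randn(D_in, D_out)
loss_randn = (x @ w - y).pow(2).sum()

# nn.Linear's default (much smaller-scale) init: outputs have variance
# well below 1, so the initial loss is orders of magnitude smaller.
lin = torch.nn.Linear(D_in, D_out, bias=False)
loss_default = (lin(x) - y).pow(2).sum()

print(f"randn init loss:   {loss_randn.item():.3e}")
print(f"default init loss: {loss_default.item():.3e}")
```

With a much larger initial loss, the gradients are correspondingly larger, which is why the autograd example needs the much smaller 1e-6 learning rate.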

I guess the answer is somewhat obvious, but I really couldn't figure it out.

You can get approximately the same result after initializing the nn.Module's linear layers' weights from a Gaussian distribution.

Try adding this init function to your code and changing the learning rate to 1e-6:

def init_weights(m):
    if type(m) == nn.Linear:
        # draw weights from N(0, 1), matching torch.randn in the autograd example
        m.weight.data.normal_(0, 1)

Additionally, keep the bias=False in your linear layers.
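Putting it together, a sketch of how such an init function is usually applied (the dimensions mirror the tutorial's two-layer setup; the N(0, 1) draw imitates the torch.randn initialization from the autograd example):

```python
import torch.nn as nn

def init_weights(m):
    # reinitialize only the Linear layers' weights from N(0, 1)
    if type(m) == nn.Linear:
        m.weight.data.normal_(0, 1)

D_in, H, D_out = 1000, 100, 10
model = nn.Sequential(
    nn.Linear(D_in, H, bias=False),
    nn.ReLU(),
    nn.Linear(H, D_out, bias=False),
)
model.apply(init_weights)  # .apply() visits every submodule recursively
```

nn.Module.apply is the idiomatic way to run a per-module initializer over a whole model, which is why the function above dispatches on the module type.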


Then what weight initialization method does the nn.Module's linear layer use by default?
And how do you know there is a normal_() method on m.weight.data? I tried checking the methods of Variable and nn.Linear and searched the docs, but couldn't find this normal_() method.

The .data attribute is just a tensor, and normal_() is documented under torch.Tensor. It makes sense for Variable and Parameter not to have such a method, since they shouldn't be modified in-place; if you do, it is (often) impossible to track gradients.
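For reference, normal_() is one of the in-place methods on torch.Tensor (the trailing underscore is PyTorch's convention for in-place operations). A quick check:

```python
import torch

t = torch.empty(100_000)
t.normal_(mean=0.0, std=1.0)  # fill t in-place with draws from N(0, 1)
print(t.mean().item(), t.std().item())  # both close to the requested moments
```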

OK, thanks. Then what is the default weight initialization method in nn modules?

It varies among modules. For example, you can see the one for nn.Linear here.
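For the curious: in recent PyTorch versions (this is version-dependent), nn.Linear.reset_parameters() draws the weights with kaiming_uniform_(a=sqrt(5)), which for a plain Linear layer works out to a uniform distribution on [-1/sqrt(in_features), 1/sqrt(in_features)]. A sketch to verify that numerically:

```python
import math
import torch.nn as nn

lin = nn.Linear(1000, 100, bias=False)
bound = 1 / math.sqrt(lin.in_features)  # expected uniform bound, ~0.0316 here
w = lin.weight

inside = w.abs().max().item() <= bound  # all weights should lie within the bound
print(f"all weights within [-{bound:.4f}, {bound:.4f}]: {inside}")
```

Note how small this bound is compared to a standard normal, which is exactly the initial-loss gap discussed above.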
