What’s the default initialization method for layers like conv, fc, and RNN layers? Are they just initialized to all zeros?
All the layers are implemented in this folder: https://github.com/pytorch/pytorch/tree/master/torch/nn/modules
The initialization depends on the layer; for example, the linear one is here.
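As a quick sanity check (a minimal sketch using only the public API; the exact values will differ per layer and per run), you can construct a few layer types and look at their freshly initialized weights. None of them come out as all zeros, and each layer type uses its own scheme:

import torch.nn as nn

# Each layer type initializes its parameters on construction; nothing is left at zero.
fc = nn.Linear(in_features=100, out_features=10)
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
rnn = nn.RNN(input_size=100, hidden_size=10)

for name, w in [("fc", fc.weight), ("conv", conv.weight), ("rnn", rnn.weight_ih_l0)]:
    # Ranges are non-zero and depend on the layer's fan-in, not on any activation.
    print(name, w.min().item(), w.max().item())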
Thank you so much!
I see only torch.Tensor(...) without any specific initialization method. I wonder what it would be?
In reset_parameters(), the weights are set/reset.
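In other words, the torch.Tensor(...) call only allocates the parameter; the values come from reset_parameters(), which __init__ calls and which you can also call again yourself. A small sketch:

import torch.nn as nn

fc = nn.Linear(in_features=256, out_features=128)

# The weights are already initialized here, because __init__ calls reset_parameters().
print(fc.weight.abs().max().item())

# Calling reset_parameters() again simply re-draws the values with the same scheme,
# which is handy when re-running an experiment with the same module instance.
fc.reset_parameters()
print(fc.weight.abs().max().item())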
Are these initializations basically He init or Xavier init?
Tanh --> Xavier
ReLU --> He
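If it helps, this rule of thumb lines up with the per-nonlinearity gains that torch.nn.init recommends; a quick sketch using the documented init.calculate_gain helper:

import math
from torch.nn import init

# Recommended scaling gains per nonlinearity; these are what the Xavier/He-style
# schemes plug in, e.g. init.xavier_uniform_(w, gain=init.calculate_gain("tanh")).
print(init.calculate_gain("tanh"))   # 5/3
print(init.calculate_gain("relu"))   # sqrt(2)
print(math.sqrt(2))                  # for comparison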
So PyTorch uses He when it’s ReLU? I’m confused about what PyTorch does.
Sorry ptrblck, I’m confused… does PyTorch use Xavier or He depending on the activation? That’s what klory seems to imply, but the code looks as follows:
def reset_parameters(self):
    stdv = 1. / math.sqrt(self.weight.size(1))
    self.weight.data.uniform_(-stdv, stdv)
    if self.bias is not None:
        self.bias.data.uniform_(-stdv, stdv)
Tanh → Xavier
ReLU → He
No, that’s not correct. PyTorch’s initialization is based on the layer type, not the activation function (the layer doesn’t know about the activation at weight-initialization time).
For the linear layer, this would be somewhat similar to He initialization, but not quite:
def reset_parameters(self):
    stdv = 1. / math.sqrt(self.weight.size(1))
    self.weight.data.uniform_(-stdv, stdv)
    if self.bias is not None:
        self.bias.data.uniform_(-stdv, stdv)
I.e., if I remember correctly, He init is “sqrt(6 / fan_in)”, whereas in the PyTorch Linear layer it’s “1. / sqrt(fan_in)”.
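For completeness, if you do want activation-aware initialization along the lines of the Tanh → Xavier / ReLU → He rule above, you have to apply it yourself after construction. A rough sketch using the public torch.nn.init functions and Module.apply (the init_weights helper and the network sizes are made up for illustration):

import torch.nn as nn

def init_weights(module):
    # PyTorch's defaults don't look at the activation, so an activation-aware
    # scheme has to be applied explicitly. Here: He/Kaiming for a ReLU network;
    # for a tanh network one would swap in nn.init.xavier_uniform_ instead.
    if isinstance(module, nn.Linear):
        nn.init.kaiming_uniform_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
model.apply(init_weights)  # .apply() visits every submodule recursively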
Yeah, you’re correct. I just checked their code for linear.py and conv.py; it seems they’re all using Xavier, right? (I got the Xavier explanation from here.)
Doesn’t Xavier also include fan_out, though? Here, I can only see the input channels, not the output channels.
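To make the difference concrete, here is just the arithmetic on the three uniform bounds, with fan_in = 256 and fan_out = 128 as made-up sizes: Xavier uses both fans, He uses only fan_in, and the default Linear bound is yet another scale.

import math

fan_in, fan_out = 256, 128  # made-up layer sizes, for illustration only

xavier_bound = math.sqrt(6.0 / (fan_in + fan_out))  # Glorot/Xavier uniform bound
he_bound = math.sqrt(6.0 / fan_in)                  # He/Kaiming uniform bound
default_bound = 1.0 / math.sqrt(fan_in)             # the reset_parameters() bound shown above

# Prints roughly: xavier=0.1250 he=0.1531 default=0.0625 -> three different scales.
print(f"xavier={xavier_bound:.4f} he={he_bound:.4f} default={default_bound:.4f}")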
So is it just an unpublished, made-up PyTorch init?
Maybe I am overlooking something or don’t recognize it, but I think so.
A validation from someone on the PyTorch team would be nice. Calling for master @SimonW
No, they are actually from well-established published papers. E.g., the linear init is from “Efficient Backprop”, LeCun’99.
Thanks! I appreciate it.
This one, I assume:
Maybe it would be worthwhile adding comments to the docstrings? It would make it easier for the next person to find, plus more convenient to refer to in papers.