How are module weights initialized in PyTorch?


I am trying to compare the performance of two models, one with self-attention layers and one without. All hyperparameters are fixed; the only thing I am testing is including/excluding the attention layers.

My biggest problem is the weight initialization of the convolutional layers. I am using `nn.init.xavier_normal_` to initialize the weights, but I still see roller-coaster performance from run to run.

How can I make the weight initialization reproducible, so that any difference in performance is definitely due to the architecture change and not to the initialization?

BTW: I am using `torch.cuda.manual_seed_all(5)`, but it does not make the initialization reproducible.
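One likely cause: `torch.cuda.manual_seed_all` only seeds the GPU generators, while layers built on the CPU draw their initial weights from the CPU generator, which `torch.manual_seed` seeds. A minimal sketch of seeding all the relevant PRNGs (the helper name `seed_everything` is just illustrative):

```python
import random
import numpy as np
import torch

def seed_everything(seed: int = 5):
    # seed every PRNG a training script may draw from;
    # torch.cuda.manual_seed_all alone does not cover the CPU
    # generator that nn.init uses for weight initialization
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)           # CPU generator
    torch.cuda.manual_seed_all(seed)  # all GPU generators

# two layers created after re-seeding get identical initial weights
seed_everything(5)
a = torch.nn.Linear(4, 4).weight.detach().clone()
seed_everything(5)
b = torch.nn.Linear(4, 4).weight.detach().clone()
print(torch.equal(a, b))  # True
```

Note that seeding alone still may not be enough when the two models differ in architecture, as the answer below explains.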


Setting the seed might not be enough to get exactly the same parameters.
Since one model might have more (or different) layers than the other, the PRNG might be called a different number of times.

I would suggest initializing one model and copying all of its parameters to the other model. This would make sure that at least all common layers have the same parameters.
Here is a small example.
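A minimal sketch of this idea, assuming the two models share the same layer names and only differ by an extra attention block (the model classes here are placeholders, not the poster's actual architectures):

```python
import torch
import torch.nn as nn

class PlainNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3)
        self.fc = nn.Linear(8, 2)

class AttentionNet(nn.Module):
    # same layers plus an extra attention block
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3)
        self.attn = nn.MultiheadAttention(embed_dim=8, num_heads=2)
        self.fc = nn.Linear(8, 2)

model_a = PlainNet()
model_b = AttentionNet()

# strict=False ignores keys that exist in only one of the models
# (here, the attention block), so only the common layers are copied
model_b.load_state_dict(model_a.state_dict(), strict=False)

# the shared layers now start from identical weights
print(torch.equal(model_a.conv.weight, model_b.conv.weight))  # True
```

With this approach the common layers are guaranteed to match regardless of how the PRNG was consumed, and only the attention parameters remain freshly initialized.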