I am trying to compare the performance of two models, one with self-attention layers and one without. All hyperparameters are fixed; the only thing I am varying is whether the attention layers are included.
My biggest problem is the weight initialization of the convolutional layers. I am using `nn.init.xavier_normal_` to initialize the weights, but the performance still swings wildly from run to run.
How can I make the weight initialization reproducible, so that any difference in performance is definitely due to the architecture change and not to the initialization?
BTW: I am already calling `torch.cuda.manual_seed_all(5)`, but it has not made the runs any more reproducible.
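
For context, here is a minimal sketch of my current setup that reproduces the issue (the conv shapes and the bias init are just illustrative, not my real architecture):

```python
import torch
import torch.nn as nn

torch.cuda.manual_seed_all(5)  # the only seed I am setting at the moment

# Toy stand-in for my real network (layer shapes are illustrative)
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
)

def init_weights(m: nn.Module) -> None:
    # Xavier-normal init for every conv layer, as described above
    if isinstance(m, nn.Conv2d):
        nn.init.xavier_normal_(m.weight)
        if m.bias is not None:
            nn.init.zeros_(m.bias)  # bias init here is illustrative

model.apply(init_weights)

# Checksum of the initial weights: this value changes on every run,
# even though the CUDA seed above is fixed.
print(sum(p.sum().item() for p in model.parameters()))
```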