Hi, I’m a beginner and I’m attempting a student-teacher architecture by initializing two networks with different hidden sizes from the same class. It seems the first network’s initialization influences the second one (I get different losses on the student network when I initialize the teacher network first, even though I’m training the student independently of the teacher). I am using BatchNorm after a Linear layer in my network class, so I’m guessing this is what causes the first initialization to influence the second: either the BatchNorm layer or the Linear layer is keeping some running statistics from the first initialization. Any ideas how to solve this? Thanks.
Are you saying that the initial loss is different, or that the networks converge differently when one is initialized after the other? The former is expected: initializing the first network draws numbers from the global random number generator and advances its state, so the second network starts from different random weights. The latter would be unexpected and shouldn’t be caused by the order of initialization; neither BatchNorm nor Linear shares running statistics or any other state between separate module instances.
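A minimal sketch of the RNG effect, using a hypothetical `Net` class (the layer sizes and class name are assumptions, not your actual code):

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    """Hypothetical network: Linear followed by BatchNorm, as described above."""
    def __init__(self, hidden):
        super().__init__()
        self.fc = nn.Linear(10, hidden)
        self.bn = nn.BatchNorm1d(hidden)

torch.manual_seed(0)
student_alone = Net(32)        # student initialized on its own

torch.manual_seed(0)
teacher = Net(64)              # consumes numbers from the global RNG...
student_after = Net(32)        # ...so this student gets different initial weights

# False: the initial weights differ only because the RNG state changed,
# not because the teacher shares any state with the student.
print(torch.equal(student_alone.fc.weight, student_after.fc.weight))
```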
If your issue is the former, you can call torch.manual_seed(seed) before each initialization to prevent one from “influencing” the other. See also: Reproducibility — PyTorch 2.0 documentation
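Something like the following, again with the same hypothetical `Net` class and an arbitrary seed value:

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    """Same hypothetical network as in the snippet above."""
    def __init__(self, hidden):
        super().__init__()
        self.fc = nn.Linear(10, hidden)
        self.bn = nn.BatchNorm1d(hidden)

SEED = 42                      # assumed seed value; pick whatever you like

torch.manual_seed(SEED)
teacher = Net(64)

torch.manual_seed(SEED)        # re-seed so the student's init no longer depends
student = Net(32)              # on whether the teacher was constructed first

# The student's initial weights are now identical whether or not
# the teacher network was initialized beforehand.
```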