Unused model parameters affect optimization for Adam

Sorry for not being clear enough.
By PRNG I mean the Pseudorandom Number Generator.
The ordering just matters for the sake of debugging, as we are dealing with pseudorandom numbers.

In order to compare the weights and gradients, we should make sure both models have the same parameters.
One way would be to initialize one model and copy the parameters into the other.
Another way is to seed the PRNG for both models and just sample the same “random” numbers.
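
Both approaches could look like this small sketch (the plain linear layers are just placeholders for your models and the shapes are made up for the example):

import torch
import torch.nn as nn

# two models with the same architecture
model_a = nn.Linear(10, 2)
model_b = nn.Linear(10, 2)

# option 1: copy the parameters of model_a into model_b
model_b.load_state_dict(model_a.state_dict())
print(torch.equal(model_a.weight, model_b.weight))
> True

# option 2: seed the PRNG before creating each model,
# so both sample the same "random" init values
torch.manual_seed(2809)
model_c = nn.Linear(10, 2)
torch.manual_seed(2809)
model_d = nn.Linear(10, 2)
print(torch.equal(model_c.weight, model_d.weight))
> True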

You can think about seeding the random number generation as setting a start value. All “random” numbers will be the same after setting the same seed:

torch.manual_seed(2809)  # set the start value of the PRNG
print(torch.randn(5))
> tensor([-2.0748,  0.8152, -1.1281,  0.8386, -0.4471])
print(torch.randn(5))
> tensor([-0.5538, -0.8776, -0.5635,  0.5434, -0.8192])

torch.manual_seed(2809)  # re-seed: the PRNG restarts from the same state
print(torch.randn(5))
> tensor([-2.0748,  0.8152, -1.1281,  0.8386, -0.4471])
print(torch.randn(5))
> tensor([-0.5538, -0.8776, -0.5635,  0.5434, -0.8192])

Although we call torch.randn again, we get exactly the same sequence of “random” numbers after re-seeding.
Now if you add the unused layers before the linear layer, the PRNG will receive additional calls to sample the parameters of these layers, which shifts its state and thus changes the values sampled for the linear layer parameters.
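
This small sketch shows the effect (the unused conv layer and all shapes are just made up for the example):

import torch
import torch.nn as nn

torch.manual_seed(2809)
lin = nn.Linear(10, 2)  # linear layer created directly after seeding

torch.manual_seed(2809)
unused = nn.Conv2d(3, 6, 3)  # unused layer consumes "random" numbers first
lin_shifted = nn.Linear(10, 2)

# the linear layer parameters now differ, although the same seed was used
print(torch.equal(lin.weight, lin_shifted.weight))
> False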

Usually, you don’t have to think about these issues. As I said, it’s just needed to debug your issue.
