Training two identical models simultaneously doesn't return identical weights/biases?

Hi everyone,

I’m new to PyTorch/ML, so I hope I’m missing something simple. I am trying to create a multi-output neural network for regression, starting with just a single linear layer from the inputs to the outputs, essentially to see how everything works. The dimensionality of the model is 8 input features and 4 outputs.
I eventually got into the weeds with this approach, trying to show that if I train two identical models with that architecture, the resulting weights and biases are the same. I want to track down the sources of randomness within PyTorch so I can control them once I get to bigger models.

The code essentially goes like this:

import torch
from torch import nn, optim
from torch.nn.utils import weight_norm
from torch.utils.data import DataLoader, SequentialSampler

torch.manual_seed(0)
device = torch.device("cpu")  # staying on CPU for this test

#create dataloader from a dataset: 1000 samples with 8 features and 4 outputs each.
dataloader = DataLoader(myDataset, batch_size=500, sampler=SequentialSampler(myDataset))

#Initialize networks
#I just noticed I don't send the networks to the 'device', but I'm on CPU for this test.
NN1 = myNeuralNet(IN=8, OUT=4)
NN2 = myNeuralNet(IN=8, OUT=4)
#There is only one Linear layer in each model, wrapped in weight_norm, so each model should only need one initialization.
#In other words, weight_norm(nn.Linear(IN, OUT)) is the whole network. There are no activation functions.
nn.init.ones_(NN1.layers[0].weight)
nn.init.ones_(NN1.layers[0].bias)
nn.init.ones_(NN2.layers[0].weight)
nn.init.ones_(NN2.layers[0].bias)

#I don't think I need an independent loss fcn for each one, but I'm just being safe
loss_fn1 = nn.MSELoss(reduction='sum')
loss_fn2 = nn.MSELoss(reduction='sum')

optimizer1 = optim.Adam(NN1.parameters())
optimizer2 = optim.Adam(NN2.parameters())

#train both models (defined below)
train_epoch(NN1,dataloader,loss_fn1,optimizer1)
train_epoch(NN2,dataloader,loss_fn2,optimizer2)

#after that single epoch, the networks' weights and biases should be identical, but they are not

def train_epoch(model, dataloader, loss_fn, optimizer):
    model.train()
    for batch_num, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)
        y_hat = model(X)
        loss = loss_fn(y_hat, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return None

#Finally, definition of network setup and forward:
class myNeuralNet(nn.Module):
    def __init__(self, IN, OUT):
        super().__init__()
        self.flatten = nn.Flatten()
        self.layers = nn.Sequential(weight_norm(nn.Linear(IN, OUT, bias=True)))

    def forward(self, x):
        x = self.flatten(x)
        y = self.layers(x)
        return y

And that’s the meat and potatoes of what’s happening. The two networks are barebones and should be identical, but they aren’t. They do eventually converge to within 1e-6 of each other, but I’m confused as to why they aren’t exactly the same epoch for epoch. I would assume that two models with the same data, the same initial weights/biases, the same optimizers, and the same loss functions would update their weights and biases identically, but they don’t.

What am I misunderstanding?

I also want to add that I fully understand that two models with a difference this small will probably perform the same, but I want to understand the inner workings of PyTorch better before I start making those assumptions. I also understand that a neural network without activation functions or hidden layers is probably next to useless, but that is also for the same reason as above.

Thanks in advance!

Your assumptions would only hold if you guarantee that deterministic algorithms are used, which is not the default since they can slow down execution. Check the Randomness docs for more information and for how to enable deterministic algorithms.
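
For a CPU-only run like yours, the relevant switches boil down to something like this (a minimal sketch):

import torch

# Seed the PyTorch RNG and opt into deterministic kernels (can be slower).
torch.manual_seed(0)
torch.use_deterministic_algorithms(True)

# On the GPU you would additionally need e.g. CUBLAS_WORKSPACE_CONFIG=:4096:8
# for some cuBLAS ops, but that doesn't apply to a CPU-only run.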

Ah, I missed that function in the docs. Unfortunately, after adding torch.use_deterministic_algorithms(True) to my script, the two models still don’t update the same way. I even changed the setup slightly so that they are trained concurrently on the same batches from the dataloader (with separate optimizers, loss functions, etc.), but that didn’t help either. With the deterministic algorithms and the fixed seed, the weight updates for the two models are the same every run, even though the two models still end up with different weights from each other.

I traced the issue to this line in my code:

NN1 = myNeuralNet(IN=8,OUT=4)
NN2 = myNeuralNet(IN=8,OUT=4)

If I simply remove the initialization of NN1, then NN2 ends up with the same weights that NN1 had in previous runs. However, if both networks are initialized (even if NN1 is not being trained in the loop), NN2 ends up with a different weight matrix. I thought that initializing the weights and biases to the same numbers would be enough, but is there some other source of randomness within nn.Module that I’m not expecting? I haven’t added any non-deterministic functions to myNeuralNet from other packages.
edit: I should also mention that I am calling super().__init__() for the parent class of myNeuralNet, nn.Module. Perhaps there is something in there causing this, despite the deterministic algorithms?

No, initializing two different model objects creates two different parameter sets, since the parameters are randomly initialized, as you have already observed in your previous post. You could re-seed the code, load the state_dict of one model into the other one, or manually initialize the parameters to the same values (as you have done before).
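
For example (a rough sketch, reusing your myNeuralNet class from above):

torch.manual_seed(0)
NN1 = myNeuralNet(IN=8, OUT=4)

# Option 1: re-seed right before the second construction so it draws the same random numbers.
torch.manual_seed(0)
NN2 = myNeuralNet(IN=8, OUT=4)

# Option 2: copy all parameters (including weight_g / weight_v) from one model into the other.
NN2.load_state_dict(NN1.state_dict())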

Yeah, those are good suggestions. Thank you for being patient. Something I might have glossed over is the use of weight normalization on the layer. It seems that even when I initialize the Linear layer’s weight and bias, the v and g tensors from weight_norm are not altered. From what I understand, the v and g tensors are the actual subjects of optimization, with the final weight matrix just being computed from them.
Should I also initialize weight_v and weight_g in my code, or should I call a separate function/hook to recompute those tensors after initializing the weights and biases?
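
To illustrate what I mean, here’s a small sketch with a standalone wrapped layer (same imports as above):

layer = weight_norm(nn.Linear(8, 4, bias=True))
# The registered parameters are bias, weight_g, and weight_v; the plain 'weight'
# is recomputed from weight_g and weight_v before each forward pass, so writing
# to layer.weight directly does not affect what the optimizer actually updates.
print([name for name, _ in layer.named_parameters()])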

Update: It just took initializing weight_v and weight_g correctly. I did this by initializing the Linear layer’s weight and bias before wrapping it in weight_norm and adding it to the Sequential container; weight_norm then derives the two tensors from those initialized values.
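
Roughly, that change looks like this (a sketch of the constructor):

class myNeuralNet(nn.Module):
    def __init__(self, IN, OUT):
        super().__init__()
        self.flatten = nn.Flatten()
        linear = nn.Linear(IN, OUT, bias=True)
        # Initialize *before* applying weight_norm, so weight_g and weight_v
        # are derived from these values.
        nn.init.ones_(linear.weight)
        nn.init.ones_(linear.bias)
        self.layers = nn.Sequential(weight_norm(linear))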

That fixes my issue, for now. Thanks :grinning:!