The same seed but different running results on two executions

lizhidan · June 1, 2018, 9:51am

I just wrote a simple model to classify CIFAR-10, following the method in this tutorial: https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html#sphx-glr-beginner-transfer-learning-tutorial-py

I ran it twice with the same seed:

torch.manual_seed(60)
torch.cuda.manual_seed(60)

I also set shuffle=False on the dataset loader and use no transforms that could introduce randomness.
In addition, my network parameters are loaded from existing weights to avoid random initialization.

But during training, as the epochs increase, the differences between the network weights of the two executions become more and more obvious (after the first epoch the differences are only about 1e-9, after the second about 1e-7, and they keep growing).
Why does this happen? Is it caused by something like floating-point computation error, or are there still some random variables I have overlooked? Thanks in advance.

ptrblck · June 1, 2018, 10:26am

If you are using cuDNN, you should set the deterministic behavior.
This might make your code quite slow, but it can be a good way to check your code; you can deactivate it again later.

torch.backends.cudnn.deterministic = True
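
A minimal sketch of what a fuller seeding setup might look like, assuming a reasonably recent PyTorch. The helper name seed_everything is hypothetical and not part of PyTorch; the individual calls inside it are existing PyTorch and standard-library functions. Disabling cudnn.benchmark is an extra step beyond the flag above, on the assumption that the autotuner may select different kernels between runs.

```python
import random

import torch


def seed_everything(seed: int) -> None:
    """Hypothetical helper: seed the Python and PyTorch RNGs and request deterministic cuDNN."""
    random.seed(seed)                 # Python's built-in RNG
    torch.manual_seed(seed)           # PyTorch CPU RNG
    torch.cuda.manual_seed_all(seed)  # all GPUs; silently ignored when CUDA is unavailable
    torch.backends.cudnn.deterministic = True  # ask cuDNN for deterministic algorithms
    torch.backends.cudnn.benchmark = False     # turn off the cuDNN autotuner


# Re-seeding should reproduce the same random numbers on the same device.
seed_everything(60)
first = torch.rand(3)
seed_everything(60)
second = torch.rand(3)
print(torch.equal(first, second))
```

Even with all of this set, some GPU operations may still be nondeterministic, so this is a starting point for debugging rather than a guarantee.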

lizhidan · June 1, 2018, 11:34am

Thanks for your reply, but it still can’t solve my problem…

yvanscher · June 1, 2018, 12:01pm

Is your weight initialization determined by this random seed? If it’s not, that could explain the difference.

lizhidan · June 1, 2018, 1:47pm

Hmm, as I described, my model loads existing weights that were trained before, so the initialization is fixed without any random factors.

ptrblck · June 1, 2018, 1:57pm

Are you using multiple workers in your DataLoader or any other random functions e.g. from numpy?

lizhidan · June 1, 2018, 2:02pm

Yes, I used multiple workers, as below:

train_loader = DataLoader(cifar10.Cifar10(mode='train', dataset_size=DATASET_SIZE, binclassify=None), shuffle=False, batch_size=BATCH_SIZE, num_workers=BATCH_SIZE)
test_loader = DataLoader(cifar10.Cifar10(mode='test', dataset_size=DATASET_SIZE, binclassify=None), shuffle=False, batch_size=BATCH_SIZE, num_workers=BATCH_SIZE)
validation_loader = DataLoader(cifar10.Cifar10(mode='validation', dataset_size=DATASET_SIZE, binclassify=None), shuffle=False, batch_size=BATCH_SIZE, num_workers=BATCH_SIZE)

But I didn’t find any other random functions being used anywhere.

ptrblck · June 1, 2018, 2:15pm

Could you, just for debugging purposes, set num_workers=1 and see if the first few iterations differ in a similar way?
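
If worker processes do turn out to matter, one possible way to pin down their randomness is sketched below. It assumes a PyTorch version whose DataLoader accepts the worker_init_fn and generator arguments; seed_worker, make_loader, and the toy dataset are hypothetical names for illustration and are not taken from the code in this thread. The demo uses shuffle=True and num_workers=0 only so the effect of the seeded generator is visible and the snippet runs anywhere.

```python
import random

import torch
from torch.utils.data import DataLoader, TensorDataset


def seed_worker(worker_id: int) -> None:
    # Inside a worker, torch.initial_seed() returns that worker's base seed;
    # reuse it for Python's RNG (NumPy could be seeded the same way).
    random.seed(torch.initial_seed() % 2**32)


def make_loader(dataset, batch_size: int, num_workers: int = 0, seed: int = 60) -> DataLoader:
    # A dedicated generator fixes the shuffling order and the workers' base seeds.
    generator = torch.Generator()
    generator.manual_seed(seed)
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        worker_init_fn=seed_worker,
        generator=generator,
    )


# Toy stand-in dataset: the integers 0..19.
toy = TensorDataset(torch.arange(20))
order_a = [batch[0].tolist() for batch in make_loader(toy, batch_size=4)]
order_b = [batch[0].tolist() for batch in make_loader(toy, batch_size=4)]
print(order_a == order_b)
```

Raising num_workers in make_loader would exercise seed_worker in each worker process; the batch order should still be governed by the seeded generator.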

Deeply · June 1, 2018, 2:42pm

Clearly, these randomizers do not generate the same sequence; here’s an example:

torch.cuda.manual_seed(60)

torch.cuda.FloatTensor(1).normal_()
Out[94]: tensor([ 0.7700], device='cuda:0')

torch.cuda.FloatTensor(1).normal_()
Out[95]: tensor([ 0.5048], device='cuda:0')

torch.cuda.manual_seed(60)

torch.cuda.FloatTensor(1).normal_()
Out[97]: tensor([ 0.7700], device='cuda:0')

torch.manual_seed(60)
Out[98]: <torch._C.Generator at 0x7f537c0320b0>

torch.randn(1)
Out[99]: tensor([ 0.7534])

torch.randn(1)
Out[100]: tensor([ 1.8541])

torch.manual_seed(60)
Out[101]: <torch._C.Generator at 0x7f537c0320b0>

torch.randn(1)
Out[102]: tensor([ 0.7534])

Or am I missing something here?

ptrblck · June 1, 2018, 2:47pm

The seeds work for the CPU and GPU separately, but they cannot generate the same random numbers on the CPU and the GPU.
torch.manual_seed(SEED) will also seed the GPU, but the PRNGs used on the GPU and the CPU are different. The code should nevertheless yield deterministic results when running on a given device. As far as I know, it is currently not possible to get the same random numbers on different devices. It’s probably comparable to seeding in PyTorch vs. numpy: both will yield deterministic results, but not the same numbers.
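
A small sketch of the two points above: re-seeding reproduces the same numbers on one device, and one possible workaround for cross-device agreement is to sample on the CPU and then move the tensor. The workaround is an assumption added here for illustration, not something proposed in this thread.

```python
import torch

# Same seed, same device: the sequence repeats.
torch.manual_seed(60)
cpu_first = torch.randn(3)
torch.manual_seed(60)
cpu_second = torch.randn(3)
print(torch.equal(cpu_first, cpu_second))

# Workaround sketch: draw from a seeded CPU generator, then copy to the target device.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
gen = torch.Generator()
gen.manual_seed(60)
shared = torch.randn(3, generator=gen)
on_device = shared.to(device)

# The values are identical because they were only copied, not re-sampled.
print(torch.equal(shared, on_device.cpu()))
```

The cost is a host-to-device copy for every sampled tensor, so this is mainly useful when comparing CPU and GPU runs during debugging.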

hughperkins · June 1, 2018, 9:25pm

@Deeply Yes. I too would prefer that the PRNG be consistent between CPU and GPU, as I allude to in “Best practices for seeding random numbers on gpu?”

lizhidan · June 2, 2018, 1:42am

That still doesn’t solve it.
Is there anything wrong with my checking process, described below?
I ran my code twice on the GPU. After optimizer.step(), I used torch.save(model[1]._parameters['weight'].cpu().data, 'w1') to save the weight to disk (in the second run I changed w1 to w2). Then I loaded the two weights using torch.load('w1') and torch.load('w2'), subtracted them, and checked whether the results are all 0.
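
The comparison step described above could be sketched roughly as follows. The helper name max_abs_diff is hypothetical, and the demo writes two toy tensors to a temporary directory instead of using real training runs.

```python
import os
import tempfile

import torch


def max_abs_diff(path_a: str, path_b: str) -> float:
    """Load two saved tensors and return the largest absolute element-wise difference."""
    a = torch.load(path_a)
    b = torch.load(path_b)
    return (a - b).abs().max().item()


# Toy demonstration with synthetic "weights" standing in for the saved w1 and w2.
with tempfile.TemporaryDirectory() as tmp:
    w1 = os.path.join(tmp, "w1")
    w2 = os.path.join(tmp, "w2")
    weight = torch.randn(4, 4)
    torch.save(weight, w1)
    torch.save(weight + 1e-3, w2)  # deliberately perturbed copy
    diff_same = max_abs_diff(w1, w1)
    diff_other = max_abs_diff(w1, w2)

print(diff_same, diff_other)
```

Reporting the maximum absolute difference rather than only checking for all zeros also shows how large the drift is after each epoch.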

ptrblck · June 2, 2018, 8:57am

Looks good to me. To search for the problematic part, could you repeat this procedure with random tensors as input, i.e. don’t use your Dataset and DataLoader?
Since you are seeding, the random tensor should be the same in each run.
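
A rough sketch of that experiment, assuming a tiny hypothetical model in place of the real network and running on the CPU: train twice from the same seed on seeded random tensors, then compare the resulting weights.

```python
import torch
import torch.nn as nn


def train_once(seed: int, steps: int = 5) -> torch.Tensor:
    """Train a toy linear classifier on seeded random data and return its weights."""
    torch.manual_seed(seed)  # fixes both the initialization and the random inputs
    model = nn.Linear(8, 2)  # hypothetical stand-in for the real model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(steps):
        inputs = torch.randn(16, 8)            # random tensors instead of a DataLoader
        targets = torch.randint(0, 2, (16,))   # random class labels
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(inputs), targets)
        loss.backward()
        optimizer.step()
    return model.weight.detach().clone()


run_a = train_once(60)
run_b = train_once(60)
print((run_a - run_b).abs().max().item())
```

If the two runs match here but not with the real Dataset and DataLoader, the data pipeline becomes the main suspect; if they differ on the GPU even with random tensors, nondeterministic kernels are the more likely cause.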

EnCI · May 7, 2020, 11:16am

Hi! I’m very new to PyTorch. When trying to compare models, I found that the outputs are different.
Simple CNN:

import torch.nn as nn
import torch.nn.functional as F

class NetOne(nn.Module):
    def __init__(self):
        super(NetOne, self).__init__()
        self.c1 = 16
        self.c2 = 8
        self.c3 = 4
        self.size = 32
        self.fclen1 = 419
        self.fclen2 = 10
        self.conv1 = nn.Conv2d(3, self.c1, 3, padding=1)
        self.conv2 = nn.Conv2d(self.c1, self.c2, 3, padding=1)
        self.conv3 = nn.Conv2d(self.c2, self.c3, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(self.c1)
        self.bn2 = nn.BatchNorm2d(self.c2)
        self.bn3 = nn.BatchNorm2d(self.c3)
        # fc1 maps the flattened conv features directly to the class scores
        self.fc1 = nn.Linear(self.c3 * self.size * self.size, self.fclen2)
        # fc2 is defined but never used in forward()
        self.fc2 = nn.Linear(self.fclen1, self.fclen2)
        self.dropout = nn.Dropout(0.25)

    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        x = self.dropout(x)
        x = F.relu(self.bn2(self.conv2(x)))
        x = F.relu(self.bn3(self.conv3(x)))
        x = x.view(-1, self.c3 * self.size * self.size)
        x = self.fc1(x)
        return x

And using it:

model1 = NetOne()
model2 = NetOne()

…
for _, (data, target) in enumerate(train_loader):
    if train_on_gpu:
        data1, target1 = data.cuda(), target.cuda()
        data2, target2 = data.cuda(), target.cuda()
    else:
        data1, target1 = data, target
        data2, target2 = data, target
    optimizer1.zero_grad()
    optimizer2.zero_grad()
    output1 = model1(data1)
    output2 = model2(data2)
    for i in range(len(s1)):
        if s1[i] != s2[i]:
            print('loss error %d : %f %f' % (iterNum, s1[i], s2[i]))

gives a lot of lines like:

loss error 0 : -0.132742 0.350880
loss error 1 : -26.758894 -30.239246
loss error 2 : -10.001531 -12.009613

Can you suggest what’s wrong, please?

ptrblck · May 8, 2020, 3:42am

Based on your posted code you are initializing two models randomly, so the results are expected to be different.
To get the same results, you should either set the seed before creating an instance of the model or load the state_dict from one model into the other (which I would recommend).
Also, since the models are using dropout layers, you would have to call model.eval() on them to disable it.
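
A minimal sketch of the recommended approach, using a small hypothetical TinyNet rather than the NetOne class above: copy the state_dict from one model into the other, switch both to eval mode, and compare the outputs on the same input.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical small CNN with batch norm and dropout, for illustration only."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 4, 3, padding=1)
        self.bn = nn.BatchNorm2d(4)
        self.dropout = nn.Dropout(0.25)
        self.fc = nn.Linear(4 * 8 * 8, 10)

    def forward(self, x):
        x = self.dropout(torch.relu(self.bn(self.conv(x))))
        return self.fc(x.flatten(1))


model1 = TinyNet()
model2 = TinyNet()                           # different random initialization
model2.load_state_dict(model1.state_dict())  # copy parameters and buffers from model1

model1.eval()  # disables dropout and uses batch norm running statistics
model2.eval()

data = torch.randn(2, 3, 8, 8)
with torch.no_grad():
    same = torch.equal(model1(data), model2(data))
print(same)
```

If both models are left in training mode instead, dropout samples a different mask on each forward pass, so the outputs would generally differ even with identical weights.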