Different results on two servers

Has anyone run into this problem: two servers with the same setup (installed with pip install -r packages.txt or from a conda env export environment.yaml), but the performance of the trained networks is very different?

By the way, I use Adam.


Running the same code on different machines can give different results.
If you run the code multiple times on a single machine with different random seeds, is the performance always the same? Or does it vary with the random seed as well?


I set all seeds to the same value.
I have run the code several times on both servers, in single-GPU mode and in two-GPU mode. In both modes, the performance on the two servers is far apart.
I wonder if it is caused by the Adam algorithm?
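
For reference, "setting all seeds" in a PyTorch project usually covers something like the sketch below. The helper name seed_everything is made up for illustration, and the NumPy and PyTorch parts are guarded so the snippet still runs where they are not installed. Even with all of this, some GPU operations are nondeterministic, so identical seeds do not guarantee identical results across different hardware or library builds.

```python
import random


def seed_everything(seed):
    """Seed every RNG a typical PyTorch training script touches.

    Hypothetical helper for illustration; not part of PSMNet or PyTorch.
    """
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)           # default generator
        torch.cuda.manual_seed_all(seed)  # all GPUs; silently ignored without CUDA
        # Ask cuDNN for deterministic algorithms and disable autotuning,
        # which can pick different kernels from run to run.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # PyTorch not installed


seed_everything(1)
first = random.random()
seed_everything(1)
second = random.random()
print(first == second)  # the Python RNG repeats after reseeding
```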

That can be caused by many things.
If your training is not stable, that can be because of your data, loss, optimizer, model... Basically anything could be the cause of this :confused:

It’s horrible! I have been testing for a long time: CUDA version, NVIDIA driver version, packages… but nothing works. :frowning:

At the very beginning, I didn’t notice this. I just thought my modified model didn’t perform well. Then I trained the baseline model on the server and the result was really far from the reported baseline performance. After that, I checked the training logs and found that none of the modified models trained on this server perform well.

Now I know the problem is caused by the environment :rofl:

No, I said the opposite! If testing on a single machine with multiple random seeds gives different performance, the problem is with what you are trying to do, not with CUDA/PyTorch/packages!
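
A rough way to check this on one machine is to repeat the run with a few seeds and look at the spread. This is only a sketch: summarize_over_seeds and fake_run are hypothetical names, and fake_run returns made-up numbers in place of a real training run that returns the final test loss.

```python
import random
import statistics


def summarize_over_seeds(run_once, seeds):
    """Call run_once(seed) for each seed and summarize the returned metric."""
    scores = [run_once(seed) for seed in seeds]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "spread": max(scores) - min(scores),
    }


def fake_run(seed):
    # Stand-in for "train with this seed and return the final test loss".
    # The values are invented; replace this with a real training call.
    rng = random.Random(seed)
    return 1.0 + rng.uniform(-0.1, 0.1)


summary = summarize_over_seeds(fake_run, seeds=[0, 1, 2, 3, 4])
print(summary)
```

If the spread on a single server is already as large as the gap between the two servers, the gap by itself is not evidence of an environment problem.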

The model I used is cloned from GitHub without any modification. On one server I can get the reported performance.

Can you provide a small script that gives you different performance when run on the different servers? That will make it simpler to see what might be the problem.

You mean this?

import time

import torch

# Excerpt from PSMNet's main.py: args, model, train, test, TrainImgLoader and
# TestImgLoader are defined earlier in that file.
def main():

    start_full_time = time.time()
    for epoch in range(1, args.epochs+1):
        print('This is %d-th epoch' %(epoch))
        total_train_loss = 0

        ## training ##
        for batch_idx, (imgL_crop, imgR_crop, disp_crop_L) in enumerate(TrainImgLoader):
            start_time = time.time()

            loss = train(imgL_crop,imgR_crop, disp_crop_L)
            print('Iter %d training loss = %.3f , time = %.2f' %(batch_idx, loss, time.time() - start_time))
            total_train_loss += loss
        print('epoch %d total training loss = %.3f' %(epoch, total_train_loss/len(TrainImgLoader)))

        # Save a checkpoint after every epoch
        savefilename = args.savemodel+'/checkpoint_'+str(epoch)+'.tar'
        torch.save({
            'epoch': epoch,
            'state_dict': model.state_dict(),
            'train_loss': total_train_loss/len(TrainImgLoader),
        }, savefilename)

    print('full training time = %.2f HR' %((time.time() - start_full_time)/3600))

    #------------- TEST ------------------------------------------------------------
    total_test_loss = 0
    for batch_idx, (imgL, imgR, disp_L) in enumerate(TestImgLoader):
        test_loss = test(imgL,imgR, disp_L)
        print('Iter %d test loss = %.3f' %(batch_idx, test_loss))
        total_test_loss += test_loss

    print('total test loss = %.3f' %(total_test_loss/len(TestImgLoader)))
    #SAVE test information
    savefilename = args.savemodel+'testinformation.tar'
    torch.save({
        'test_loss': total_test_loss/len(TestImgLoader),
    }, savefilename)

(project github: https://github.com/JiaRenChang/PSMNet/blob/master/main.py)

I forgot to tell you that I’m not the only one suffering from this. One of my colleagues still can’t reach the reported performance either.

But that can come from so many things before being a library/hardware problem :smiley:
You might want to reduce the example to the minimum thing that works on one machine and not the other.
You want to rule out things like misconfiguration when you launch the job, data being preprocessed differently from one machine to the other etc.
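
One possible way to rule those out is to dump a small report on each server and diff the two outputs. This is a sketch under assumptions: environment_fingerprint and file_digests are hypothetical helper names, the package list is only an example, and the dataset path in the comment is a placeholder.

```python
import hashlib
import json
import platform
import sys
from importlib import metadata
from pathlib import Path


def environment_fingerprint(packages):
    """Collect interpreter, OS, and package versions for later diffing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed on this machine
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


def file_digests(root, pattern="*"):
    """Map each file under root to its SHA-256, to spot differing data."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob(pattern)):
        if path.is_file():
            # Reads the whole file into memory; fine for a quick check.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            digests[str(path.relative_to(root))] = digest
    return digests


if __name__ == "__main__":
    report = environment_fingerprint(["torch", "torchvision", "numpy"])
    # To include the data, add e.g.:
    # report["data"] = file_digests("/path/to/dataset")  # placeholder path
    print(json.dumps(report, indent=2, sort_keys=True))
```

Saving the output to a file on each server and diffing the two files shows at a glance whether the package versions or any data file differ.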

Good advice! Thanks a lot!

Dear @Mata_Fu,
Did you solve the problem?
I have the same problem.