Different results on two servers

Has anyone run into the problem where two servers are set up with the same environment (via pip install -r packages.txt or conda env export environment.yaml), yet the performance of the trained networks is very different?

By the way, I use Adam.

Hi,

Running the same code on different machines can give different results.
If you run the code multiple times on a single machine with different random seeds, is the performance always the same, or does it vary with the seed as well?
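For example, a minimal sketch of that test (the exact seeding calls and determinism flags your script needs are an assumption; adapt them to your code):

import random
import numpy as np
import torch

def set_seed(seed):
    # Seed every RNG that commonly affects training.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Ask cuDNN for deterministic kernels (slower, but runs become comparable).
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

for seed in (0, 1, 2):
    set_seed(seed)
    # ... run training and evaluation here, then compare the final metrics ...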

Hi,

I set all seeds to the same value.
I have run the code several times on the two servers, in both single-GPU and two-GPU mode. In both modes, the results from the two servers are far apart.
I wonder if this is caused by the Adam algorithm?

That can be caused by many things.
If your training is not stable, it could be because of your data, loss, optimizer, or model. Basically anything could be the cause of this :confused:

It’s horrible! I have been testing for a long time: CUDA version, NVIDIA driver version, packages… but nothing works. :frowning:

At the very beginning I hadn’t noticed this; I just thought my modified model didn’t perform well. Then I trained the baseline model on this server, and it fell far short of the reported baseline performance. After that, I checked the training logs and found that none of the modified models trained on this server perform well.

Now I know the problem is caused by the environment :rofl:

No, I said the opposite! If testing on a single machine with multiple random seeds gives different performances, the problem is with what you are trying to do, not with CUDA/PyTorch/the packages!

The model I used was cloned from GitHub without any modification. On one server I can get the reported performance.

Can you provide a small script that gives you different performance when run on the different servers? That will make it simpler to see what the problem might be.

Do you mean this?

import time

import torch

# args, model, optimizer, TrainImgLoader, TestImgLoader, train(), test() and
# adjust_learning_rate() are defined earlier in main.py (see the link below).

def main():

    start_full_time = time.time()
    for epoch in range(1, args.epochs + 1):
        print('This is %d-th epoch' % (epoch))
        total_train_loss = 0
        adjust_learning_rate(optimizer, epoch)

        ## training ##
        for batch_idx, (imgL_crop, imgR_crop, disp_crop_L) in enumerate(TrainImgLoader):
            start_time = time.time()

            loss = train(imgL_crop, imgR_crop, disp_crop_L)
            print('Iter %d training loss = %.3f , time = %.2f' % (batch_idx, loss, time.time() - start_time))
            total_train_loss += loss
        print('epoch %d total training loss = %.3f' % (epoch, total_train_loss / len(TrainImgLoader)))

        # SAVE the checkpoint for this epoch
        savefilename = args.savemodel + '/checkpoint_' + str(epoch) + '.tar'
        torch.save({
            'epoch': epoch,
            'state_dict': model.state_dict(),
            'train_loss': total_train_loss / len(TrainImgLoader),
        }, savefilename)

    print('full training time = %.2f HR' % ((time.time() - start_full_time) / 3600))

    # ------------- TEST ------------------------------------------------------
    total_test_loss = 0
    for batch_idx, (imgL, imgR, disp_L) in enumerate(TestImgLoader):
        test_loss = test(imgL, imgR, disp_L)
        print('Iter %d test loss = %.3f' % (batch_idx, test_loss))
        total_test_loss += test_loss

    print('total test loss = %.3f' % (total_test_loss / len(TestImgLoader)))
    # --------------------------------------------------------------------------
    # SAVE test information
    savefilename = args.savemodel + 'testinformation.tar'
    torch.save({
        'test_loss': total_test_loss / len(TestImgLoader),
    }, savefilename)


if __name__ == '__main__':
    main()
(project github: https://github.com/JiaRenChang/PSMNet/blob/master/main.py)
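One detail this snippet does not show is how TrainImgLoader and TestImgLoader are built. If they are standard torch.utils.data.DataLoaders with shuffling and several workers, per-worker seeding is one place where runs can legitimately diverge. A hedged sketch of pinning that down (the dataset object, batch size, and worker count below are placeholders, not the repo's actual values):

import random
import numpy as np
import torch
from torch.utils.data import DataLoader

def seed_worker(worker_id):
    # Derive each worker's Python/NumPy seed from the base torch seed,
    # so augmentations done inside worker processes are reproducible.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

TrainImgLoader = DataLoader(
    train_dataset,               # placeholder: the repo's training dataset object
    batch_size=12,               # placeholder value
    shuffle=True,
    num_workers=8,               # placeholder value
    worker_init_fn=seed_worker,  # pin per-worker seeding
)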

I forgot to tell you that I’m not the only one suffering from this. One of my colleagues still can’t achieve the reported performance either.

But that can come from so many things before it is a library/hardware problem :smiley:
You might want to reduce the example to the smallest thing that works on one machine and not on the other.
You want to rule out things like misconfiguration when you launch the job, data being preprocessed differently from one machine to the other, etc. A sketch of one way to start is below.
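As a concrete starting point, a small check that helps rule the environment in or out: print the stack versions on both servers and compare one deterministic forward pass on a fixed dummy input. The input shape and the left/right call signature are assumptions based on the repo, and 'model' stands for whatever network you load in main.py; adapt as needed.

import torch

# 1) Fingerprint the software stack on each server.
print('torch  :', torch.__version__)
print('cuda   :', torch.version.cuda)
print('cudnn  :', torch.backends.cudnn.version())
print('gpu    :', torch.cuda.get_device_name(0))

# 2) Compare a single deterministic forward pass across the servers.
torch.manual_seed(0)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

imgL = torch.randn(1, 3, 256, 512).cuda()   # dummy left image (assumed shape)
imgR = torch.randn(1, 3, 256, 512).cuda()   # dummy right image
model.eval()                                # 'model' = the network loaded as in main.py
with torch.no_grad():
    out = model(imgL, imgR)
print('forward checksum:', out.float().sum().item())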

Good advice! Thanks a lot!

Dear @Mata_Fu,
Did you solve the problem?
I have the same problem.