Recovering from a checkpoint, accuracy drops by about 3-5 percent?

Hi all, I have tried both torch.save(network, filename) and torch.save(network.state_dict(), filename). Both approaches show the same problem: when I save the best-accuracy checkpoint during training and later restore it, the recovered network is always 3-5 percent lower in accuracy, and it needs some hours to reach the same accuracy again. In some cases it never reaches the best performance again. I feel disappointed!
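For reference, the two approaches I tried look roughly like this (simplified):

	# approach 1: pickle the whole module
	torch.save(network, filename)
	network = torch.load(filename)

	# approach 2: save and load only the parameters (state_dict)
	torch.save(network.state_dict(), filename)
	network.load_state_dict(torch.load(filename))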
What’s wrong? Thanks.

Interesting, what kind of layers does your model contain?

Do you load the best checkpoint and try to continue the training?
If so, what optimizer are you using and how are you handling it?
Are you recreating the optimizer, or saving/loading it like your model?
Some optimizers have internal state (e.g. running estimates that act like a per-parameter learning rate), which will be reinitialized when the optimizer is recreated; that might explain the issue.
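For example, a minimal sketch of saving and restoring both the model and the optimizer together (the file name and dict keys are just illustrative):

	# save model and optimizer state in a single checkpoint file
	torch.save({
		'model_state': net.state_dict(),
		'optim_state': optimizer.state_dict(),
	}, 'best_checkpoint.pth')

	# restore: recreate net and optimizer first, then load both state dicts
	checkpoint = torch.load('best_checkpoint.pth')
	net.load_state_dict(checkpoint['model_state'])
	optimizer.load_state_dict(checkpoint['optim_state'])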

Thanks for your help. This is my simplified source code.
My network contains Conv2d, BatchNorm2d, BatchNorm1d, ReLU, Sigmoid and Linear layers.

  1. I load the best checkpoint, since I only save the best checkpoint.
  2. I test it first, before the backprop step in the main loop.
  3. I recreate my optimizer and network and call load_state_dict.
  4. I call network.eval() before testing.

Please see the following code to find my problem.

	import os

	import torch
	from torch import optim
	from torch.autograd import Variable

	net = Naive5(n_way, k_shot, imgsz).cuda()

	# restore the best checkpoint if it exists, otherwise train from scratch
	# (strict=False silently ignores missing/unexpected keys)
	if os.path.exists(mdl_file):
		print('load checkpoint ...', mdl_file)
		net.load_state_dict(torch.load(mdl_file), strict=False)
	else:
		print('training from scratch.')

	# the optimizer is recreated from scratch; its state is not checkpointed
	optimizer = optim.Adam(net.parameters(), lr=lr)

	# main training loop (outer epoch loop omitted in this simplified excerpt)
	for step, batch in enumerate(db):

		# 1. test before the backward pass (net.eval() is set for testing, see point 4 above)
		accuracy = test_accuracy()
		print('<<<<>>>>accuracy:', accuracy, 'best accuracy:', best_accuracy)

		# 2. train
		support_x = Variable(batch[0]).cuda()
		support_y = Variable(batch[1]).cuda()
		query_x = Variable(batch[2]).cuda()
		query_y = Variable(batch[3]).cuda()

		net.train()
		loss = net(support_x, support_y, query_x, query_y)
		total_train_loss += loss.data[0]

		optimizer.zero_grad()
		loss.backward()
		optimizer.step()
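If the recreated optimizer is indeed the culprit, one hypothetical way to extend the loop above is to checkpoint the optimizer together with the model whenever a new best accuracy is reached (the dict keys and the comparison below are only a sketch):

	# hypothetical: persist model and optimizer state on a new best accuracy
	if accuracy > best_accuracy:
		best_accuracy = accuracy
		torch.save({
			'net': net.state_dict(),
			'optimizer': optimizer.state_dict(),
			'best_accuracy': best_accuracy,
		}, mdl_file)

The loading code at the top would then need to read the 'net' and 'optimizer' entries from that dict (and call optimizer.load_state_dict) instead of loading the raw state_dict directly.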