Is my network learning? Training loss changed but validation accuracy stayed the same

I am building a network with an LSTM encoder for sentence embedding and a two-layer MLP classifier with a softmax output. The loss is cross-entropy. My training loss seems to decrease, while the validation accuracy stays the same. I printed out the classifier output and realized that all samples produce the same probabilities over the 5 classes. I have tried tuning the learning rate and changing the optimizer, but neither helped. Could anyone tell me what might be happening and what I should check next?

My LSTM encoder works fine for other problems, so I don’t think there is a bug in the architecture.

TRAIN EPOCH 0 LOSS 1.6094570651566242 TIME 3.570338030656179
TRAIN EPOCH 1 LOSS 1.6094265852944325 TIME 3.553918214639028
TRAIN EPOCH 2 LOSS 1.6094126903404624 TIME 3.553771432240804
TRAIN EPOCH 3 LOSS 1.6094204059428414 TIME 3.558300856749217
TRAIN EPOCH 4 LOSS 1.6094172122114796 TIME 3.561331601937612
VALIDATING EPOCH 4 ACC 0.19394736842105262
Update save model
TRAIN EPOCH 5 LOSS 1.609414727000867 TIME 3.5560521999994914
TRAIN EPOCH 6 LOSS 1.609416576428602 TIME 3.553495530287425
VALIDATING EPOCH 6 ACC 0.19394736842105262
TRAIN EPOCH 7 LOSS 1.6094154085816637 TIME 3.603946268558502
TRAIN EPOCH 8 LOSS 1.6094178372183763 TIME 3.600656755765279
VALIDATING EPOCH 8 ACC 0.19394736842105262
TRAIN EPOCH 9 LOSS 1.6094167562527846 TIME 3.5670180916786194
TRAIN EPOCH 10 LOSS 1.6094169845688815 TIME 3.558741509914398
VALIDATING EPOCH 10 ACC 0.19394736842105262
TRAIN EPOCH 11 LOSS 1.6094162363117024 TIME 3.572726861635844
TRAIN EPOCH 12 LOSS 1.609419370775169 TIME 3.5575267990430195
VALIDATING EPOCH 12 ACC 0.19394736842105262
TRAIN EPOCH 13 LOSS 1.6094190575982217 TIME 3.5639585892359418
TRAIN EPOCH 14 LOSS 1.6094135887878762 TIME 3.5603323658307393
VALIDATING EPOCH 14 ACC 0.19394736842105262
TRAIN EPOCH 15 LOSS 1.6094177766034832 TIME 3.560251947244008
TRAIN EPOCH 16 LOSS 1.6094157561070501 TIME 3.5713677604993186
VALIDATING EPOCH 16 ACC 0.19394736842105262
TRAIN EPOCH 17 LOSS 1.6094183625474487 TIME 3.5744518558184306
TRAIN EPOCH 18 LOSS 1.6094157695770264 TIME 3.56392617225647
VALIDATING EPOCH 18 ACC 0.19394736842105262
TRAIN EPOCH 19 LOSS 1.6094196570121635 TIME 3.549365504582723
>>> net.out
tensor([[0.1995, 0.1999, 0.1995, 0.2009, 0.2003],
        [0.1995, 0.1999, 0.1995, 0.2009, 0.2003],
        [0.1995, 0.1999, 0.1995, 0.2009, 0.2003],
        [0.1995, 0.1999, 0.1995, 0.2009, 0.2003],
        [0.1995, 0.1999, 0.1995, 0.2009, 0.2003],
        [0.1995, 0.1999, 0.1995, 0.2009, 0.2003],
        [0.1995, 0.1999, 0.1995, 0.2009, 0.2003],
        [0.1995, 0.1999, 0.1995, 0.2009, 0.2003],
        [0.1995, 0.1999, 0.1995, 0.2009, 0.2003],
        [0.1995, 0.1999, 0.1995, 0.2009, 0.2003],
        [0.1995, 0.1999, 0.1995, 0.2009, 0.2003],
        [0.1995, 0.1999, 0.1995, 0.2009, 0.2003],
        [0.1995, 0.1999, 0.1995, 0.2009, 0.2003],
        [0.1995, 0.1999, 0.1995, 0.2009, 0.2003],
        [0.1995, 0.1999, 0.1995, 0.2009, 0.2003],
        [0.1995, 0.1999, 0.1995, 0.2009, 0.2003],
        [0.1995, 0.1999, 0.1995, 0.2009, 0.2003],
        [0.1995, 0.1999, 0.1995, 0.2009, 0.2003],
        [0.1995, 0.1999, 0.1995, 0.2009, 0.2003],
        [0.1995, 0.1999, 0.1995, 0.2009, 0.2003],
        [0.1995, 0.1999, 0.1995, 0.2009, 0.2003],
        [0.1995, 0.1999, 0.1995, 0.2009, 0.2003],
        [0.1995, 0.1999, 0.1995, 0.2009, 0.2003],
        [0.1995, 0.1999, 0.1995, 0.2009, 0.2003],
        [0.1995, 0.1999, 0.1995, 0.2009, 0.2003],
        [0.1995, 0.1999, 0.1995, 0.2009, 0.2003],
        [0.1995, 0.1999, 0.1995, 0.2009, 0.2003],
        [0.1995, 0.1999, 0.1995, 0.2009, 0.2003],
        [0.1995, 0.1999, 0.1995, 0.2009, 0.2003]], device='cuda:0',
       grad_fn=<SoftmaxBackward>)

Could you post your training loop?

Thanks for reminding me to post the code.
BTW, the train/valid split is 6k/1k.

Here is the Trainer definition:

class Trainer(nn.Module):
	def __init__(self, hidden_size, embed_dim, device, optim_cfg, n_layer=2, n_classes=5):
		super(Trainer, self).__init__()
		self.clsf = Clsf(hidden_size, 5).to(device)
		self.encoder = Encoder(hidden_size, embed_dim, device).to(device)
		self.device = device 
		self.cfg = optim_cfg 
		self.loss_fn = nn.CrossEntropyLoss().to(device) 

	def create_optimizer(self):
		self.model_params = list(self.encoder.parameters()) + list(self.clsf.parameters())
		if "optimizer_type" not in self.cfg:
			# default Adam 
			self.optimizer = optim.Adam(self.model_params, 
					   lr=self.cfg['lr'], 
					   weight_decay=self.cfg['weight_decay'])
		else:
			opt_type = self.cfg["optimizer_type"]
			if opt_type == "SGD":
				self.optimizer = torch.optim.SGD(
						self.model_params, lr=self.cfg['lr'],
						momentum=self.cfg.get("momentum", 0.9),
						weight_decay=self.cfg.get("weight_decay", 0.0))
			elif opt_type == "Adadelta":
				self.optimizer = torch.optim.Adadelta(self.model_params, lr=self.cfg['lr'])
			elif opt_type == "RMSprop":
				self.optimizer = torch.optim.RMSprop(self.model_params, lr=self.cfg['lr'])
			else:
				raise NotImplementedError(
					"Not supported optimizer [{}]".format(opt_type))
		print("CREATED optimizer ")

	def forward(self, batch, mode="train"):
		gold_labels, batch_output = self.encoder(batch)
		self.enc_out = batch_output
		self.optimizer.zero_grad() 
		s1s, s2s, s3s, s4s, s5s = [], [], [], [], [] 
		for _b, b in enumerate(batch_output):
			s1, s2, s3, s4, s5 = b 
			s1s.append(torch.mean(s1, dim=1)) 
			s2s.append(torch.mean(s2,dim=1)) 
			s3s.append(torch.mean(s3,dim=1)) 
			s4s.append(torch.mean(s4, dim=1)) 
			s5s.append(torch.mean(s5, dim=1))  

		x1 = torch.cat(s1s) 
		x2 = torch.cat(s2s) 
		x3 = torch.cat(s3s) 
		x4 = torch.cat(s4s) 
		x5 = torch.cat(s5s) 

		self.mlp_in = torch.cat([x1, x1-x2, x1-x3, x1-x4, x1-x5], dim=-1) 
		clsf_out = self.clsf(self.mlp_in) 
		self.out = F.softmax(clsf_out, dim=-1)  
		loss = self.loss_fn(self.out, gold_labels.to(self.device))
		pred = torch.argmax(clsf_out, dim=1)
		correct = (pred == gold_labels.to(self.device)).sum().item()
		loss.backward()
		self.optimizer.step()
		return loss.item(), correct  

	def validate(self, testdb):
		# eval model 
		#self.eval() 
		ks = list(testdb.all_sent_data.keys())
		accs = 0 
		cnt = 0 
		for k in ks:
			sample =[testdb[k]]
			if None in sample:
				continue 
			data_batch = my_collate_fn(sample, self.device, testdb.wordvecs)
			gold_tensors, pad_adj_matrices, anchors, stories_tensor, stories_vecs, allens  = data_batch
			cnt +=1 
			with torch.no_grad():
				gold_labels, batch_output = self.encoder(data_batch)
				s1s, s2s, s3s, s4s, s5s = [], [], [], [], [] 
				for _b, b in enumerate(batch_output):
					s1, s2, s3, s4, s5 = b 
					s1s.append(torch.mean(s1, dim=1)) 
					s2s.append(torch.mean(s2, dim=1)) 
					s3s.append(torch.mean(s3, dim=1)) 
					s4s.append(torch.mean(s4, dim=1)) 
					s5s.append(torch.mean(s5, dim=1)) 

				x1 = torch.cat(s1s) 
				x2 = torch.cat(s2s) 
				x3 = torch.cat(s3s) 
				x4 = torch.cat(s4s) 
				x5 = torch.cat(s5s) 

				mlp_in = torch.cat([x1, x1-x2, x1-x3, x1-x4, x1-x5], dim=-1) 
				clsf_out = self.clsf(mlp_in) 
				self.out = F.softmax(clsf_out, dim=-1)  
				pred = torch.argmax(self.out).item()
				if pred == gold_tensors.item():
					accs +=1 
		return accs / cnt 

And here is my training loop:

data_dir = "../data/"
dset = Dataset(data_dir, "train", "../../glove.6B.100d.txt") 
testdset = Dataset(data_dir, "test", "../../glove.6B.100d.txt") 
device= torch.device("cpu")
bsize = 32
val_every, val_after = 2, 2 
optim_cfg = {"lr":1e-4, "weight_decay":0.9}
net = Trainer(100,100, device,optim_cfg)
net.create_optimizer() 
ks = list(dset.all_sent_data.keys())
bad_batch = json.load(open("bad_batch.json","r")) 
n_iters = int(len(ks) / bsize) 
best_acc = 0 
for ep in range(0,20):
	now = time.time()
	losses = [] 
	accs = [] 
	ks_perm = np.random.permutation(ks) 
	for _iter in range(n_iters):
		batch = [] 
		batch = [dset[k] for k in ks_perm[_iter*bsize : (_iter+1)*bsize] if (None not in dset[k]) and (k not in bad_batch)]
		data_batch = my_collate_fn(batch,device, dset.wordvecs)
		loss, correct = net(data_batch)
		losses.append(loss)
		accs.append(correct)
	end = time.time() 
	mins = (end - now) / 60 
	print("TRAIN EPOCH {} LOSS {} ACC {} TIME {}".format(ep, np.mean(losses),np.mean(accs), mins))
	if (ep > val_after) and (ep % val_every == 0):
		net.eval()
		acc = net.validate(testdset)
		print("VALIDATING EPOCH {} ACC {}".format(ep, acc))
		if acc > best_acc:
			best_acc = acc  
		net.train() 

Ok, one problem is that you apply a softmax to the output of the network, but nn.CrossEntropyLoss already applies a (log-)softmax internally, so you end up applying it twice. Try deleting the softmax on your output and see if that helps. You can also just apply it after computing the loss so you can still use it for the predictions. You also don't need to put your loss function on a device.
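
Concretely, the relevant lines of your Trainer.forward would become something like this (just a sketch, keeping everything else as you posted it):

clsf_out = self.clsf(self.mlp_in)                             # raw logits, shape (batch, 5)
loss = self.loss_fn(clsf_out, gold_labels.to(self.device))    # CrossEntropyLoss expects raw logits
self.out = F.softmax(clsf_out, dim=-1)                        # probabilities, only for inspection
pred = torch.argmax(clsf_out, dim=1)                          # argmax over logits == argmax over probabilities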


Thanks for catching that! I removed the softmax before my loss function and reran the training. However, I still don't think the network is learning; see the losses and accuracies I printed out:

TRAIN EPOCH 0 LOSS 1.6104545269982289 ACC 5.745762711864407 TIME 2.13284482161204
TRAIN EPOCH 1 LOSS 1.6099774817288932 ACC 5.745762711864407 TIME 2.1346530238787333
TRAIN EPOCH 2 LOSS 1.6099520632102664 ACC 5.740112994350283 TIME 2.134521810213725
TRAIN EPOCH 3 LOSS 1.6099088434445656 ACC 5.745762711864407 TIME 2.1318265636761984
TRAIN EPOCH 4 LOSS 1.609805136750647 ACC 5.745762711864407 TIME 2.1283810138702393
VALIDATING EPOCH 4 ACC 0.20394736842105263
UPDATING ACC
TRAIN EPOCH 5 LOSS 1.609727140873839 ACC 5.745762711864407 TIME 2.1273571451505027
TRAIN EPOCH 6 LOSS 1.6096392135835638 ACC 5.745762711864407 TIME 2.12272922595342
VALIDATING EPOCH 6 ACC 0.20394736842105263
TRAIN EPOCH 7 LOSS 1.6095740916365284 ACC 5.740112994350283 TIME 2.134061749776204
TRAIN EPOCH 8 LOSS 1.609515246698412 ACC 5.745762711864407 TIME 2.1415396173795065
VALIDATING EPOCH 8 ACC 0.20394736842105263
TRAIN EPOCH 9 LOSS 1.60947663837907 ACC 5.745762711864407 TIME 2.154924046993256
TRAIN EPOCH 10 LOSS 1.6094451630856357 ACC 6.016949152542373 TIME 2.134658928712209
VALIDATING EPOCH 10 ACC 0.19394736842105262
TRAIN EPOCH 11 LOSS 1.6094372844965446 ACC 6.598870056497175 TIME 2.1410026868184406
TRAIN EPOCH 12 LOSS 1.6094372851700434 ACC 6.598870056497175 TIME 2.1313098430633546
VALIDATING EPOCH 12 ACC 0.19394736842105262
TRAIN EPOCH 13 LOSS 1.609437289211036 ACC 6.598870056497175 TIME 2.12858776251475
TRAIN EPOCH 14 LOSS 1.6094372919050313 ACC 6.593220338983051 TIME 2.1309844811757404
VALIDATING EPOCH 14 ACC 0.19394736842105262
TRAIN EPOCH 15 LOSS 1.609437289211036 ACC 6.604519774011299 TIME 2.13219229777654
TRAIN EPOCH 16 LOSS 1.6094361907344754 ACC 6.333333333333333 TIME 2.131902329126994
VALIDATING EPOCH 16 ACC 0.2113157894736842
UPDATING ACC
TRAIN EPOCH 17 LOSS 1.609422934257378 ACC 6.220338983050848 TIME 2.130883463223775
TRAIN EPOCH 18 LOSS 1.6094214175380557 ACC 6.220338983050848 TIME 2.1380335450172425
VALIDATING EPOCH 18 ACC 0.2113157894736842
TRAIN EPOCH 19 LOSS 1.6094259192041085 ACC 6.220338983050848 TIME 2.1383378585179647

Your loss is going down, so the model is learning. It could be that you are not computing the predictions correctly. Delete the softmax and argmax for the predictions and instead just do this:

 pred = output.data.max(1, keepdim=True)[1]

and then to determine the number of correct predictions you can just do this:

correct += np.sum(np.squeeze(pred.eq(labels.data.view_as(pred))).cpu().numpy())
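For reference, dropped into your Trainer.forward it could look something like this (a sketch; I'm assuming clsf_out holds the raw logits and gold_labels the integer targets):

labels = gold_labels.to(self.device)
pred = clsf_out.data.max(1, keepdim=True)[1]                                     # index of the max logit per sample, shape (batch, 1)
correct = np.sum(np.squeeze(pred.eq(labels.data.view_as(pred)).cpu().numpy()))   # number of matches in the batch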

Thank you for your suggestion! I changed the code according to what you suggested, but the validation accuracy stayed the same. At least now I can confirm that autograd is working, and I will look deeper into the network design.

Yes, it could just be that your model is not complex enough to fit your dataset. What model and dataset are you using?

I am using a standard two-layer BiLSTM sentence encoder and a two-layer MLP. The task is to predict sentence ordering, using the dataset from GitHub - ZeweiChu/DiscoEval (EMNLP DiscoEval paper).

Ok, your code doesn't look wrong. I think the problem might just be that you need to train for longer; as long as the loss keeps decreasing, it should be fine. The accuracy on predicting sentence order would probably not change much until the model has trained for a while. Sorry I couldn't help more.
