Training accuracy does not increase when training LSTM with batch_size > 1

I am currently trying to train a 3-layer LSTM for a classification task. The input sequences have variable length, so I zero-pad every sequence to the longest one within the minibatch and set the padded labels to -1, which is ignored in the loss calculation. When I train the LSTM with batch_size=1 it works well: the cross-entropy loss decreases and the training classification accuracy increases. The problem is that when I set batch_size > 1, e.g. batch_size=8, the loss decreases while the accuracy does not increase. Could anyone help me figure out why?
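
To make the padding scheme concrete, here is a toy sketch of what I mean (made-up shapes and labels, not my real data):

    import torch
    import torch.nn as nn

    # Toy example: two sequences of length 3 and 5, feature dim 2, 3 classes.
    seqs   = [torch.randn(3, 2), torch.randn(5, 2)]
    labels = [torch.LongTensor([1, 0, 2]), torch.LongTensor([2, 2, 0, 1, 1])]

    max_len = max(s.size(0) for s in seqs)
    x = torch.zeros(len(seqs), max_len, 2)              # zero-padded features
    y = torch.LongTensor(len(seqs), max_len).fill_(-1)  # -1 marks padded frames
    for i, (s, l) in enumerate(zip(seqs, labels)):
        x[i, :s.size(0)] = s
        y[i, :l.size(0)] = l

    # ignore_index=-1 drops the padded frames from the loss
    criterion = nn.CrossEntropyLoss(ignore_index=-1)
    logits = torch.randn(len(seqs) * max_len, 3)        # stand-in for model output
    loss = criterion(logits, y.view(-1))
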
Some related code is as follows:

import sys
from datetime import datetime

import torch
import torch.nn as nn
from torch.autograd import Variable


class Model(nn.Module):
	def __init__(self, args):
		super(Model, self).__init__()
		self.args = args
		self.n_d = args.feadim
		self.n_cell=args.hidnum
		self.depth = args.depth
		self.drop = nn.Dropout(args.dropout)
		self.n_V = args.statenum
		if args.lstm:
			self.rnn = nn.LSTM(self.n_d, self.n_cell,
				self.depth,
				dropout = args.rnn_dropout,
				batch_first = True
			)
		else:
			pass
		self.output_layer = nn.Linear(self.n_cell, self.n_V)
		
	def forward(self, x, hidden,lens):

		rnnout, hidden = self.rnn(x, hidden)

		output = self.drop(rnnout)
		output = output.view(-1, output.size(2))
		output = self.output_layer(output)
		return output, hidden

def train_model(epoch, model, train_reader):
	model.train()
	args = model.args
	batch_size = args.batch_size
	total_loss = 0.0
	criterion = nn.CrossEntropyLoss(size_average=False,ignore_index=-1)
	hidden = model.init_hidden(batch_size)
	i=0
	running_acc=0
	total_frame=0

	while True:
		feat,label,length = train_reader.load_next_nstreams()
		if length is None or label.shape[0]<args.batch_size:
			break
		else:
			x, y = Variable(torch.from_numpy(feat)).cuda(), Variable(torch.from_numpy(label).long()).cuda()
			hidden = model.init_hidden(batch_size)
			hidden = (Variable(hidden[0].data), Variable(hidden[1].data)) if args.lstm else Variable(hidden.data)

			model.zero_grad()
			output, hidden = model(x, hidden,length)
			assert x.size(0) == batch_size
			loss = criterion(output, y.view(-1))

			_,predict = torch.max(output,1)
			correct = (predict == y).sum()
			loss.backward()
			
			total_loss += loss.data[0]
			running_acc += correct.data[0]
			total_frame += sum(length)

			i+=1
			if i%10 == 0:
				sys.stdout.write("time:{}, Epoch={},trbatch={},loss={:.4f},tracc={:.4f}\n".format(
					datetime.now(), epoch, i,
					total_loss/total_frame,
					running_acc*1.0/total_frame))
			sys.stdout.flush()
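
One more note on the code above: the `lens` argument is passed into forward() but not used there, so the padded frames still run through the LSTM and padding is only handled through ignore_index in the loss. If I wanted the LSTM itself to skip the padded frames, forward() could be rewritten roughly like this (a sketch; it assumes `lens` holds the true lengths sorted in descending order, as older PyTorch requires):

    from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

    def forward(self, x, hidden, lens):
        # pack so the LSTM ignores the zero-padded time steps
        packed = pack_padded_sequence(x, lens, batch_first=True)
        packed_out, hidden = self.rnn(packed, hidden)
        rnnout, _ = pad_packed_sequence(packed_out, batch_first=True)
        output = self.drop(rnnout)
        output = output.view(-1, output.size(2))
        output = self.output_layer(output)
        return output, hidden
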

hi Pan-Zhou,

I am not sure exactly why you are seeing this behavior. If you pin it down, I would love to know.

Some easy things to try:

  • Increase / decrease the learning rate and see what happens.
  • Print out the min/max values of the network's weights over learning, as well as their norms (see the sketch after this list).
  • Check whether any weights are becoming NaN.
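
For the last two bullets, a rough sketch of what I mean (assuming your model is a plain nn.Module):

    def log_param_stats(model):
        # print min / max / L2 norm of every parameter, and flag NaNs
        for name, p in model.named_parameters():
            data = p.data
            print("{}: min={:.4f} max={:.4f} norm={:.2f}".format(
                name, float(data.min()), float(data.max()), float(data.norm())))
            if (data != data).any():  # NaN is the only value that is != itself
                print("WARNING: NaN in {}".format(name))

Calling this once per epoch (or every N batches) should make it obvious if something is blowing up or collapsing to zero.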

Thanks for your advice. I tried tuning the learning rate and found it helps a little. I am using about 150 hours of speech features to train a 3-layer LSTM with 400 cells per layer, with batch_size=4. Here is the log information and the weight norms after each epoch.

    Epoch=0  lr=2.0000  train_loss=3.6674  dev_loss=2.9469  tracc=0.0777  validacc=0.0789   [58.3999m]
    Epoch=1  lr=2.0000  train_loss=3.1931  dev_loss=2.8304  tracc=0.0786  validacc=0.0787   [57.8110m]
    Epoch=2  lr=2.0000  train_loss=3.0930  dev_loss=2.7190  tracc=0.0793  validacc=0.0786   [58.2592m]
    
    p_norm: ['7', '23', '0', '0', '23', '23', '0', '0', '23', '23', '0', '0', '37', '0']
    p_norm: ['71', '82', '20', '20', '82', '82', '17', '17', '82', '99', '16', '16', '144', '27']
    p_norm: ['91', '104', '25', '25', '98', '104', '21', '21', '95', '119', '19', '19', '160', '30']
    p_norm: ['97', '110', '27', '27', '106', '118', '22', '22', '102', '129', '19', '19', '166', '31']
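
(Each p_norm list holds one entry per parameter tensor, presumably the 12 tensors of the 3-layer LSTM plus the weight and bias of the output layer, gathered with something along these lines:)

    # one L2 norm per parameter tensor, printed as rounded strings
    p_norm = [str(int(p.data.norm())) for p in model.parameters()]
    print("p_norm:", p_norm)
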

In fact, I used the same data and the same data I/O function to train a 3-layer LSTM with TensorFlow, and it works well. The training and validation losses are:

    End of epoch 0 with avg loss 3.66545295715 and accuracy 0.26187556982
    End of epoch 1 with avg loss 2.78404808044 and accuracy 0.355499237776
    End of epoch 2 with avg loss 2.55863642693 and accuracy 0.38808375597
    End of epoch 3 with avg loss 2.42844891548 and accuracy 0.4079862535
    End of epoch 4 with avg loss 2.33932137489 and accuracy 0.422125428915
    End of epoch 5 with avg loss 2.20702433586 and accuracy 0.445332825184
    End of epoch 6 with avg loss 2.12942314148 and accuracy 0.459314882755
    End of epoch 7 with avg loss 2.08610677719 and accuracy 0.467183083296
    End of epoch 8 with avg loss 2.06255722046 and accuracy 0.471532851458
    End of epoch 9 with avg loss 2.04997444153 and accuracy 0.473860412836

    epoch 0 valid split mean loss: 2.96507430077, accuracy: 0.331166476011
    epoch 1 valid split mean loss: 2.6755862236, accuracy: 0.3697052598
    epoch 2 valid split mean loss: 2.54053473473, accuracy: 0.389888346195
    epoch 3 valid split mean loss: 2.47018957138, accuracy: 0.399514913559
    epoch 4 valid split mean loss: 2.42790412903, accuracy: 0.40643504262
    epoch 5 valid split mean loss: 2.35705971718, accuracy: 0.420234382153
    epoch 6 valid split mean loss: 2.32587504387, accuracy: 0.426783770323
    epoch 7 valid split mean loss: 2.31113815308, accuracy: 0.42946600914
    epoch 8 valid split mean loss: 2.30379247665, accuracy: 0.430867373943
    epoch 9 valid split mean loss: 2.29960465431, accuracy: 0.431604236364