LSTM text generator repeats same words over and over

Hey!

I built an LSTM for character-level text generation with PyTorch. The model trains well (the loss decreases reasonably, etc.), but the trained model ends up outputting the last handful of words of the input repeated over and over again.

I have played around with the hyperparameters a bit, and the problem persists. I’m currently using:

  • Loss function: BCE

  • Optimizer: Adam

  • Learning rate: 0.001

  • Sequence length: 64

  • Batch size: 32

  • Embedding dim: 128

  • Hidden dim: 512

  • LSTM layers: 2

I also tried not always choosing the top choice, but this only introduces incorrect words and doesn’t break the loop. I’ve been looking at countless tutorials, and I can’t quite figure out what I’m doing differently/wrong.

The following is the code for training the model. training_data is one long string and I’m looping over it predicting the next character for each substring of length SEQ_LEN. I’m not sure if my mistake is here or elsewhere but any comment or direction is highly appreciated!

loss_dict = dict()
for e in range(EPOCHS):
    print("------ EPOCH {} OF {} ------".format(e+1, EPOCHS))
    
    lstm.reset_cell()
    
    for i in range(0, DATA_LEN, BATCH_SIZE):
        
        if i % 50000 == 0:
            print(i/float(DATA_LEN))
        
        optimizer.zero_grad()
        
        input_vector = torch.tensor([[
            vocab.get(char, len(vocab)) 
            for char in training_data[i+b:i+b+SEQ_LEN]
        ] for b in range(BATCH_SIZE)])
        
        if USE_CUDA and torch.cuda.is_available():
            input_vector = input_vector.cuda()
        
        output_vector = lstm(input_vector)        
        
        target_vector = torch.zeros(output_vector.shape)
        
        if USE_CUDA and torch.cuda.is_available():
            target_vector = target_vector.cuda()
        
        for b in range(BATCH_SIZE):
            target_vector[b][vocab.get(training_data[i+b+SEQ_LEN])] = 1
        
        error = loss(output_vector, target_vector)
        
        error.backward()
        optimizer.step()
        
        loss_dict[(e, int(i/BATCH_SIZE))] = error.detach().item()

I also tried not always choosing the top choice, but this only introduces incorrect words and doesn’t break the loop.

Hm, if you want to generate different texts, you should sample randomly according to the predicted probabilities – is that what you were doing when you said that you are not always choosing the top choice?

Also, you probably want to reset the hidden state after each generated text.
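Roughly, the generation loop I have in mind looks like this (just a sketch – `idx2char`, the seed string, and the softmax are assumptions on my side, so adapt it to however your lstm actually returns its outputs):

import torch
import torch.nn.functional as F

# sketch of sampling-based generation; `lstm`, `vocab`, and SEQ_LEN are from
# the training code above, `idx2char` (index -> character) is assumed to exist
lstm.reset_cell()                     # reset the hidden state for each new text
generated = list("some seed text ")   # arbitrary seed string

for _ in range(200):
    context = generated[-SEQ_LEN:]
    x = torch.tensor([[vocab.get(c, len(vocab)) for c in context]])
    probs = F.softmax(lstm(x), dim=-1)                          # (1, vocab_size)
    next_idx = torch.multinomial(probs, num_samples=1).item()   # sample, don't argmax
    generated.append(idx2char[next_idx])

print("".join(generated))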

Not a great one, but I have a simple CharRNN here in case that’s useful: https://github.com/rasbt/stat479-deep-learning-ss19/blob/master/L14_intro-rnn/code/char_rnn.ipynb

Thank you! I’m looking through your notebook and will make some changes to my approach (e.g. sample text randomly). I think it’ll be helpful.

What I did regarding top choices is that I defined a threshold (e.g. 0.8 times the max), and sampled randomly between all characters that exceeded the threshold. But since the network just ends up repeating the last words of the input, I think it trains to a point where the gap between 1st choice and 2nd choice is super large…
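Concretely, what I did looks roughly like this (just a sketch; `probs` here stands for the model’s predicted distribution over characters for the next step):

import torch

# sketch of the thresholding I described: keep every character whose
# probability is at least 0.8 * max, then pick uniformly among those
threshold = 0.8 * probs.max()
candidates = torch.nonzero(probs >= threshold).flatten()
next_idx = candidates[torch.randint(len(candidates), (1,))].item()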

I have a feeling my problem may have to do with the way I’m feeding the network the input and target during training or something? I’m having a hard time wrapping my head around what, conceptually, can lead to this repetition of the last words of the input…?

I think it trains to a point where the gap between 1st choice and 2nd choice is super large…

Good point, I think this can easily happen.

I have a feeling my problem may have to do with the way I’m feeding the network the input and target during training or something?

Yeah, your issue may already occur during training (as opposed to “inference”). Are you drawing random chunks from your training set? If not, you probably want to do that. It looks like in

    input_vector = torch.tensor([[
        vocab.get(char, len(vocab)) 
        for char in training_data[i+b:i+b+SEQ_LEN]
    ] for b in range(BATCH_SIZE)])

that the network always sees the same chunks, in the same order, in every epoch? For debugging, maybe print out the text chunks you are feeding the network to make sure they are not repeating.
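E.g., something along these lines (just a sketch reusing your variable names):

import numpy as np
import torch

# sketch: pick BATCH_SIZE random starting positions per step instead of
# walking through the text in order, so every batch sees different chunks
starts = np.random.randint(0, DATA_LEN - SEQ_LEN - 1, size=BATCH_SIZE)

input_vector = torch.tensor([
    [vocab.get(char, len(vocab)) for char in training_data[s:s + SEQ_LEN]]
    for s in starts
])
next_chars = [training_data[s + SEQ_LEN] for s in starts]   # targets (one-hot encode as before)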

Okay, it was actually a stupid mistake I made in producing the characters with the trained model: I got confused with the batch size and assumed that at each step the network would predict an entire batch of new characters when in fact it only predicts a single one… Yikes!

Anyways, thanks for your advice and I’ll see if I can use it to fine-tune the results a bit!

Glad to hear that you found the issue. And yeah, I find working with text data pretty complicated (compared to image data) – very easy to sneak in some coding errors regarding the sampling of the text.


Hi,
Can you elaborate, please? I have the same problem.
I’m trying to give my model a sentence, append the model’s predicted word to the end of the sentence, then slide the window and feed the sentence (from the second word up to the newly predicted word) back into the model, and so on… but all I get is a repeated word.

Here is how I prepare data batches:

import random
import numpy as np

# get words (`data` is the full training text as one string)
words = data.split()
# remove repeated words to get vocab:
vocab = set(words)
print(f'\nvocab size: {len(vocab)}')
# make word dict
word2idx = {w:idx for idx, w in enumerate(vocab)}
idx2word = {idx:w for w, idx in word2idx.items()}

# get data batches
sequence_len = 30
batch_size = 50
word_size = 20
num_epochs = 2

data_batches = []
for i in range(len(words) - sequence_len - 1):  # 1 for target word
	data = [word2idx[x] for x in words[i: i + sequence_len]]
	target = word2idx[words[i + sequence_len]]

	data_batches.append([data, target])

num_batchs = np.ceil(len(data_batches) / batch_size).astype(int)
random.shuffle(data_batches)
print(f'\nnumber of batches: {num_batchs}')

This is my model:

import torch
import torch.nn as nn

class MyNet(nn.Module):
	def __init__(self, vocab_size, word_size, sequence_len, batch_size, hidden_dim):
		super().__init__()

		self.seq_len = sequence_len
		self.batch_size = batch_size
		self.hidden_dim = hidden_dim
		self.num_layers = 2
		self.h_0 = None
		self.c_0 = None

		self.encoder = nn.Embedding(vocab_size, word_size)
		self.lstm = nn.LSTM(input_size=word_size, hidden_size=hidden_dim,
							num_layers=self.num_layers, batch_first=True, dropout=0)
		self.linear = nn.Linear(hidden_dim, vocab_size)

	def reset_hidden(self):
		# For initialising/resetting hidden state.
		self.h_0 = torch.zeros(self.num_layers, self.batch_size, self.hidden_dim)
		self.c_0 = torch.zeros(self.num_layers, self.batch_size, self.hidden_dim)

	def forward(self, x):
		encoded = self.encoder(x)
		# print(f'encoded: {encoded.shape}')
		lstm_out, (self.h_0, self.c_0) = self.lstm(encoded)
		# print(f'{lstm_out.shape}, {lstm_out[:, -1, :].shape}')
		# Only take the output from the final timestep
		y_pred = self.linear(lstm_out[:, -1, :])
		
		return y_pred

And this is training process:

# begin training
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = MyNet(len(vocab), word_size, sequence_len, batch_size, 50).to(device)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.5)


for e in range(num_epochs):
	print(f'\n\nepoch #{e}:\n')
	model.reset_hidden()
	
	for i in range(num_batchs):
		batch = data_batches[i * batch_size: (i+1) * batch_size]
		x = torch.tensor([b[0] for b in batch], device=device)
		y = torch.tensor([b[1] for b in batch], device=device)
		# print(x.shape, y.shape)

		y_pred = model(x)

		loss = criterion(y_pred, y)

		optimizer.zero_grad()
		loss.backward()
		optimizer.step()

		if i % 50 == 0:
			print(f'\tbatch #{i}:\tloss={loss.item():.10f}')

torch.save(model.state_dict(), './model.pth')

Finally this is how I test my model:

# testing model
# choosing a random sequence
sentence = data_batches[np.random.randint(1, 500)][0]
print('\n', [idx2word[idx] for idx in sentence])
model.eval()

# trying to predict next 10 words
for i in range(10):
	# model.reset_hidden()
	x = torch.tensor(sentence[i : sequence_len +  i], device=device)
	out = model(x.view(1, *x.size()))
	out = torch.argmax(out)

	word = idx2word[out.detach().item()]
	print(word)

	sentence.append(out.detach().item())

Using argmax isn’t a good way to select the continuation word. Instead, treat the model’s output as a probability distribution (apply a softmax to the logits) and sample from it, e.g. with np.random.choice using the predicted probabilities as weights.
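For example, something like this, reusing the variables from your test code (just a sketch; I’m assuming the model returns raw logits):

import numpy as np
import torch
import torch.nn.functional as F

# sketch: sample the next word from the predicted distribution instead of argmax
# (uses `model`, `sentence`, `sequence_len`, `idx2word`, `device` from the code above)
x = torch.tensor(sentence[-sequence_len:], device=device)
logits = model(x.view(1, *x.size()))
p = F.softmax(logits, dim=-1).squeeze(0).detach().cpu().numpy()
p = p / p.sum()                       # renormalize to guard against float rounding
next_idx = np.random.choice(len(p), p=p)
sentence.append(int(next_idx))
print(idx2word[int(next_idx)])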


Hi, I’m having similar issues here.

I’ve been working on a caption generator: GitHub - kd1510/neural_image_caption: Implementing a ConvNet+LSTM caption net

My network seems to train correctly and does well on the validation set.

However, when I try to run inference on an image from Google (even on images similar to those in the training set), my LSTM keeps generating the same word with probability >0.90. I’m not sure if this is an issue with my actual inference code:

with torch.no_grad():
    # first step: encode the image and let the decoder predict the first word
    im_vec = enc(im.unsqueeze(0))
    pred = dec(im_vec, img_vec=True)
    pred = pred.argmax().item()
    gen_cap.append(ix2word[pred])

    print(gen_cap)

    # subsequent steps: feed the previous prediction back into the decoder
    for _ in range(10):
        logits = dec(torch.tensor(pred).cuda().unsqueeze(0), img_vec=False)
        probs = F.softmax(logits, dim=-1)
        print([ix2word[w] for w in torch.topk(probs, 10).indices.cpu().numpy()[0]])
        print(probs[0][probs.argmax().item()])
        pred = probs.argmax().item()
        gen_cap.append(ix2word[pred])

    print(gen_cap)

Some help here would be very appreciated as I’ve had this happen before when doing more basic language modelling. It might come down to me not understanding fully how to work with the hidden states etc.

This can typically occur when a particular word shows up much more often than others, making the dataset unbalanced. For instance, suppose the word “the” accounts for 5% of all tokens in your dataset, yet you have 32,000 possible tokens, so the average frequency would be around 0.003%. That means “the” is showing up at roughly 1600x the average rate. You can balance this out with class weights in the loss function (the weight argument of CrossEntropyLoss, or pos_weight for BCEWithLogitsLoss). And, as others have suggested, sample a random weighted token instead of using argmax().
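A rough sketch of that kind of weighting (`token_ids`, `vocab_size`, and `device` here are placeholders, not names from your code):

import torch
from collections import Counter

# sketch: inverse-frequency class weights for an unbalanced vocabulary;
# `token_ids` holds all training tokens as indices, floor of 1 avoids division by zero
counts = Counter(token_ids)
freqs = torch.tensor([counts.get(i, 1) for i in range(vocab_size)], dtype=torch.float)
weights = freqs.sum() / (vocab_size * freqs)    # rare tokens get larger weights
criterion = torch.nn.CrossEntropyLoss(weight=weights.to(device))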