Is this code correct for character-level generation with an LSTM?

I started learning NLP on my own. Initially I worked on movie review sentiment classification, and that works fine. Next I started working on text generation from Shakespeare's text, but it is not training at all. For every epoch I print the predicted results, and they are the same every time.

@ptrblck

All about dimensions
I took the entire corpus and divided it into chunks of a fixed length, say 100. Each character is replaced by its index in the dictionary, so the input and target look like this:

input:
[38, 32, 38, 50, 56, 28, 5, 41, 38, 5, 67, 52, 33, 4, 32, 67, 52, 67, 8, 28, 62, 66, 38, 59, 71, 59, 50, 28, 31, 44, 27, 28, 66, 56, 15, 50, 50, 28, 28, 38, 50, 28, 53, 80, 37, 57, 1, 28, 38, 42]
target:
[32, 38, 50, 56, 28, 5, 41, 38, 5, 67, 52, 33, 4, 32, 67, 52, 67, 8, 28, 62, 66, 38, 59, 71, 59, 50, 28, 31, 44, 27, 28, 66, 56, 15, 50, 50, 28, 28, 38, 50, 28, 53, 80, 37, 57, 1, 28, 38, 42, 82]

Let us say the batch size is 32 and the fixed length is 100.

Input to model: [32, 100]

To use batches, the corresponding characters of each sequence have to go into the same row, so I apply a transpose and the shape becomes [100, 32].

Then I send it to an nn.Embedding layer (the output shape becomes [100, 32, emb_dim]) and then to an LSTM layer (the output becomes [100, 32, hidden_dim]).
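A minimal, standalone sketch of this shape flow (the sizes here are placeholders, with 83 standing in for my character vocabulary):

import torch
import torch.nn as nn

batch_size, fixed_length = 32, 100
vocab_dim, emb_dim, hidden_dim = 83, 50, 50

x = torch.randint(0, vocab_dim, (batch_size, fixed_length))   # [32, 100] batch of index sequences
x = x.transpose(0, 1)                                         # [100, 32] time-major layout

embedding = nn.Embedding(vocab_dim, emb_dim)
lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=2)

emb_out = embedding(x)                     # [100, 32, emb_dim]
lstm_out, (h_n, c_n) = lstm(emb_out)       # [100, 32, hidden_dim]
print(emb_out.shape, lstm_out.shape)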
If you have any doubts, please take a look at my model.

Model implementation

import torch
import torch.nn as nn

class CharRNN(nn.Module):
  def __init__(self, n_hidden=50, n_layers=2, drop_prob=0.5, vocab_dim=50, emb_size=50, batch_size=64, device="cuda"):
      super().__init__()
      self.drop_prob = drop_prob
      self.n_layers = n_layers
      self.n_hidden = n_hidden
      self.batch_size=batch_size
      self.device=device
      self.embedding=nn.Embedding(vocab_dim,emb_size)
      self.lstm=nn.LSTM(emb_size,n_hidden,n_layers,dropout=self.drop_prob)
      
  

  def init_hidden(self):
        """Set initial hidden states."""
        h0 = torch.zeros(
            self.n_layers,
            self.batch_size,
            self.n_hidden,
        )
        c0 = torch.zeros(
            self.n_layers,
            self.batch_size,
            self.n_hidden,
        )
        
        h0 = h0.to(self.device)
        c0 = c0.to(self.device)

        return h0, c0    


  def apply_rnn(self, embedding_out):
        
        activations, (hn,cn) = self.lstm(embedding_out, self.init_hidden())
        
        return activations

  def forward(self, inputs, return_activations=False):
        self.batch_size = len(inputs)

        inputs = torch.LongTensor(inputs).to(self.device)
        inputs = inputs.transpose(0, 1)
        
        # Get embeddings
        embedding_out = self.embedding(inputs)
        
        activations = self.apply_rnn(embedding_out)
        
        out = torch.sigmoid(activations)

        # Put the output back in correct order
        return out

Now the output is of shape [100, 32, hidden_dim]. To apply cross entropy, my target is of shape [32, 100], so I first transpose the output back (the shape becomes [32, 100, hidden_dim]) and then permute it (the shape becomes [32, hidden_dim, 100]). After that I compute the loss.
Here hidden_dim plays the role of the number of classes: I treat the output as a one-hot-style vector by sending it through a sigmoid and picking the most probable character.
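For reference, a small standalone sketch of the shapes nn.CrossEntropyLoss expects here (placeholder sizes again):

import torch
import torch.nn as nn

batch_size, fixed_length, n_classes = 32, 100, 83

# model output after transposing back to batch-first and permuting: [N, C, L]
logits = torch.randn(batch_size, n_classes, fixed_length)
# targets stay as class indices, not one-hot: [N, L]
targets = torch.randint(0, n_classes, (batch_size, fixed_length))

loss = nn.CrossEntropyLoss()(logits, targets)
print(loss.item())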

Here is my training code



import torch
import torch.nn as nn
import torch.optim as optim
from tqdm import tqdm_notebook

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr=0.1,
)


def train_epoch(model, optimizer, train_loader):
    model.train()
    total_loss = total = 0
    
    progress_bar = tqdm_notebook(train_loader, desc='Training', leave=False)
    for inputs,targets in progress_bar:
        targets=torch.LongTensor(targets)
        target = targets.to(device)  # targets shape[batch_size,fixed_length]
        #target=target.transpose(0,1)
        # Clean old gradients
        optimizer.zero_grad()
        
        # Forwards pass
        output = model(inputs)              # inputs: [batch_size, fixed_length]; output: [fixed_length, batch_size, hidden_dim]
        outputs = output.transpose(0, 1)    # [batch_size, fixed_length, hidden_dim]
        outputs = outputs.permute(0, 2, 1)  # [batch_size, hidden_dim, fixed_length]
        loss = criterion(outputs, target)

        # Perform gradient descent, backwards pass
        loss.backward()

        # Take a step in the right direction
        optimizer.step()
        #scheduler.step()

        # Record metrics
        total_loss += loss.item()
        total += len(target)
        progress_bar.set_description(
        f'train_loss: {loss:.2e}'
        f'\tavg_loss: {total_loss/total:.2e}\n',
      )

    return total_loss / total


I also printed the results during the 9th and 10th epochs. Please take a look at them and see what went wrong.

predicted_text=
c&aeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee

Parameter containing:
tensor([[-0.0848,  1.7742, -0.7627,  ...,  0.2818,  0.3630,  0.6283],
        [ 0.1355, -0.4772,  0.0499,  ...,  0.6572, -1.6990,  0.6295],
        [ 1.3463, -1.8206, -0.1466,  ..., -0.4201,  1.1724, -1.2711],
        ...,
        [ 0.7217,  0.2917, -0.4138,  ..., -0.9130, -0.3257,  0.7373],
        [ 1.9201,  1.0811,  0.0864,  ..., -1.5404, -0.4448, -1.3606],
        [-0.3362,  0.4130,  0.4206,  ...,  1.8701,  1.0428, -0.9026]],
       device='cuda:0', requires_grad=True)
Gradient containing:
tensor([[-6.0252e-05,  2.3833e-05, -9.2437e-05,  ..., -4.4102e-05,
          6.2251e-05,  3.9563e-05],
        [-1.2471e-04,  1.6429e-04, -1.6832e-06,  ..., -4.8308e-05,
          2.9626e-05,  6.5522e-05],
        [-2.6410e-05,  6.6710e-05, -7.8888e-05,  ..., -1.4033e-05,
         -2.6348e-05,  2.2501e-05],
        ...,
        [-8.3676e-06, -4.0126e-05, -5.0642e-05,  ...,  1.2275e-05,
         -3.3493e-05,  1.0394e-06],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          0.0000e+00,  0.0000e+00],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          0.0000e+00,  0.0000e+00]], device='cuda:0')
epoch #  9	train_loss: 1.33e-01	valid_loss: 2.17e-01


predicted_text=
c&aeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee

Parameter containing:
tensor([[-0.0848,  1.7740, -0.7626,  ...,  0.2818,  0.3629,  0.6283],
        [ 0.1355, -0.4778,  0.0503,  ...,  0.6571, -1.6992,  0.6297],
        [ 1.3463, -1.8207, -0.1465,  ..., -0.4201,  1.1724, -1.2711],
        ...,
        [ 0.7217,  0.2913, -0.4136,  ..., -0.9129, -0.3258,  0.7374],
        [ 1.9201,  1.0810,  0.0864,  ..., -1.5403, -0.4448, -1.3605],
        [-0.3362,  0.4130,  0.4206,  ...,  1.8701,  1.0428, -0.9026]],
       device='cuda:0', requires_grad=True)
Gradient Containing:
tensor([[-5.8078e-05,  3.2291e-05, -8.6524e-05,  ..., -4.5590e-05,
          7.4338e-05,  3.1892e-05],
        [-1.0505e-04,  1.4412e-04,  7.3077e-06,  ..., -4.6723e-05,
          3.1852e-05,  5.1921e-05],
        [-2.7566e-05,  6.2100e-05, -7.5454e-05,  ..., -1.2124e-05,
         -2.9602e-05,  2.2030e-05],
        ...,
        [-7.7157e-06, -3.8574e-05, -4.5693e-05,  ...,  1.3392e-05,
         -3.1313e-05, -9.7651e-07],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          0.0000e+00,  0.0000e+00],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          0.0000e+00,  0.0000e+00]], device='cuda:0')
epoch # 10	train_loss: 1.33e-01	valid_loss: 2.16e-01

Thank you.

Should you be using an nn.Linear layer here instead of the sigmoid activation?
I see that CrossEntropyLoss is being used, so simply feeding it the logit outputs from the linear layer should be fine.


Sir, I didn't get it. So after getting the outputs from the LSTM, I have to pass them through a linear layer and then through a sigmoid, right?
Before this I did movie sentiment classification, where it worked fine just using the hidden outputs. Okay sir, I will try that out.
Also, sir, is my implementation correct?

What was the sentiment classification problem like? Was it a binary classification problem (positive vs. negative)? In that case, a sigmoid activation would work. But this is multi-class classification, so my guess is that you shouldn't be using sigmoid here. Instead, replace it with a linear layer, nn.Linear(input_dim, num_classes).
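Something along these lines, as a rough sketch (83 is just a stand-in for your vocabulary size):

import torch
import torch.nn as nn

n_hidden, num_classes = 50, 83              # num_classes = number of distinct characters
head = nn.Linear(n_hidden, num_classes)

lstm_out = torch.randn(100, 32, n_hidden)   # [fixed_length, batch_size, n_hidden] from the LSTM
logits = head(lstm_out)                     # [fixed_length, batch_size, num_classes]
# pass these raw logits to CrossEntropyLoss; no sigmoid or softmax needed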


Yes sir, I confused softmax with sigmoid, but cross entropy actually includes the softmax, so I don't think I need it. Thank you, sir. My loss is decreasing now, but the prediction is not changing at all. Can you please take a look at it?
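Just to double-check that point for myself, a quick sanity check that nn.CrossEntropyLoss already includes the log-softmax:

import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(4, 10)                  # 4 samples, 10 classes
targets = torch.randint(0, 10, (4,))

a = nn.CrossEntropyLoss()(logits, targets)
b = F.nll_loss(F.log_softmax(logits, dim=1), targets)
print(torch.allclose(a, b))                  # True: CrossEntropyLoss = log-softmax + NLLLoss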

Updated model definition

class CharRNN(nn.Module):
  def __init__(self,n_hidden=50,n_layers=2,drop_prob=0.5,vocab_dim=50,emb_size=50,batch_size=64,device="cuda"):
      super().__init__()
      self.drop_prob = drop_prob
      self.n_layers = n_layers
      self.n_hidden = n_hidden
      self.batch_size=batch_size
      self.device=device
      self.embedding=nn.Embedding(vocab_dim,emb_size)
      self.lstm=nn.LSTM(emb_size,n_hidden,n_layers,dropout=self.drop_prob)
      self.linear=nn.Linear(n_hidden,n_hidden)
  

  def init_hidden(self):
        """Set initial hidden states."""
        h0 = torch.zeros(
            self.n_layers,
            self.batch_size,
            self.n_hidden,
        )
        c0 = torch.zeros(
            self.n_layers,
            self.batch_size,
            self.n_hidden,
        )
        
        h0 = h0.to(self.device)
        c0 = c0.to(self.device)

        return h0, c0    


  def apply_rnn(self, embedding_out):
        
        activations, (hn,cn) = self.lstm(embedding_out, self.init_hidden())
        
        return activations

  def forward(self, inputs, return_activations=False):
        self.batch_size = len(inputs)
    
        inputs = torch.LongTensor(inputs).to(self.device)
        inputs=inputs.transpose(0,1)
        
        # Get embeddings
        embedding_out = self.embedding(inputs)
        
        activations = self.apply_rnn(embedding_out)
        
        activations=activations.transpose(0,1) # output shape [batchSize,fixed_length,n_hidden]
        
        activations=self.linear(activations)
        

        return activations

My model results. It is predicting spaces again and again:

epoch #  6	train_loss: 1.05e-01	valid_loss: 1.07e-01



prediction: c                                                   

epoch #  7	train_loss: 1.05e-01	valid_loss: 1.07e-01



prediction:c                                                   

epoch #  8	train_loss: 1.05e-01	valid_loss: 1.07e-01



prediction:c                                                   

epoch #  9	train_loss: 1.05e-01	valid_loss: 1.06e-01



prediction: c                                                   

epoch # 10	train_loss: 1.04e-01	valid_loss: 1.06e-01

Code for printing prediction

def print_re(text, hiddens):

  print(text, end=' ')    # print the first character
  model.eval()            # set to eval mode
  with torch.no_grad():
        # Forward pass
        tokens = torch.LongTensor([[char2int[text]]]).to(device)  # convert the first char to a token
        for i in range(0, 50):    # I considered the fixed length as 50

          embeddings = model.embedding(tokens)   # accessing layers through the (.) operator. Is this correct?
                                                 # output shape [1, 1, emb_len]

          hidden_output, (h, c) = model.lstm(embeddings, hiddens)  # output for this single token
                                                                   # output shape [1, 1, n_hidden]

          hidden_output = model.linear(hidden_output)
                                                                   # output shape [1, 1, n_hidden]

          hidden_output = hidden_output[0].squeeze(0).squeeze(0)

          prediction = hidden_output.cpu().numpy()

          prediction = np.argmax(prediction, axis=0)

          print(int2char[prediction], end='')

          tokens = torch.LongTensor([[prediction]]).to(device)

It is printing just spaces for the prediction. Please see if there is any problem, sir. Thank you.

Sir, I also tried printing the predicted classes in sorted order, and except for the first character (which I supply), the prediction from nn.Linear is the same every time. The sorted prediction order never changes.

predicted character:c

 predicted_classes in sorted order:
[63 76  2 18 59 51 57 13  3 41 10 68 28 82 19 23 46  0 70 54 80 47 11 69
 31 67 32 12 30 35 43 33 39  6 60  7 38 56  5 65 61 42 29 62 22 14  4 21
 44 17 81 36 55 77 53 78 27 74 75 45 16 71 26 48 37 72 49 40 73 25 64 20
 79 34 52 66 58  8 24 15 50  1  9]

output: tensor([[76]], device='cuda:0')  # I chose the second-highest value, but the predicted sorted order is always the same

[63 76  2 18 51 57 13 59  3 41 10 68 28 82  0 23 80 19 54 46 31 70 30 67
 39 69 11 12 47 33 35 32 38 61 43  6 56 65 60 42 29  7  5 21 22 16 44 81
 62 14  4 73 40 36 49 17 48 25 78 72 77 71 79 66 75  8 45 15 58 55 26 74
 53 20  1 64  9 50 37 34 27 24 52]

tensor([[76]], device='cuda:0')

[63 76  2 18 51 57 13 59  3 41 10 68 28 82  0 23 80 19 54 46 31 70 30 67
 39 69 11 12 47 33 35 32 38 61 43  6 56 65 60 42 29  7  5 21 22 16 44 81
 62 14  4 73 40 36 49 17 48 25 78 72 77 71 79 66 75  8 45 15 58 55 26 74
 53 20  1 64  9 50 37 34 27 24 52]
tensor([[76]], device='cuda:0')

Code for printing the prediction:

def print_re(text,hiddens):
  
  print(text,end=' ')
  model.eval()
  with torch.no_grad():
        # Forwards pass
        tokens=torch.LongTensor([[char2int[text]]]).to(device)
        for i in range(0,10):
         
          embeddings=model.embedding(tokens)
          #print(embeddings.shape)
          hidden_output,(h,c)=model.lstm(embeddings,hiddens)
          #print(hidden_output.shape)
          hidden_output=model.linear(hidden_output)
          #print(hidden_output.shape)
          
          #hidden_output = torch.sigmoid(hidden_output)
          
          hidden_output=hidden_output[0].squeeze(0).squeeze(0)
          
        
          prediction = hidden_output.cpu().numpy()
          #print(prediction)
          prediction = np.argsort(-prediction, axis=0)
          print(prediction)
          print(int2char[prediction[1]],end='')
          
          tokens=torch.LongTensor([[prediction[1]]]).to(device)
          print(tokens)

For the training part:

  1. Try using optim.Adam instead of optim.SGD.
  2. Add an nn.Dropout(p=0.1) layer before the linear layer in your model.
  3. Assuming you have sufficient high-quality data, make sure training is happening correctly (loss values decrease and converge). Plot the loss values.

For the inference part:

  1. Simply pass the input to the model and use np.argmax to pick the next best char. You don't need to call the individual layers (model.embedding, model.lstm, etc.) through the (.) operator yourself; see the sketch below.
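A rough, untested sketch of what I mean (the generate helper and the greedy argmax loop are just for illustration; char2int, int2char and model come from your code, and the growing prefix is fed back in because your forward() re-creates the hidden state on every call):

import numpy as np
import torch

def generate(model, start_char, length=50):
    model.eval()
    chars = [start_char]
    tokens = [[char2int[start_char]]]        # one sequence containing one character
    with torch.no_grad():
        for _ in range(length):
            logits = model(tokens)           # forward() handles the tensor/device conversion
            last = logits[0, -1]             # scores for the character after the last position
            next_idx = int(np.argmax(last.cpu().numpy()))
            chars.append(int2char[next_idx])
            tokens[0].append(next_idx)       # grow the prefix and feed it back in
    return ''.join(chars)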


I applied all your suggestions, sir, but the predictions are the same. It is printing spaces again and again.

Printing code:

def print_re(text,hiddens):
  
  print(text,end=' ')
  model.eval()
  with torch.no_grad():
        # Forwards pass
        tokens=[[char2int[text]]]
        #print(tokens)
        for i in range(0,10):
          output = model(tokens)
          #outputs=output.permute(0,2,1)
          #print(output.shape)
          hidden_output=output[0].squeeze(0).squeeze(0)
          
          prediction = hidden_output.cpu().numpy()
          #print(prediction)
          prediction = np.argsort(-prediction, axis=0)
          #print(prediction)
          print(int2char[prediction[0]],end='')
          
          tokens=[[prediction[0]]]
          #print(tokens)

Model code

 def forward(self, inputs, return_activations=False):
        self.batch_size = len(inputs)
    
        inputs = torch.LongTensor(inputs).to(self.device)
        inputs=inputs.transpose(0,1)
        
        # Get embeddings
        embedding_out = self.embedding(inputs)
        
        activations = self.apply_rnn(embedding_out)
        
        activations=activations.transpose(0,1) # output shape [batch_size, fixed_length, n_hidden]
        activations=self.dropout(activations)
        activations=self.linear(activations)
        
        #out = torch.sigmoid(activations)

        # Put the output back in correct order
        return activations    

Thank you sir

Most likely your network is not learning. A few more things to try:

  1. Remove the init_hidden in the LSTM; zero initializers are usually not a good idea.
  2. Maybe your dataset has way too many spaces, hence the model is simply overfitting to spaces.
  3. Try overfitting on a very small dataset (say just one batch of 15 chars) and train for a few epochs to bring the loss close to 0 (say < 1e-05). This is a nice way to test that your model is actually learning; see the sketch below.
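Something like this, roughly (an untested sketch; char2int, model and device are from your code above):

import torch
import torch.nn as nn
import torch.optim as optim

snippet = "shall i compare"                  # any ~15 chars from your corpus (all must be in char2int)
tiny_inputs = [[char2int[c] for c in snippet[:-1]]]
tiny_targets = torch.LongTensor([[char2int[c] for c in snippet[1:]]]).to(device)

optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

model.train()
for step in range(500):
    optimizer.zero_grad()
    logits = model(tiny_inputs)              # [1, 14, n_classes]
    loss = criterion(logits.permute(0, 2, 1), tiny_targets)
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        print(step, loss.item())
# if backpropagation is wired up correctly, this loss should drop towards ~0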

Sir, I don't think it can overfit to spaces, because there is only a single space every few characters, so I don't think that is the issue.
Sir, replacing torch.zeros with torch.randn in init_hidden really helps the predictions, so that is one part of the solution.

I tried several variations; the results are below. I saw your LinkedIn and your career is so good, so I assume you are busy. Thank you for answering my queries.

Sir, I did it with batch size 1 and 15 chars, and it converged to 0 loss. The prediction is also good.

[attached plot]

After that I tried with batch size 32 and the results are bad again.

So this time I tried with 1000 chars and batch size 1, with a sequence length of 30, after removing spaces, and the results are as follows:

epoch #285	train_loss: 8.89e-02	valid_loss: 0.00e+00


predicted_text=cdtmseteseeetehhnaelcibeaeaelce
epoch #286	train_loss: 7.01e-02	valid_loss: 0.00e+00


cieaeteelmteeh,wWshselueaeeeshs
epoch #287	train_loss: 5.59e-02	valid_loss: 0.00e+00


celceSeaehshhshh,tmteshslowhsee
epoch #288	train_loss: 5.67e-02	valid_loss: 0.00e+00


celyslueteaeteshnaesewlueseelse
epoch #289	train_loss: 4.86e-02	valid_loss: 0.00e+00

It didn't converge to 0, but the predicted text is actual text, not one repeated character.
[attached plot]

So let's try with spaces, sir.
This time I tried with 1000 chars and batch size 1, with a sequence length of 30, without removing spaces, and the results are as follows:

predicted_text=yt o oryuaTsy sye o-ssst o eb,
epoch #384	train_loss: 8.89e-02	valid_loss: 0.00e+00


cht syt,t o o eb,t,t s o sso ol
epoch #385	train_loss: 7.21e-02	valid_loss: 0.00e+00


c ot,tret e olebe o olyt,t,t,t 
epoch #386	train_loss: 7.94e-02	valid_loss: 0.00e+00

train_loss: 1.09e-01 avg_loss: 9.11e-02 : 100%


ci o yt solyeb,t,t onle e syeb,
epoch #387	train_loss: 9.11e-02	valid_loss: 0.00e+00

It will converge to 0, and the predicted text is actual text, not one repeated character. It is not repeating spaces at all, so I think it is good.

[attached plot]

So now I suspect that the batch size is the culprit, so I will increase the batch size to 5.
I overfit the data for 750 epochs.
Here are the results:

cta madraol mck  t eulkthlTvonh
epoch #496	train_loss: 2.02e-01	valid_loss: 0.00e+00


cii slmal eeta o m yt<EOS>mrs  a  h
epoch #497	train_loss: 1.98e-01	valid_loss: 0.00e+00


cneewntn Bee hew o kl    a o  h
epoch #498	train_loss: 1.96e-01	valid_loss: 0.00e+00


co<EOS>f yniimwy6  chm'hgrted oese 
epoch #499	train_loss: 1.96e-01	valid_loss: 0.00e+00


coadkth at  teadsrs  ho w:ognsk
epoch #500	train_loss: 1.94e-01	valid_loss: 0.00e+00

[attached plot]

The results are not bad; there are no repeating spaces. For the first 10 epochs it repeats spaces, but after about 50 epochs it slowly starts to print other characters.

So this time I will increase the batch size to 20.

c     s  Ce <EOS>9f            eoh 
epoch #496	train_loss: 1.33e-01	valid_loss: 0.00e+00


cd  s s s    e      ee         
epoch #497	train_loss: 1.34e-01	valid_loss: 0.00e+00


c     e o  e        ere r   rt 
epoch #498	train_loss: 1.34e-01	valid_loss: 0.00e+00


c-   f          h r e   s e   e
epoch #499	train_loss: 1.33e-01	valid_loss: 0.00e+00


c    e  e    ee ree     th Bae 
epoch #500	train_loss: 1.33e-01	valid_loss: 0.00e+00

For the first 200 epochs it prints only spaces, but after that it starts to print results like the above.
The graph for this run is in the attached plot.

Thank you, sir.
Conclusions:
1) It is good for batch sizes 1 and 5.
2) It is okay for batch size 20.
3) It is bad (printing spaces again and again) for batch size 32.

Waiting for your reply, sir.

Good to see that the network is learning now.
A few things:

  1. What is your training data like, and what output are you expecting once the model is trained? Also, how big are the train and valid splits?
  2. If you're sure that the model is training correctly (the loss decreases and backpropagation happens correctly), you can start playing with the hyperparameters:
    a. Change the dim sizes, e.g. n_hidden = 128, emb_size = 128, etc.
    b. Also, for the line self.linear = nn.Linear(n_hidden, n_hidden) in your model, I think the Linear dims should be (n_hidden, n_classes). Use distinct values to avoid confusion.
    c. Modify the learning rate, e.g. lr = 0.01; see the sketch below.
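Roughly what I mean, as an untested sketch (the exact sizes are only examples; 83 stands in for your number of characters):

import torch.nn as nn
import torch.optim as optim

class CharRNN(nn.Module):
    def __init__(self, vocab_dim=83, emb_size=128, n_hidden=256, n_layers=2, drop_prob=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_dim, emb_size)
        self.lstm = nn.LSTM(emb_size, n_hidden, n_layers, dropout=drop_prob)
        self.dropout = nn.Dropout(0.1)
        self.linear = nn.Linear(n_hidden, vocab_dim)   # out_features = number of classes

    def forward(self, x):                              # x: [fixed_length, batch] of char indices
        out, _ = self.lstm(self.embedding(x))          # default zero initial states
        return self.linear(self.dropout(out))          # logits: [fixed_length, batch, vocab_dim]

model = CharRNN()
optimizer = optim.Adam(model.parameters(), lr=0.01)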


Sir, the model is learning pretty well now. Thank you so much, sir. I will continue from here.