[Solved] LSTM POS Tagger (with char-level features implementation): No backprop on char LSTM

Hello, I tried to complete the exercise on the LSTM POS tagger and implemented the char-level features with a second LSTM, feeding its output into the main one by concatenating it with the original word embedding.

The code runs and trains (it takes the concatenated word+char embedding as input), but there’s no backprop on the char_lstm side. I verified this by printing some of its weights during the epochs: they remained constant.

@smth @ptrblck There isn’t a clean implementation of it out there to refer to. I was hoping this could be it :slight_smile:

Any ideas? Thanks! :smile:

Code:

import string
from random import shuffle

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)


def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    return torch.LongTensor(idxs)
    
    
training_data = [
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"])
]

# Dictionaries
word_to_ix = {}
tag_to_ix = {}
char_to_ix = {}

# Constants
CHAR_EMBEDDING_DIM = 6
CHAR_HIDDEN_DIM = 4

EMBEDDING_DIM = 6
HIDDEN_DIM = 6

# Computing Word & Tag Dictionaries
for sent, tags in training_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)
    for tag in tags:
        if tag not in tag_to_ix:
            tag_to_ix[tag] = len(tag_to_ix)
            
# Computing the character dictionary from the English alphabet (case sensitive)
allchars = [i for i in string.ascii_lowercase + string.ascii_uppercase]
shuffle(allchars) # To not rely on any inherent ordering 

for char in allchars:
    if char not in char_to_ix:
        char_to_ix[char] = len(char_to_ix)


class char_LSTM(nn.Module):
    '''El Chapo'''
    def __init__(self, char_embedding_dim, char_hidden_dim, charset_size):
        super(char_LSTM, self).__init__()
        
        self.char_hidden_dim = char_hidden_dim
        self.char_embedding = nn.Embedding(charset_size, char_embedding_dim)
        self.lstm = nn.LSTM(char_embedding_dim, char_hidden_dim)
        self.char_hidden = self.init_hidden()
        
    def init_hidden(self):
        '''Initialize the hidden state'''
        return (torch.rand(1,1,self.char_hidden_dim),
               torch.rand(1,1,self.char_hidden_dim))
    
    def forward(self,single_word):
        '''Return the final hidden state, a.k.a. the char embedding (this encodes dense character features)'''
        char_embeds = self.char_embedding(single_word)
        _, self.char_hidden = self.lstm(char_embeds.view(len(single_word),1,-1),self.char_hidden)
        self.char_hidden = self.init_hidden()
        return self.char_hidden[0]

class LSTMTagger(nn.Module):
    '''GodFather'''
    def __init__(self, embedding_dim, hidden_dim, char_embedding_dim, char_hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim
        
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.char_LSTM_embedding = char_LSTM(char_embedding_dim,char_hidden_dim,len(char_to_ix))
        # note: LSTM input size is embedding_dim+char_hidden_dim to play nicely with concatenation
        self.lstm = nn.LSTM(embedding_dim+char_hidden_dim, hidden_dim)
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)
        self.hidden = self.init_hidden()
        
    def init_hidden(self):
        '''Initialize the hidden state'''
        return (torch.rand(1,1,self.hidden_dim),
               torch.randn(1,1,self.hidden_dim))
    
    def concat_embeddings(self,some_word_embedding_tensor, some_character_embedding_tensor):
        ''' Concatenate the word embedding and character embedding into a single tensor. Do this for all words'''
        combo = []
        for w,c in zip(some_word_embedding_tensor,some_character_embedding_tensor):
            combo.append(torch.cat((w,c)))
        return torch.stack(combo)
    
    def forward(self, sentence, sentence_chars):
        word_embeds = self.word_embeddings(sentence)
        char_embeds = []
        for single_word_char in sentence_chars:
            # iterate through each word and append the character embedding to char_embeds
            char_embeds.append(torch.squeeze(self.char_LSTM_embedding(single_word_char)))
        # Concatenate the word embedding with the char embedding (i.e. the hidden state from the char_LSTM for each word)
        word_char_embeds = self.concat_embeddings(word_embeds, char_embeds)
        lstm_out, self.hidden = self.lstm(word_char_embeds.view(len(sentence), 1, -1), self.hidden)
        tag_space = self.hidden2tag(lstm_out.view(len(sentence),-1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores
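
For reference, a training loop along these lines drives the model (the original loop isn’t shown; the optimizer, learning rate, loss, epoch count, and per-word character lookup below are assumptions, not necessarily what was actually used):

model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, CHAR_EMBEDDING_DIM, CHAR_HIDDEN_DIM,
                   len(word_to_ix), len(tag_to_ix))
loss_function = nn.NLLLoss()  # pairs with the log_softmax output of the tagger
optimizer = optim.SGD(model.parameters(), lr=0.1)

for epoch in range(10):
    for sentence, tags in training_data:
        model.zero_grad()
        # re-initialize the tagger's hidden state so graphs from previous sentences aren't reused
        model.hidden = model.init_hidden()

        sentence_in = prepare_sequence(sentence, word_to_ix)
        # one tensor of character indices per word (iterating a string yields its characters)
        sentence_chars = [prepare_sequence(word, char_to_ix) for word in sentence]
        targets = prepare_sequence(tags, tag_to_ix)

        tag_scores = model(sentence_in, sentence_chars)
        loss = loss_function(tag_scores, targets)
        loss.backward()
        optimizer.step()
    print("EPOCH:", epoch, "loss", loss.item())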

Skimming through your code, I couldn’t find any obvious issues.
Could you check if the gradients are properly calculated?

print(model.char_LSTM_embedding.lstm.weight_hh_l0.grad)
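
Placed right after the backward call, for example (a quick sketch; it assumes your loop calls loss.backward() each step):

loss.backward()
print(model.char_LSTM_embedding.lstm.weight_hh_l0.grad)
# or sweep every parameter of the char LSTM at once:
for name, param in model.char_LSTM_embedding.named_parameters():
    print(name, param.grad)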

Interesting. So I printed that out after each epoch to get this:

EPOCH: 0 loss 2.3234457969665527
None
EPOCH: 1 loss 2.202385425567627
None
EPOCH: 2 loss 2.2719340324401855
None
EPOCH: 3 loss 2.2589669227600098
None
EPOCH: 4 loss 2.2334115505218506
None
EPOCH: 5 loss 2.183642864227295
None
EPOCH: 6 loss 2.1640632152557373
None
EPOCH: 7 loss 2.1206531524658203
None
EPOCH: 8 loss 2.122248649597168
None

The weight_hh_l0 parameter definitely has the grad and grad_fn attributes, but they show up as None.

If I just print out model.char_LSTM_embedding.lstm.weight_hh_l0, I get:

Parameter containing:
tensor([[-0.1937, -0.0163,  0.0048,  0.4731],
        [ 0.4806, -0.1530,  0.1574,  0.2356],
        [ 0.4753, -0.3674, -0.0729, -0.4358],
        [-0.0319,  0.1454, -0.2711,  0.0383],
        [-0.3419, -0.3858,  0.4250, -0.4939],
        [ 0.0769, -0.2238,  0.3595,  0.1102],
        [-0.2952,  0.3039, -0.4631, -0.0418],
        [-0.1510,  0.2949,  0.4648,  0.1492],
        [ 0.2527,  0.0864, -0.1488, -0.1762],
        [ 0.3032, -0.1722,  0.0164, -0.2508],
        [ 0.0289,  0.1599, -0.1730, -0.3345],
        [ 0.2081, -0.4771,  0.2442, -0.1871],
        [-0.3381, -0.4893, -0.3698,  0.1031],
        [-0.1091,  0.4548,  0.1539, -0.2196],
        [-0.0788,  0.0639,  0.3586, -0.1388],
        [ 0.2838, -0.1048, -0.2579,  0.1436]], requires_grad=True)

I wonder why the loss doesn’t backprop through to the char_LSTM module. :frowning:

Thanks so much for replying btw! :slight_smile:

Thanks for the information.
I think I might have missed these lines in char_LSTM's forward before:

_, self.char_hidden = self.lstm(char_embeds.view(len(single_word),1,-1),self.char_hidden)
self.char_hidden = self.init_hidden()

It seems you should change the order of these lines, since currently you are re-initializing the hidden state in each forward pass right after running your lstm. The hidden state you return is therefore a freshly initialized tensor with no connection to the computation graph, so the loss can never backpropagate into the char LSTM.
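
That is, something along these lines (a sketch of the corrected forward):

    def forward(self, single_word):
        '''Return the final hidden state, i.e. the char embedding'''
        self.char_hidden = self.init_hidden()  # reset BEFORE running the LSTM
        char_embeds = self.char_embedding(single_word)
        _, self.char_hidden = self.lstm(char_embeds.view(len(single_word), 1, -1), self.char_hidden)
        return self.char_hidden[0]  # this hidden state now carries grad_fn back into the graph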

Yes!!

It now shows! :slightly_smiling_face:

EPOCH: 0 loss 2.245065689086914
tensor([[-3.8246e-04,  2.1555e-04,  1.2193e-04,  2.0082e-04],
        [-2.8758e-04,  1.1619e-04,  1.9071e-04,  7.6792e-05],
        [-3.1537e-05, -1.1242e-05, -1.6944e-05,  5.3930e-05],
        [ 1.9776e-04, -1.0734e-04,  1.3045e-04,  4.6559e-05],
        [-4.7442e-04,  1.5603e-04,  2.1428e-05,  3.2986e-05],
        [-2.8206e-04,  9.5163e-05,  1.2406e-04,  4.7455e-05],
        [ 5.1199e-05,  1.6658e-04,  7.0462e-05,  2.4822e-05],
        [-1.6758e-05,  2.2051e-05,  4.3919e-07,  2.0111e-05],
        [ 2.0732e-04, -6.2676e-04,  2.2783e-04, -1.6364e-04],
        [-7.4985e-04,  6.7677e-04, -2.5997e-04,  1.8289e-04],
        [ 6.8787e-05,  4.0990e-04,  8.4813e-05,  1.0382e-04],
        [-2.0091e-04,  3.8331e-05,  1.9880e-04, -2.5443e-05],
        [-3.0217e-04,  1.7034e-04,  3.4256e-05,  5.5788e-05],
        [-4.9970e-04,  9.8152e-05,  2.3805e-04, -5.7322e-06],
        [-1.7798e-04, -2.1839e-05, -1.3361e-04, -5.3941e-05],
        [ 6.4214e-05, -1.7047e-04,  1.0869e-04, -2.1387e-05]])
EPOCH: 1 loss 2.224614143371582
tensor([[-0.0003,  0.0002,  0.0001,  0.0001],
        [-0.0002,  0.0001,  0.0003,  0.0000],
        [-0.0000, -0.0001, -0.0000,  0.0000],
        [ 0.0001, -0.0001,  0.0001,  0.0001],
        [-0.0004,  0.0002,  0.0001,  0.0000],
        [-0.0003,  0.0001,  0.0001, -0.0000],
        [ 0.0002,  0.0001,  0.0000,  0.0000],
        [-0.0000, -0.0000, -0.0000, -0.0000],
        [ 0.0002, -0.0005,  0.0004, -0.0001],
        [-0.0010,  0.0005, -0.0000,  0.0001],
        [ 0.0004,  0.0002, -0.0001,  0.0001],
        [-0.0004, -0.0000,  0.0007, -0.0002],
        [-0.0002,  0.0001,  0.0000,  0.0000],
        [-0.0005,  0.0001,  0.0003, -0.0001],
        [-0.0002, -0.0000, -0.0000, -0.0001],
        [ 0.0000, -0.0001,  0.0001, -0.0000]])
EPOCH: 2 loss 2.190521478652954
tensor([[-0.0002,  0.0002,  0.0001,  0.0002],
        [-0.0002,  0.0001,  0.0002,  0.0001],
        [ 0.0000, -0.0000, -0.0000,  0.0001],
        [ 0.0001, -0.0001,  0.0001,  0.0000],
        [-0.0005,  0.0002, -0.0000, -0.0001],
        [-0.0003,  0.0001,  0.0001,  0.0000],
        [ 0.0001,  0.0001,  0.0000,  0.0001],
        [-0.0000, -0.0000, -0.0000, -0.0000],
        [ 0.0002, -0.0004,  0.0000, -0.0002],
        [-0.0008,  0.0006, -0.0002,  0.0001],
        [ 0.0004,  0.0003, -0.0001,  0.0003],
        [-0.0005, -0.0001,  0.0006, -0.0002],
        [-0.0002,  0.0001, -0.0000,  0.0000],
        [-0.0005,  0.0001,  0.0002, -0.0000],
        [-0.0001, -0.0000, -0.0001, -0.0001],
        [ 0.0000, -0.0001,  0.0001, -0.0000]])
EPOCH: 3 loss 2.196777105331421
tensor([[-3.5749e-04,  1.3554e-04,  1.8028e-04,  1.6417e-04],
        [-2.2018e-04,  7.7530e-05,  1.7841e-04,  4.2579e-05],
        [-4.6930e-05,  4.5388e-06,  2.5507e-05,  1.2351e-04],
        [ 1.6042e-04, -3.2516e-05,  4.8225e-05,  3.4275e-05],
        [-5.4236e-04,  1.9008e-04,  1.2001e-04,  3.9673e-05],
        [-2.9503e-04,  9.7923e-05,  7.1795e-05,  2.3484e-06],
        [-9.2064e-07,  1.2369e-04,  1.2139e-04,  1.1705e-04],
        [-9.8637e-06, -1.8803e-05, -4.1882e-06, -2.8696e-06],
        [ 3.6163e-04, -4.7896e-04,  1.4343e-04, -2.2040e-04],
        [-8.3896e-04,  5.1561e-04, -2.8873e-05,  2.0739e-04],
        [-1.6798e-04,  3.1492e-04,  2.7706e-04,  2.6071e-04],
        [-4.1065e-04,  5.1932e-05,  2.5442e-04, -5.4233e-05],
        [-2.8720e-04,  1.4935e-04,  5.5944e-05,  5.0228e-05],
        [-4.5585e-04,  8.2353e-05,  2.0688e-04, -3.6975e-05],
        [-1.9315e-04, -1.3623e-05, -1.4761e-04, -1.0804e-05],
        [ 6.2935e-05, -1.1300e-04,  5.0673e-05, -5.8553e-05]])
EPOCH: 4 loss 2.179868459701538
tensor([[-3.5148e-04,  9.1752e-05,  1.4228e-04,  1.7029e-04],
        [-1.0502e-04,  7.8332e-05,  9.0784e-05,  6.4270e-05],
        [-4.9014e-05, -4.5404e-05, -3.8506e-05,  4.5116e-05],
        [ 1.3610e-04, -3.4125e-05, -3.3795e-06, -3.2209e-05],
        [-5.8784e-04,  1.5989e-04,  1.9802e-04,  6.7654e-05],
        [-2.3988e-04,  1.2917e-04,  5.5225e-05,  8.0370e-06],
        [ 1.9398e-04,  1.3447e-04,  1.3801e-04,  5.0111e-05],
        [-3.6221e-06, -1.7676e-05,  3.2276e-05,  1.1954e-05],
        [ 2.1130e-04, -7.2388e-04,  3.2327e-04, -2.5662e-04],
        [-9.7373e-04,  4.7264e-04,  1.7635e-04,  1.9308e-04],
        [ 3.4610e-04,  3.8229e-04, -5.5605e-05,  1.2019e-04],
        [-5.0158e-04,  1.9040e-04,  3.5456e-05,  4.1480e-05],
        [-2.8807e-04,  1.2262e-04,  4.9847e-05,  4.9944e-05],
        [-3.4162e-04,  1.0266e-04,  9.3150e-05, -9.3202e-06],
        [-1.5973e-04,  3.4818e-05, -1.3316e-04, -6.2089e-05],
        [ 6.6439e-05, -1.0870e-04,  5.7728e-05, -7.0024e-05]])

Thanks so much man! :smile:

Since we only have 2 sentences in our training data, there’s no real way to measure the performance improvement, right? Maybe I’ll write a post about it and share it with the community! :slight_smile:

Thanks again!

I’m glad it’s working!

Yeah, using just two sentences doesn’t really tell you much about the model, but a blog post explaining your model and your ideas behind it sounds like an awesome idea! :slight_smile:
