Character language model spits out gibberish

I followed Karpathy’s tutorial to build a character-level language model using an RNN. However, the loss doesn’t seem to decrease, and after training the model spits out gibberish. Here is the code for the model.

# Imports assumed by the snippets below (N, A, and O are the aliases used throughout).
import logging

import numpy
import torch
import torch.autograd as A
import torch.nn as N
import torch.optim as O


class LSTMWriter(N.Module):
    
    def __init__(self,vocab_size,n_layers=1):
        super(LSTMWriter,self).__init__()
        self.vocab_size=vocab_size
        self.n_layers=n_layers
        self.embedding=N.Embedding(self.vocab_size,10)
        self.lstm=N.GRU(10,10,n_layers,batch_first=True)
        self.dropout=N.Dropout(0.1)
        self.linear=N.Linear(10,self.vocab_size)
        self.linear2=N.Linear(self.vocab_size,self.vocab_size)

    def init_hidden(self,batch_size=1):
        h=A.Variable(torch.zeros(self.n_layers,batch_size,10))
        c=A.Variable(torch.zeros(self.n_layers,batch_size,10))
        return h
        
    def forward(self,sequence,hidden):
        embedding=self.embedding(sequence)
        recurrent=embedding
        for i in range(self.n_layers):
            recurrent,hidden=self.lstm(recurrent,hidden)
        flattened=self.linear(recurrent.contiguous().view(recurrent.size(0)*recurrent.size(1),recurrent.size(2)))
        #flattened=self.dropout(flattened)
        flat=flattened.view(recurrent.size(0),recurrent.size(1),flattened.size(1))
        return flat,hidden

The code for training is:

    dataset=Dataset(storage["raw_data"],storage["word_dict"])
    vocab_size=len(storage["word_dict"])
    word_dict=storage["word_dict"]
    rev_dict=storage["rev_dict"]
    del storage
    data=dataset.get_dataset()
    loss_fn=N.CrossEntropyLoss()

    model=LSTMWriter(vocab_size,4)
    optimizer=O.Adam(model.parameters(),lr=0.1)

    for i in range(10):
        hidden=model.init_hidden(1000)
        batch_generator=get_batches(data[:,:200])
        total_loss=0
        b=1
        while b:
            model.zero_grad()
            try:
                _X,_y=next(batch_generator)
            except:
                break
            train=A.Variable(torch.from_numpy(_X))
            targets=A.Variable(torch.from_numpy(_y).contiguous().view(-1))
            out,hidden=model.forward(train,A.Variable(hidden.data))
            loss=loss_fn(out.contiguous().view(-1,vocab_size),targets)
            loss.backward()
            optimizer.step()
            total_loss+=loss.data[0]
            if b % 10 ==0:
                logging.info("Epoch :{} Batches :{} Loss :{}".format(i,b,total_loss))
                total_loss=0
            b+=1
    generate_text(data[0][3:3+10],model)

And the code for text generation is:

def generate_text(X,model):
    hidden=model.init_hidden(1)
    input_=numpy.array([X])
    _,hidden=model.forward(A.Variable(torch.from_numpy(input_)),hidden)
    gen_str=""
    for i in range(100):
        out,hidden=model.forward(A.Variable(torch.from_numpy(input_)),hidden)
        out=out[:,-1].data.exp()
        char=torch.multinomial(out,1)[0][0]
        gen_str+=rev_dict[char]
        input_=numpy.append(input_.squeeze(),char)
        input_=numpy.array([input_[1:]],dtype=numpy.long)
    print(gen_str)

After training for a considerable time, the output is complete gibberish:

ft t t ttelhnFteuml aasd n nrms eoc eivwcwi td do i ,yactwobe ipn9 lethihy hep to tgeifrroiov

Any ideas what might be going wrong?

I think the GRU module already does the loop over n_layers for you, so your network is effectively doing n*n passes through the recurrent stack, which might be causing the problem.
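
For reference, here is a minimal sketch (reusing the 10-dimensional sizes from the model above, with a made-up batch size) of how a stacked nn.GRU is intended to be used: num_layers is given to the constructor, and a single forward call runs the input through all of the stacked layers.

    import torch
    import torch.nn as nn

    # num_layers=4 tells the module to stack four GRU layers internally.
    gru = nn.GRU(input_size=10, hidden_size=10, num_layers=4, batch_first=True)

    x = torch.zeros(32, 200, 10)   # (batch, seq_len, features)
    h0 = torch.zeros(4, 32, 10)    # (num_layers, batch, hidden_size)

    # One call is enough; no Python loop over the layers is needed.
    out, hn = gru(x, h0)
    print(out.shape)               # torch.Size([32, 200, 10])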

Hey David, thanks for pointing that out. That was one of the problems, and the error steadily decreases now. However, after a few epochs it goes back up and plateaus at roughly where it started. Any idea why that happens?

I started playing with this this afternoon and had a similar problem. In the end, I added an activation to the final linear layer and it seems to learn.

Hey David. Thanks for looking into it. I have been trying a few things too. I applied a softmax activation to the outputs, and still no luck. The loss does decrease, but it starts increasing again even before the first epoch is complete. I tried stopping training at the lowest loss and had the model generate text from a seed taken from the training data, and it still spits out complete gibberish.

This is what I tried. It’s also mostly gibberish, but perhaps I need to train it longer, because the loss is going down, although very slowly now.

class SimpleGRU(nn.Module):
    def __init__(self, vocab_size, emb_size, hid_size, batch_size, seq_len, n_layers=1):
        super(SimpleGRU, self).__init__()
        self.vocab_size = vocab_size
        self.emb_size = emb_size
        self.hid_size = hid_size
        self.n_layers = n_layers
        self.batch_size = batch_size
        self.seq_len = seq_len
        self.emb = nn.Embedding(vocab_size, emb_size)
        self.batchnorm = nn.BatchNorm1d(emb_size-1)
        self.gru = nn.GRU(emb_size, hid_size, batch_first=True)
        self.fc1 = nn.Linear(seq_len * hid_size, vocab_size)
        self.selu = nn.SELU()
        self.logsoftmax = nn.LogSoftmax()

    def forward(self, input, hidden):
        x = self.emb(input)
        x = self.batchnorm(x)
        x, hidden = self.gru(x, hidden)
        x = x.contiguous().view(self.batch_size, -1)
        x = self.selu(self.fc1(x))
        x = self.logsoftmax(x)
        return x, hidden

I’ll run that model and see. Any specific reason why you applied two activation functions to a single layer? The last layer has a SELU and a log softmax back to back, with no layer in between. Also, you are applying the log softmax to an output that has the timesteps stacked horizontally. For instance, consider the sequence [1,2,3,4]. If I put in 1,2,3, we train the RNN to output 2,3,4. But in your case, because of x = x.contiguous().view(self.batch_size, -1), you are stacking the timesteps that should predict 2,3,4 horizontally and then applying the log softmax over that flattened vector. So the RNN is forced to choose just one character, and it will try to learn 4, since that is the label in the dataset. I, on the other hand, transform it this way:
x = x.contiguous().view(batch_size * seq_len, vocab_size)
This way, I stack 2,3,4 vertically, and if I apply a softmax, I get three separate distributions across the three timesteps. Then I can train the RNN to output 2,3,4 in that order, depending on the input.
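
To make that reshape concrete, here is a minimal sketch of computing a per-timestep cross-entropy loss on the flattened outputs (batch_size, seq_len, and vocab_size below are made-up placeholder sizes):

    import torch
    import torch.nn as nn

    batch_size, seq_len, vocab_size = 4, 8, 50   # placeholder sizes for illustration

    # Pretend these are the per-timestep logits from the model and the target characters.
    logits = torch.randn(batch_size, seq_len, vocab_size)
    targets = torch.randint(0, vocab_size, (batch_size, seq_len))

    # Stack the timesteps vertically: one row of logits per (sequence, timestep) pair.
    flat_logits = logits.contiguous().view(batch_size * seq_len, vocab_size)
    flat_targets = targets.contiguous().view(-1)

    # CrossEntropyLoss then scores every timestep independently.
    loss = nn.CrossEntropyLoss()(flat_logits, flat_targets)
    print(loss.item())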

No good reason, I was just playing around with stuff.

So I finally got it working. I had to remove the activation on the output layer, and the loss now decreases as expected. But it was just a quick fix and doesn’t make any sense to me: assuming a multinomial distribution over characters, there has to be a softmax at the end to produce a distribution to sample from, yet I had to remove the softmax to get it to work. This is just bizarre.

@adityashinde1506 I don’t think it’s bizarre. CrossEntropyLoss includes the softmax computation (see the docs).
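
For anyone who hits the same confusion later, a minimal sketch of the equivalence: CrossEntropyLoss on raw logits gives the same value as NLLLoss on log-softmax outputs, so the model should return raw logits during training and apply the softmax only when sampling.

    import torch
    import torch.nn as nn

    logits = torch.randn(5, 20)                 # 5 samples, 20 classes, no activation applied
    targets = torch.randint(0, 20, (5,))

    # CrossEntropyLoss = LogSoftmax + NLLLoss, so these two losses match.
    ce = nn.CrossEntropyLoss()(logits, targets)
    nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)
    print(torch.allclose(ce, nll))              # True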

Ah… Makes sense. Thanks for letting me know.