Computing perplexity

Hi. Hopefully this isn't off topic, but since my code is in PyTorch I was wondering if someone in this community could answer this question.

I'm building a model similar to https://arxiv.org/pdf/1708.02182.pdf and using perplexity as a way to assess performance. I thought I was calculating it correctly, but my model converges to perplexity levels down in the 1.2 range on wikitext-103 (similar for wikitext-2). In my forward function, I concatenate the last output of my bidirectional LSTM and pass it through a fully-connected layer:

# concatenate the forward direction's last time step with the backward direction's first time step
conc = torch.cat((out[-1, :, :self.hidden_dim], out[0, :, self.hidden_dim:]), dim=1)
output = self.dropout(conc)
output = self.fc(output)
return output

Then I use cross-entropy loss, scale it to the dimensions of the problem (this is where I might be going wrong?), sum it over all batches, and finally take the exponent. Here's the validation loop:

import math
import torch.nn as nn

loss_func = nn.CrossEntropyLoss()

val_loss = 0.0
with torch.no_grad():
    for x, y in valid_dl:
        if cuda:
            x = x.cuda()
            y = y.cuda()

        preds = model(x)
        loss = loss_func(preds.view(-1, preds.size(2)), y.view(-1).long())

        # scale the mean loss to the dimensions of the problem (the step I'm unsure about)
        val_loss += loss.item() * x.size(0) / x.size(1)

val_loss /= len(valid_dl)

print('Ppl: {:6.2f}'.format(math.exp(val_loss)))
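
For what it's worth, here is the same loop rewritten to average per token instead of using the seq_len/batch_size scaling, which is how I understand perplexity is usually defined (just a sketch; the shapes are assumed to match my snippet above, and I'm not sure this is where my problem is):

# per-token average: sum the negative log likelihood over every target token,
# divide by the number of tokens, then exponentiate
import math
import torch
import torch.nn as nn

loss_sum_func = nn.CrossEntropyLoss(reduction='sum')   # sum over all tokens in the batch

total_nll, total_tokens = 0.0, 0
with torch.no_grad():
    for x, y in valid_dl:
        if cuda:
            x, y = x.cuda(), y.cuda()
        preds = model(x)
        total_nll += loss_sum_func(preds.view(-1, preds.size(2)), y.view(-1).long()).item()
        total_tokens += y.numel()

print('Ppl: {:6.2f}'.format(math.exp(total_nll / total_tokens)))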

I just checked my run and this value has converged to 1.2; it should be somewhere in the 60s or above.

My data loader is pretty simple and hints at how I build batches:

def __getitem__(self, idx):
    # x is a slice of the batch matrix; y is the same slice shifted down by one row
    start = self.itoklist.batch_start_end[idx][0]
    end = self.itoklist.batch_start_end[idx][1]
    x = self.itoklist.batch_matrix[start:end, :]
    y = self.itoklist.batch_matrix[start + 1:end + 1, :]
    return x, y.contiguous()

Any help would be appreciated.

Thanks.

This is a very old thread, but I’m running into the same problem. Did you ever find your issue?

Any solutions? I'm running into this issue too. It seems the model trains, but the perplexity is not calculated correctly; when I switch to unidirectional LSTM layers, the perplexity looks like a valid value.

So perplexity for unidirectional models works like this: after feeding c_0 … c_n, the model outputs a probability distribution p over the alphabet, the loss for that step is -log p(c_{n+1}), where c_{n+1} is taken from the ground truth, and perplexity is exp of the average of that loss over your validation set.
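
In code, that definition is roughly the following sketch (`model` and `valid_dl` stand in for your own LM and validation data; y is assumed to be x shifted by one token):

# take -log p(c_{n+1}) at every position, average over the validation set,
# and exponentiate
import math
import torch
import torch.nn.functional as F

nll_sum, count = 0.0, 0
with torch.no_grad():
    for x, y in valid_dl:
        logits = model(x)                                 # (seq_len, batch, vocab)
        log_p = F.log_softmax(logits, dim=-1)
        # log probability of the ground-truth next token at every position
        nll_sum += -log_p.gather(-1, y.long().unsqueeze(-1)).sum().item()
        count += y.numel()

print('perplexity:', math.exp(nll_sum / count))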

Now you take a bidirectional model, feed ??? and the model outputs a probability distribution ??? and take the probability of ???.

I don't think perplexity makes much sense for bidirectional models, because there is no next character when you feed from both directions.
As a practical example, when I last looked, fast.ai trained separate forward and backward LMs and then evaluated the perplexity of each.
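
To make that concrete, here is a rough sketch of the setup (my own illustration with made-up sizes, not the fast.ai code); the point is that each direction is an ordinary unidirectional LM with its own perplexity:

# two separate unidirectional LMs; the "backward" one simply reads the
# sequence reversed. Assumptions: batches are (seq_len, batch) LongTensors
# and y is x shifted by one position.
import math
import torch
import torch.nn as nn

class UniLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim)          # unidirectional on purpose
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):                                 # x: (seq_len, batch)
        out, _ = self.lstm(self.emb(x))
        return self.fc(out)                               # (seq_len, batch, vocab)

def perplexity(model, batches):
    loss_func = nn.CrossEntropyLoss(reduction='sum')      # sum, so we can divide by token count
    total, tokens = 0.0, 0
    with torch.no_grad():
        for x, y in batches:
            logits = model(x)
            total += loss_func(logits.reshape(-1, logits.size(-1)), y.reshape(-1)).item()
            tokens += y.numel()
    return math.exp(total / tokens)

vocab_size = 30000                                        # placeholder vocabulary size
forward_lm, backward_lm = UniLM(vocab_size), UniLM(vocab_size)
# ... train forward_lm on (x, y) and backward_lm on the reversed direction ...

fwd_ppl = perplexity(forward_lm, ((x, y) for x, y in valid_dl))
# after flipping the time axis, inputs and targets swap roles:
# the backward LM reads c_n ... c_1 and predicts c_{n-1} ... c_0
bwd_ppl = perplexity(backward_lm, ((torch.flip(y, [0]), torch.flip(x, [0]))
                                   for x, y in valid_dl))
print(f'forward ppl: {fwd_ppl:.2f}, backward ppl: {bwd_ppl:.2f}')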

Somewhat related: https://www.scribendi.ai/can-we-use-bert-as-a-language-model-to-assign-score-of-a-sentence/

Best regards

Thomas

P.S.: I don't think it's a good habit to cross-post questions to Stack Overflow without linking that here.

Thanks for your help. I just don't understand how we can train separate forward and backward models and evaluate perplexity on both. I have seen some posts and comments in forums that also mention separate training.
With a bidirectional model we feed the network the same input as the unidirectional model, so the input of the LSTM is shared between the forward and backward layers; the outputs of these layers can then be summed, averaged, or concatenated and passed through the next layers or the final (dense) layer of the model, which produces the output probabilities. We can use these probabilities and the target vector to calculate perplexity. I wanted to ask whether my interpretation is correct?

Thanks.

P.S.: Sorry for not including the link, I forgot to do it. I'll follow the right habit next time.

So if you take the template

Now you take a bidirectional model, feed ??? and the model outputs a probability distribution ??? and take the probability of ???.

and try to make it as precise as I did for the unidirectional case:

after feeding c_0 … c_n, the model outputs a probability distribution p over the alphabet, the loss for that step is -log p(c_{n+1}), where c_{n+1} is taken from the ground truth, and perplexity is exp of the average of that loss over your validation set.

you find that it is hard (impossible?) to do without having the target somehow show up in the input.

So in your version, you put in “the input” and “the target” where you would have to specify what the input is and what the target is. You could try to feed in prefixes of sentences and then predict the next word, but the internal states are quite different from what they would be if you fed in the next word as well. This is in stark contrast to the unidirectional case.
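
To make the prefix option concrete, here is a sketch of what that evaluation could look like (my illustration; `bi_model` is assumed to take a (prefix_len, batch) LongTensor and return one set of logits per position):

# for each position t the model only sees tokens 0..t, so the backward
# direction cannot peek at the target. This is O(n^2) per sequence and,
# as noted above, still not the same conditioning as during training.
import math
import torch
import torch.nn.functional as F

def prefix_perplexity(bi_model, tokens):                  # tokens: 1-D LongTensor c_0 ... c_N
    nll, count = 0.0, 0
    with torch.no_grad():
        for t in range(len(tokens) - 1):
            prefix = tokens[: t + 1].unsqueeze(1)         # (t+1, batch=1)
            logits = bi_model(prefix)                     # (t+1, 1, vocab) assumed
            log_p = F.log_softmax(logits[-1, 0], dim=-1)
            nll += -log_p[tokens[t + 1]].item()           # -log p(c_{t+1})
            count += 1
    return math.exp(nll / count)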

Best regards

Thomas

I did some more research and found some suggestions for the problem that @tom explains above, i.e. training a language model with "bidirectional" LSTMs:

To address this problem, the blog post suggests first wrapping each sentence with start and end of sentence tags, and then, instead of concatenating the corresponding outputs of each layer, concatenating the forward and backward layers so that they are predicting the same token (a sketch of my reading of this is below).
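
Here is a small sketch of how I read that suggestion (my own reading, not the blog's exact code). Assumptions: `tokens` is a (seq_len, batch) LongTensor whose first and last rows are the start/end tags, `out` is the full (seq_len, batch, 2*hidden_dim) output of the bidirectional LSTM over it, and `fc` mirrors self.fc from the first post:

import torch
import torch.nn as nn

loss_func = nn.CrossEntropyLoss()

fwd = out[:, :, :hidden_dim]   # forward direction: position i has seen tokens <= i
bwd = out[:, :, hidden_dim:]   # backward direction: position i has seen tokens >= i

# to predict token t (t = 1 .. seq_len-2, everything between the tags),
# concatenate the forward state at t-1 with the backward state at t+1,
# so neither direction has seen the token it is predicting
conc = torch.cat((fwd[:-2], bwd[2:]), dim=2)        # (seq_len-2, batch, 2*hidden_dim)
logits = fc(conc)                                   # (seq_len-2, batch, vocab)
loss = loss_func(logits.reshape(-1, logits.size(-1)),
                 tokens[1:-1].reshape(-1))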

There are also further explanations in this paper.