Hi. Hopefully this isn’t off topic, but since my code is in PyTorch, I was wondering if someone in this community could answer this question.

I’m building a model similar to https://arxiv.org/pdf/1708.02182.pdf and using perplexity to assess performance. I thought I was calculating it correctly, but my model converges to a perplexity of around 1.2 on wikitext-103 (similar for wikitext-2). In my forward function, I concatenate the last outputs of my bidirectional LSTM (the final time step of the forward direction and the first time step of the backward direction) and pass the result through a fully connected layer:

```
conc = torch.cat((out[-1,:,:self.hidden_dim], out[0,:,self.hidden_dim:]), dim=1)
output = self.dropout(conc)
output = self.fc(output)
return output
```

Then I use cross-entropy loss, scale it to the dimensions of the problem (this is where I might be going wrong), sum it over all batches, and finally take the exponential. Here’s the validation loop:

```
loss_func = nn.CrossEntropyLoss()
val_loss = 0.0
with torch.no_grad():
    for x, y in valid_dl:
        if cuda:
            x = x.cuda()
            y = y.cuda()
        preds = model(x)
        loss = loss_func(preds.view(-1, preds.size(2)), y.view(-1).long())
        val_loss += loss.item() * x.size(0) / x.size(1)
val_loss /= len(valid_dl)
print('Ppl: {:6.2f},'.format(math.exp(val_loss)))
```
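
For comparison, perplexity is conventionally defined as the exponential of the average per-token negative log-likelihood. The following is only a minimal sketch of that definition, not the loop above: `perplexity_from_batches` is a hypothetical helper, and it assumes each batch provides logits of shape `(num_tokens, vocab_size)` and integer targets of shape `(num_tokens,)`.

```python
import math

import torch
import torch.nn.functional as F


def perplexity_from_batches(batches):
    """Return exp(mean per-token cross-entropy) over (logits, targets) pairs.

    Hypothetical helper: assumes logits are (num_tokens, vocab_size) and
    targets are (num_tokens,) integer class indices.
    """
    total_nll = 0.0
    total_tokens = 0
    for logits, targets in batches:
        # Sum rather than average, so batches of different sizes
        # are weighted by their token counts.
        total_nll += F.cross_entropy(logits, targets, reduction="sum").item()
        total_tokens += targets.numel()
    return math.exp(total_nll / total_tokens)


# Sanity check: uniform logits over V classes should give a perplexity of V.
vocab_size = 50
logits = torch.zeros(8, vocab_size)
targets = torch.randint(0, vocab_size, (8,))
ppl = perplexity_from_batches([(logits, targets)])
```

With this definition, a model that predicts uniformly over the vocabulary scores a perplexity equal to the vocabulary size, which makes a handy baseline when checking a validation loop.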

I just checked my run and this value has converged to 1.2; it should be in the 60s or higher.

My data loader is pretty simple and hints at how I build batches:

```
def __getitem__(self, idx):
    start = self.itoklist.batch_start_end[idx][0]
    end = self.itoklist.batch_start_end[idx][1]
    x = self.itoklist.batch_matrix[start:end, :]
    y = self.itoklist.batch_matrix[start + 1:end + 1, :]
    return x, y.contiguous()
```
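
To make the slicing concrete, here is a toy sketch of the same indexing. It assumes `batch_matrix` is laid out as time steps (rows) by sequences (columns); the tensor contents and the `start`/`end` values are made up for illustration.

```python
import torch

# Toy stand-in for batch_matrix: 6 time steps (rows) x 2 sequences (columns).
batch_matrix = torch.arange(12).view(6, 2)

start, end = 0, 4  # hypothetical window boundaries
x = batch_matrix[start:end, :]
y = batch_matrix[start + 1:end + 1, :]

# Each target row is the input row one time step later.
shift_ok = torch.equal(y[:-1], x[1:])
```

Under that layout, `y` is `x` shifted forward by one time step, so every position in `x` has a next-token target in `y`.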

Any help would be appreciated.

Thanks.