Unigram (bag of words) sentence log-likelihood (Cross Entropy)


I’m trying to compute the unigram (bag of words) log-probability of a batch of sentences and I’m having trouble with the matching between predictions and targets.

To exemplify the problem:
For each sentence in the batch, I compute a representation of it, which is then used to estimate the probability of each word in the vocabulary appearing in that sentence. My targets consist of the ids of the tokens in each sentence, padded to the max sentence length in the batch. E.g.:

The sentences:

[['The', 'sky', 'is', 'blue', '.'],
['The', 'tail', 'of', 'the', 'dog', '.']]

The targets (token ids, with -1 for padding):

[[0, 1, 2, 3, 4, -1],
[0, 5, 6, 0, 7, 4]]

The predictions (one probability per vocabulary word, vocabulary size 8):

[[0.2, 0.1, 0.15, 0.05, 0.3, 0.05, 0.05, 0.1],
[0.15, 0.25, 0.15, 0.05, 0.05, 0.1, 0.05, 0.2]]


I could always implement my own loss function: use the 'targets' as indices to select elements of 'preds' (ignoring the pad index), take the log of the selected probabilities, sum them per sentence, and average over the batch. (I'm not actually asking a question at this point; just checking in case someone has a better solution :slight_smile:)
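For reference, a minimal sketch of that manual approach, using the example values from above (the function name, `pad_idx` argument, and tensor shapes are my assumptions, not an established API):

```python
import torch

def bow_log_likelihood(preds, targets, pad_idx=-1):
    # preds: (batch, vocab) per-sentence word probabilities
    # targets: (batch, max_len) token ids, pad_idx marking padding
    mask = targets != pad_idx
    # clamp padded ids to a valid index so gather works; masked out below
    safe_targets = targets.clamp(min=0)
    token_probs = preds.gather(1, safe_targets)   # (batch, max_len)
    token_logps = token_probs.log() * mask        # zero out padded positions
    return token_logps.sum(dim=1).mean()          # sum per sentence, mean over batch

preds = torch.tensor([[0.2, 0.1, 0.15, 0.05, 0.3, 0.05, 0.05, 0.1],
                      [0.15, 0.25, 0.15, 0.05, 0.05, 0.1, 0.05, 0.2]])
targets = torch.tensor([[0, 1, 2, 3, 4, -1],
                        [0, 5, 6, 0, 7, 4]])
loss = bow_log_likelihood(preds, targets)
```

Note that repeated tokens (like id 0 in the second sentence) are simply counted twice, which matches the unigram assumption.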

But maybe PyTorch already implements something like this, some variation of the cross entropy function; if so, I haven't been able to find it.
So, does it exist?

To be honest, I don't really understand what you're trying to do. But usually, when you treat text documents as bags of words, each document is represented as a vector over the vocabulary (one-hot or count-based), with length equal to the vocabulary size. Your preds seem to reflect this. In contrast, your sentences and targets look like sequences. I assume you'll have to write your own solution for whatever you're trying to do.
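To illustrate the representation I mean: the padded id sequences can be turned into count vectors over the vocabulary. A sketch (the helper name and `vocab_size` of 8 are assumptions based on your example):

```python
import torch

def to_bow_counts(targets, vocab_size, pad_idx=-1):
    # targets: (batch, max_len) padded token ids -> (batch, vocab) count vectors
    batch = targets.size(0)
    counts = torch.zeros(batch, vocab_size)
    mask = targets != pad_idx
    safe = targets.clamp(min=0)          # padded ids point at index 0 but add 0.0
    counts.scatter_add_(1, safe, mask.float())
    return counts

targets = torch.tensor([[0, 1, 2, 3, 4, -1],
                        [0, 5, 6, 0, 7, 4]])
counts = to_bow_counts(targets, vocab_size=8)
# second row counts id 0 ('The'/'the') twice, since it appears twice
```

With this count-vector form, the sequence/padding issue disappears and the preds and targets live in the same (batch, vocab) shape.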