Loss for output sequence of variable length

I’m working on a project where I want to train a model to predict a sequence of tokens. For a given input sequence, the size of the output sequence can be between 1 and 4 tokens. As the size is variable, I use an “empty” token, here index 0, as padding.

The following list shows four sequences of lengths 3, 1, 2, and 4, each padded with 0s to a final length of four.

[[5, 1, 4, 0], [2, 0, 0, 0], [4, 3, 0, 0], [2, 1, 3, 8]]

I’m using the CrossEntropyLoss where I set ignore_index=0. I’m wondering if this is correct, as there are no error signals coming from the padded regions.

I would like to know how I use the CrossEntropyLoss correctly for sequences of variable length that are padded.
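For concreteness, here is a minimal sketch of the setup described above, assuming a PyTorch model with a hypothetical vocabulary of 10 classes (the random logits just stand in for real model output):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

num_classes = 10  # hypothetical vocabulary size

# Four target sequences of lengths 3, 1, 2, 4, padded with 0:
targets = torch.tensor([[5, 1, 4, 0],
                        [2, 0, 0, 0],
                        [4, 3, 0, 0],
                        [2, 1, 3, 8]])

# Dummy logits standing in for model output: (batch, seq_len, num_classes)
logits = torch.randn(4, 4, num_classes)

# CrossEntropyLoss expects (N, C) logits and (N,) targets, so flatten:
loss_fn = nn.CrossEntropyLoss(ignore_index=0)
loss = loss_fn(logits.reshape(-1, num_classes), targets.reshape(-1))

# Padded positions contribute nothing: the same value results from
# computing the loss over only the non-padded positions.
mask = targets.reshape(-1) != 0
loss_unpadded = nn.CrossEntropyLoss()(
    logits.reshape(-1, num_classes)[mask],
    targets.reshape(-1)[mask])
```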

Hi kfshr!

Use -100 as your “empty” token. This is CrossEntropyLoss's default
for its ignore_index constructor argument. (If you prefer a different value
for some reason, you can set ignore_index explicitly.)

Items in your target (your ground-truth labels) that have a value of -100
will be ignored in the loss computation.
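As a small illustration (dummy logits, hypothetical 10-class vocabulary), positions labeled -100 drop out of the loss without passing ignore_index at all:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 10)          # 4 positions, 10 classes

# A sequence of true length 1, padded with -100:
padded = torch.tensor([2, -100, -100, -100])

loss_fn = nn.CrossEntropyLoss()      # ignore_index defaults to -100
loss = loss_fn(logits, padded)

# Only the first position counts, so this equals the loss on it alone:
loss_first = loss_fn(logits[:1], padded[:1])
```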


K. Frank

Why is -100 the default “empty” token? Why not -1 or some other value? Is there some reason? (I’m just curious).

I’m wondering whether your answer is a “yes” to my question. I’m aware that I can use ignore_index to ignore items in the loss computation. I’m just wondering whether this is the right approach for problems with variable output length? 🙂

Hi kfshr!

Sorry, I misread your original post.

Just speculation: -100 looks sufficiently out of place as an integer
categorical class label that one would likely suspect that it is some
sort of sentinel value. You wouldn’t want to use, say, 17 or 99 or even
999 as a sentinel value as these could be legitimate class labels for
plausible use cases. -1 would not typically look like a class label (and
would not be valid for PyTorch’s CrossEntropyLoss), but you could
imagine a class of bugs that might produce -1 by accident, and if -1
were the default for ignore_index, such an error would not be caught.

(Your choice of ignore_index = 0 is acceptable and logically valid.
I would, however, recommend against it, both for robustness (see
above) and for stylistic reasons (e.g., sticking with the default).)

I think that the answer to your question could well be “no” (but it depends
on your specific use case).

In your original post you say “For a given input sequence, the size of the
output sequence can be between 1 and 4 tokens.”

This makes it sound like “for a given input sequence” your network needs
to “learn” what length its predicted output sequence should be.

If this is your use case, I would use 0 as a valid class label (that is, not
the ignore_index value).

Let’s say that your ground-truth label for the output sequence is
[2, 0, 0, 0], meaning that the correct sequence is of length one and
the first and only token is 2. A prediction of [2, 0, 0, 0] would be spot
on, while a prediction of [2, 1, 3, 8] would be incorrect because it
predicts a sequence length of four rather than one.

If you use 0 as a valid class label that means “placeholder for short
sequence,” such a prediction will be penalized (which you most likely
want). If, instead, 0 is used as the ignore_index value, the loss values
for the prediction [2, 0, 0, 0] and the prediction [2, 1, 3, 8] will be
the same (because the last three predicted sequence tokens are ignored),
which is most likely not what you want, because the incorrect prediction for
the sequence length made by [2, 1, 3, 8] will not be penalized.
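To make the difference concrete, here is a small sketch comparing the two choices. The peaked dummy logits (an assumption for illustration, with a hypothetical 10-class vocabulary) stand in for a model that confidently predicts each listed token:

```python
import torch
import torch.nn as nn

def peaked_logits(tokens, num_classes=10):
    # Logits strongly peaked at the given tokens, so argmax == tokens.
    out = torch.zeros(len(tokens), num_classes)
    for i, t in enumerate(tokens):
        out[i, t] = 10.0
    return out

target = torch.tensor([2, 0, 0, 0])     # true sequence: length 1, token 2
right = peaked_logits([2, 0, 0, 0])     # predicts the correct length
wrong = peaked_logits([2, 1, 3, 8])     # incorrectly predicts length 4

with_ignore = nn.CrossEntropyLoss(ignore_index=0)
without_ignore = nn.CrossEntropyLoss()  # 0 is a real "placeholder" class

# With ignore_index=0 the padded positions carry no signal, so both
# predictions receive the same loss:
print(with_ignore(right, target).item(), with_ignore(wrong, target).item())

# With 0 as a valid class, the wrong-length prediction is penalized:
print(without_ignore(right, target).item(), without_ignore(wrong, target).item())
```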

If this doesn’t make sense for your use case, could you give a little more
detail about it, in particular whether and how your loss function should
distinguish between correct and incorrect sequence-length predictions?


K. Frank

This makes perfect sense. Using ignore_index doesn’t make sense after reading your thoughts on this topic. Thank you!