Compute the validation/testing loss of seq2seq

Hi, I am working on a seq2seq problem, and I am confused about the validation/testing loss.

In seq2seq, during training, the decoder takes the current target token as input to generate the next token, and I compute the loss between the target sequence and my output sequence. During validation, should I instead feed the previously generated token back in to predict the next token? If so, the loss is much higher than the training loss and does not decrease. The reason is that the generated tokens used as inputs are not very accurate, and I am not using beam search here. Should I use beam search during validation? Or should I keep the validation phase in the same setting as the training phase (apart from things like dropout)?

Thanks! Any response would be appreciated.

The conditions during validation should mimic test conditions as closely as possible, so it depends on what you are trying to do. Typically in a seq2seq problem your model has to generate the entire output, so it is better to use the generated token to predict the next token. In some cases you may just want to check whether your model is actually learning something meaningful (for example, if its validation performance is low); in that case you can provide the target token at every time step (this is called teacher forcing) and measure performance that way. Using the loss (assuming you mean the negative log likelihood loss) makes sense when you are using teacher forcing, but it is not the best way to measure your model's performance without teacher forcing, because the inputs at each time step may not match the target tokens. Instead, during validation without teacher forcing you should use your "metric of interest", such as BLEU (or METEOR) if you are doing machine translation, or ROUGE if you are doing summarization.
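To make the two modes concrete, here is a minimal PyTorch sketch of both: a teacher-forced validation loss and free-running greedy decoding whose outputs you would score with BLEU/ROUGE. The `model.encode` and `model.decode_step` methods are hypothetical placeholders for your own encoder/decoder interface, and batch-first `(batch, seq_len)` tensors are assumed.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def teacher_forced_loss(model, src, tgt, pad_idx):
    """NLL loss with the ground-truth token fed at every decoder step."""
    enc_out, hidden = model.encode(src)
    loss, n_tokens = 0.0, 0
    for t in range(tgt.size(1) - 1):
        logits, hidden = model.decode_step(tgt[:, t], hidden, enc_out)
        loss += F.cross_entropy(logits, tgt[:, t + 1],
                                ignore_index=pad_idx, reduction="sum")
        n_tokens += (tgt[:, t + 1] != pad_idx).sum()
    return loss / n_tokens

@torch.no_grad()
def greedy_decode(model, src, bos_idx, max_len=50):
    """Free-running decoding: feed the model's own prediction back in."""
    enc_out, hidden = model.encode(src)
    token = torch.full((src.size(0),), bos_idx,
                       dtype=torch.long, device=src.device)
    outputs = []
    for _ in range(max_len):
        logits, hidden = model.decode_step(token, hidden, enc_out)
        token = logits.argmax(dim=-1)      # greedy; beam search is optional
        outputs.append(token)
    # Compare these hypotheses to the references with BLEU/ROUGE.
    return torch.stack(outputs, dim=1)
```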

Really appreciate your response. If I understand correctly, you suggest two options: 1. use teacher forcing during validation, or 2. compute the "metric of interest" during validation without teacher forcing. I also wonder whether I should control the teacher forcing ratio (at some timesteps use teacher forcing, at others use the generated word)? In that case the model might generalize better. Also, how should I handle dropout/batch normalization in the validation phase, use them or not? Thanks!

Yes, that is correct. Controlling the teacher forcing ratio - also called scheduled sampling - should help when applied appropriately. There is a paper about it here https://arxiv.org/abs/1506.03099 that shows the improvements the authors obtain with it. Keep in mind that scheduled sampling is used during training. Dropout and batch normalization during validation are handled in the same way as during testing.
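As a rough illustration (not the paper's exact recipe), scheduled sampling in a training step can look like the sketch below, again using the same hypothetical `model.decode_step` interface and a `tf_ratio` that you would typically decay over training. The second function shows the usual way to handle dropout and batch norm at validation time in PyTorch: switch to eval mode so dropout is disabled and batch norm uses its running statistics.

```python
import random
import torch

def decode_with_scheduled_sampling(model, tgt, hidden, enc_out, tf_ratio):
    """One decoder pass where each step flips a coin: with probability
    tf_ratio feed the ground-truth token, otherwise feed the model's
    own previous prediction."""
    token = tgt[:, 0]                                   # <bos> column
    step_logits = []
    for t in range(1, tgt.size(1)):
        logits, hidden = model.decode_step(token, hidden, enc_out)
        step_logits.append(logits)
        token = tgt[:, t] if random.random() < tf_ratio else logits.argmax(-1)
    return torch.stack(step_logits, dim=1)              # (batch, len-1, vocab)

def run_validation(model, compute_metrics):
    """Validation/testing mode: no dropout, batch norm uses running stats."""
    model.eval()
    with torch.no_grad():
        results = compute_metrics(model)
    model.train()                                       # restore training mode
    return results
```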


Thanks!! You really helped me a lot.