Why add 0 to encoder outputs during the evaluation phase?

I am trying to understand the following line in the Translation with a Sequence to Sequence Network and Attention tutorial:
encoder_outputs[ei] = encoder_outputs[ei] + encoder_output[0][0]

encoder_outputs[ei] is all zeros at this point, so the addition should not change anything. However, the results can differ with and without the addition of encoder_outputs[ei].
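For context, here is what I expect to happen, as a minimal standalone sketch (made-up tensor shapes, not the tutorial code): adding a slice of an all-zero buffer should leave the values unchanged.

import torch

buf = torch.zeros(4, 8)        # stand-in for encoder_outputs
x = torch.randn(8)             # stand-in for encoder_output[0][0]

buf[0] = buf[0] + x            # the line in question
print(torch.equal(buf[0], x))  # True: adding zeros does not change the values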

Hi,

I am not familiar with this code, but is it always 0? Or is it accumulating some value inside it?

Thanks for replying. I don’t believe it’s accumulating.

encoder_outputs = Variable(torch.zeros(max_length, encoder.hidden_size))
encoder_outputs = encoder_outputs.cuda() if use_cuda else encoder_outputs

for ei in range(input_length):
    encoder_output, encoder_hidden = encoder(input_variable[ei],
                                             encoder_hidden)
    encoder_outputs[ei] = encoder_outputs[ei] + encoder_output[0][0]

Looking quickly at it, I would say it is just a typo, since the training code does not do it.
The only thing this could potentially change is that it forces encoder_output[0][0] to be copied in a case where the assignment might otherwise be done in place, but I don’t think that is the case here.
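To illustrate that point with a standalone sketch (made-up tensors, not the tutorial code): slice assignment copies the right-hand side into the buffer in both cases, so neither version leaves the buffer aliasing the encoder output.

import torch

buf = torch.zeros(2, 8)
x = torch.randn(8)

buf[0] = x               # without the addition
buf[1] = buf[1] + x      # with the addition

x.fill_(123.0)           # mutating x afterwards...
print(buf.max().item())  # ...does not show up in buf: both rows were copies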

Yeah, I thought it was a typo too. But, just in case, I checked the evaluation results with and without the encoder_outputs[ei] term. To my surprise, it made a difference for some inputs, which is baffling.

Could you try adding some extra prints to check where the difference comes from?

This is the evaluate function:

def evaluate(encoder, decoder, sentence, max_length=MAX_LENGTH):
    input_variable = variableFromSentence(input_lang, sentence)
    input_length = input_variable.size()[0]
    encoder_hidden = encoder.initHidden()

    encoder_outputs = Variable(torch.zeros(max_length, encoder.hidden_size))
    encoder_outputs = encoder_outputs.cuda() if use_cuda else encoder_outputs

    # Run the encoder over the input and store each step's output
    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(input_variable[ei],
                                                 encoder_hidden)
        encoder_outputs[ei] = encoder_outputs[ei] + encoder_output[0][0]

    decoder_input = Variable(torch.LongTensor([[SOS_token]]))  # SOS
    decoder_input = decoder_input.cuda() if use_cuda else decoder_input

    decoder_hidden = encoder_hidden

    decoded_words = []
    decoder_attentions = torch.zeros(max_length, max_length)

    # Decode greedily, feeding each predicted token back in
    for di in range(max_length):
        decoder_output, decoder_hidden, decoder_attention = decoder(
            decoder_input, decoder_hidden, encoder_outputs)
        decoder_attentions[di] = decoder_attention.data
        topv, topi = decoder_output.data.topk(1)
        ni = topi[0][0]
        if ni == EOS_token:
            decoded_words.append('<EOS>')
            break
        else:
            decoded_words.append(output_lang.index2word[ni])

        decoder_input = Variable(torch.LongTensor([[ni]]))
        decoder_input = decoder_input.cuda() if use_cuda else decoder_input

    return decoded_words, decoder_attentions[:di + 1]

Here are the results with encoder_outputs[ei]+encoder_output[0][0]:

je suis sobre .
= i m sober .
< i m thorough .

excuse moi .
= i m sorry .
< i m sorry .

desole .
= i m sorry .
< i m sorry about it .

desole !
= i m sorry .
< i m sorry to have so . .

And these are the results with encoder_output[0][0] only:

je suis sobre .
= i m sober .
< i m thorough .

excuse moi .
= i m sorry .
< i m sorry .

desole .
= i m sorry .
< i m sorry about it .

desole !
= i m sorry .
< i m sorry .

Even more puzzling, when I check whether the two tensors are equal, i.e.,

a = encoder_outputs[ei] + encoder_output[0][0]
b = encoder_output[0][0]
torch.equal(a, b)

As expected, they are equal for any input, so I am really baffled now.

It is possible that this is just due to non-determinism in the evaluation.
Basically, the two tensors might compare as equal under the floating point standard while actually differing in the last bit. That could lead to different outputs for the same input, especially because the subsequent layers of the network will amplify any difference in their input.
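As a toy illustration of how small such a difference can be (plain Python floats, unrelated to the tutorial itself):

a = (0.1 + 0.2) + 0.3   # one evaluation order
b = 0.1 + (0.2 + 0.3)   # another order over the same numbers
print(a == b)           # False: floating point addition is not associative
print(abs(a - b))       # ~1e-16, a difference in the last bit

A difference that small is enough to flip a topk/argmax somewhere in the decoder, so two runs on the same input can decode different words.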

That sounds like a good guess. I’ll retrain on the CPU and see if I get the same issue.

You are absolutely right. The reason the predictions differ during the evaluation phase is non-determinism. I confirmed this by evaluating the same inputs several times, and the results vary from run to run.
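For reference, this is roughly how I checked it (a sketch assuming the trained encoder and decoder from the tutorial are in scope):

sentence = "desole !"
runs = [evaluate(encoder, decoder, sentence)[0] for _ in range(10)]
if all(r == runs[0] for r in runs):
    print("deterministic for this input")
else:
    print("decoded outputs vary across runs")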
