Why add 0 to encoder outputs during the evaluation phase?

I am trying to understand the following line in the Translation with a Sequence to Sequence Network and Attention tutorial:
encoder_outputs[ei] = encoder_outputs[ei] + encoder_output[0][0]

encoder_outputs[ei] is all zeros at this point, so the addition should not change anything. However, the results can differ with and without the addition of encoder_outputs[ei].
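For context, here is what I expect to happen, as a minimal standalone sketch (made-up tensor shapes, not the tutorial code): adding a slice of an all-zero buffer should leave the values unchanged.

import torch

buf = torch.zeros(4, 8)        # stand-in for encoder_outputs
x = torch.randn(8)             # stand-in for encoder_output[0][0]

buf[0] = buf[0] + x            # the line in question
print(torch.equal(buf[0], x))  # True: adding zeros does not change the values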

Hi,

I am not familiar with this code, but is it always 0? Or is it accumulating some value inside it?

Thanks for replying. I don’t believe it’s accumulating.

encoder_outputs = Variable(torch.zeros(max_length, encoder.hidden_size))
encoder_outputs = encoder_outputs.cuda() if use_cuda else encoder_outputs

for ei in range(input_length):
    encoder_output, encoder_hidden = encoder(input_variable[ei],
                                             encoder_hidden)
    encoder_outputs[ei] = encoder_outputs[ei] + encoder_output[0][0]

Looking quickly at it, I would say it is just a typo, since the training code does not do it.
The only thing this could potentially change is that it forces encoder_output[0][0] to be copied in a case where the assignment might otherwise be done in place, but I don’t think that is the case here.
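To illustrate that point with a standalone sketch (made-up tensors, not the tutorial code): slice assignment copies the right-hand side into the buffer in both cases, so neither version leaves the buffer aliasing the encoder output.

import torch

buf = torch.zeros(2, 8)
x = torch.randn(8)

buf[0] = x               # without the addition
buf[1] = buf[1] + x      # with the addition

x.fill_(123.0)           # mutating x afterwards...
print(buf.max().item())  # ...does not show up in buf: both rows were copies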

Yeah, I thought it was a typo too. But, just in case, I checked the evaluation results with and without the encoder_outputs[ei] term. To my surprise, it made a difference for some inputs, which is baffling.

Could you try adding some extra prints to check where the difference comes from?

This is the evaluate function:

def evaluate(encoder, decoder, sentence, max_length=MAX_LENGTH):
    input_variable = variableFromSentence(input_lang, sentence)
    input_length = input_variable.size()[0]
    encoder_hidden = encoder.initHidden()

    encoder_outputs = Variable(torch.zeros(max_length, encoder.hidden_size))
    encoder_outputs = encoder_outputs.cuda() if use_cuda else encoder_outputs

    # Run the encoder over the input and store each step's output
    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(input_variable[ei],
                                                 encoder_hidden)
        encoder_outputs[ei] = encoder_outputs[ei] + encoder_output[0][0]

    decoder_input = Variable(torch.LongTensor([[SOS_token]]))  # SOS
    decoder_input = decoder_input.cuda() if use_cuda else decoder_input

    decoder_hidden = encoder_hidden

    decoded_words = []
    decoder_attentions = torch.zeros(max_length, max_length)

    # Decode greedily, feeding each predicted token back in
    for di in range(max_length):
        decoder_output, decoder_hidden, decoder_attention = decoder(
            decoder_input, decoder_hidden, encoder_outputs)
        decoder_attentions[di] = decoder_attention.data
        topv, topi = decoder_output.data.topk(1)
        ni = topi[0][0]
        if ni == EOS_token:
            decoded_words.append('<EOS>')
            break
        else:
            decoded_words.append(output_lang.index2word[ni])

        decoder_input = Variable(torch.LongTensor([[ni]]))
        decoder_input = decoder_input.cuda() if use_cuda else decoder_input

    return decoded_words, decoder_attentions[:di + 1]

Here are the results with encoder_outputs[ei]+encoder_output[0][0]:

je suis sobre .
= i m sober .
< i m thorough .

excuse moi .
= i m sorry .
< i m sorry .

desole .
= i m sorry .
< i m sorry about it .

desole !
= i m sorry .
< i m sorry to have so . .

And these are the results with encoder_output[0][0] only:

je suis sobre .
= i m sober .
< i m thorough .

excuse moi .
= i m sorry .
< i m sorry .

desole .
= i m sorry .
< i m sorry about it .

desole !
= i m sorry .
< i m sorry .

Even more puzzling, when I check whether the two tensors are equal, i.e.,

a = encoder_outputs[ei] + encoder_output[0][0]
b = encoder_output[0][0]
torch.equal(a, b)

As expected, they are equal for any input, so I am really baffled now.

It is possible that this is just due to non-determinism in the evaluation.
Basically, the two tensors might compare as equal under the floating point standard while actually differing in the last bit. That could lead to different outputs for the same input, especially because the subsequent layers of the network will amplify any difference in their input.
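As a toy illustration of how small such a difference can be (plain Python floats, unrelated to the tutorial itself):

a = (0.1 + 0.2) + 0.3   # one evaluation order
b = 0.1 + (0.2 + 0.3)   # another order over the same numbers
print(a == b)           # False: floating point addition is not associative
print(abs(a - b))       # ~1e-16, a difference in the last bit

A difference that small is enough to flip a topk/argmax somewhere in the decoder, so two runs on the same input can decode different words.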

That sounds like a good guess. I’ll retrain on the CPU and see if I get the same issue.

You are absolutely right. The reason the predictions differ during the evaluation phase is non-determinism. I confirmed this by evaluating the same inputs several times, and the results vary from run to run.
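For reference, this is roughly how I checked it (a sketch assuming the trained encoder and decoder from the tutorial are in scope):

sentence = "desole !"
runs = [evaluate(encoder, decoder, sentence)[0] for _ in range(10)]
if all(r == runs[0] for r in runs):
    print("deterministic for this input")
else:
    print("decoded outputs vary across runs")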
