Why is there a relu non-linearity only on the decoder embedding in PyTorch Seq2Seq tutorial?

In the PyTorch Seq2Seq tutorial, the decoder's embedding output is passed through a relu before being fed to the GRU, but the encoder has no such relu.

Why is that so?

What is the purpose of the relu layer in the decoder after the embedding? Why didn’t the encoder’s embedding need a relu or any other non-linearity before feeding it to the GRU?

I don’t see a relu between embedding and GRU. Maybe you can point me to the exact line you are looking at?

Choosing among different non-linearities is an interesting topic. Relu is a popular choice because it is cheap to compute and tends to train well in practice, since it avoids the gradient saturation you get with sigmoid or tanh.

It’s the line `output = F.relu(output)` in http://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html
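For context, here is a minimal sketch of where that line sits in the decoder's forward pass. This paraphrases the tutorial's decoder from memory, so details (layer sizes, exact method names) may differ from the current tutorial code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderRNN(nn.Module):
    # Sketch of the tutorial-style decoder, not the tutorial's exact code.
    def __init__(self, hidden_size, output_size):
        super().__init__()
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, input, hidden):
        # Embed the previous token; shape becomes (seq_len=1, batch=1, hidden_size).
        output = self.embedding(input).view(1, 1, -1)
        output = F.relu(output)  # <- the line in question: relu on the embedding
        output, hidden = self.gru(output, hidden)
        output = F.log_softmax(self.out(output[0]), dim=1)
        return output, hidden
```

The encoder in the tutorial has the same embedding-then-GRU structure but skips the `F.relu` step.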

I see it now, thanks! I’m not sure of the reason for the relu. It probably shouldn’t be there: relu zeroes any embedding component that is <= 0, and it passes zero gradient back through those components, so embedding entries that start out non-positive may never get trained through this path.
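The "won't be trained" point comes down to the shape of relu and its derivative. A tiny plain-Python sketch (hypothetical helper names, just for illustration):

```python
# ReLU outputs 0 for any input <= 0, and its gradient there is also 0,
# so an embedding component stuck in the negative region receives no
# learning signal through the relu on this path.

def relu(x):
    return x if x > 0 else 0.0

def relu_grad(x):
    # Derivative of ReLU w.r.t. its input (taking 0 at x == 0).
    return 1.0 if x > 0 else 0.0

e = -0.3                  # an embedding component initialized negative
print(relu(e))            # 0.0 -> the GRU sees zero
print(relu_grad(e))       # 0.0 -> no gradient flows back to the embedding
print(relu(0.5))          # 0.5 -> positive components pass through unchanged
```

This is only about the gradient flowing through the relu itself; the component could still escape the dead region if it gets gradient from elsewhere, but through this path alone it stays frozen.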
