In the PyTorch Seq2Seq tutorial, the output of the decoder’s embedding layer is passed through a ReLU before being fed to the GRU, but the encoder has no such ReLU layer.
Why is that so?
What is the purpose of the ReLU after the decoder’s embedding? Why doesn’t the encoder’s embedding need a ReLU, or any other non-linearity, before being fed to the GRU?
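For reference, here is roughly how the two forward passes differ (a paraphrased sketch, not the exact tutorial code): the decoder applies `F.relu` between the embedding lookup and the GRU, while the encoder feeds the embedding straight in.

```python
import torch.nn as nn
import torch.nn.functional as F

class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, 1, -1)
        # Encoder: embedding output goes straight into the GRU, no ReLU.
        output, hidden = self.gru(embedded, hidden)
        return output, hidden

class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size):
        super().__init__()
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, input, hidden):
        output = self.embedding(input).view(1, 1, -1)
        # Decoder: the embedding output is passed through a ReLU first.
        output = F.relu(output)
        output, hidden = self.gru(output, hidden)
        output = F.log_softmax(self.out(output[0]), dim=1)
        return output, hidden
```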
I see it now, thanks! I’m not sure of the reason for the ReLU. It probably shouldn’t be there: any embedding entry that is initialized to <= 0 gets zeroed by the ReLU, so it receives no gradient through that path and won’t be trained.
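A quick way to see the dead-gradient effect (a minimal sketch; the embedding size and index are arbitrary):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
emb = nn.Embedding(10, 4)
idx = torch.tensor([3])

# Pass the embedding through a ReLU, as the decoder does, and backprop.
F.relu(emb(idx)).sum().backward()

# Entries of row 3 that were <= 0 before the ReLU get zero gradient,
# so they receive no update through this path.
print(emb.weight[3])       # raw embedding values
print(emb.weight.grad[3])  # 1.0 where the value was > 0, 0.0 otherwise
```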