The PyTorch tutorial for sequence-to-sequence networks with attention uses word embeddings instead of one-hot representations as inputs to the LSTM network.
My question is: doesn't using embeddings affect the training or performance of the model? Two words with completely different semantics may have embeddings with high cosine similarity, whereas their one-hot representations always have zero cosine similarity. How does the network deal with this?
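For concreteness, here is a minimal NumPy sketch of the contrast I mean (the random table stands in for the weights of PyTorch's `nn.Embedding`, which are randomly initialized before training; the word indices 2 and 7 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10, 4

def cosine(u, v):
    # Cosine similarity between two vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# One-hot rows for two distinct words: always orthogonal, cosine = 0.
one_hot = np.eye(vocab_size)
print(cosine(one_hot[2], one_hot[7]))  # 0.0

# Rows of a randomly initialized embedding table: the similarity between
# two unrelated words is arbitrary and can be far from zero.
embeddings = rng.normal(size=(vocab_size, embed_dim))
print(cosine(embeddings[2], embeddings[7]))
```

So before any training, two unrelated words can already look "similar" to the network in embedding space, which is what prompts the question.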