Embedding vs. one-hot representation

The PyTorch tutorial for sequence-to-sequence networks with attention uses word embeddings instead of one-hot representations as inputs to the LSTM network.

My question is: doesn't using embeddings affect the training or performance of the model? Two words with completely different semantics may end up with embeddings that have high cosine similarity, whereas their one-hot representations always have zero cosine similarity. How does the network deal with this?
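
For concreteness, here is a minimal sketch of the situation I mean (the 5-word vocabulary and embedding size are made up):

```python
import torch
import torch.nn.functional as F

# Two different words as one-hot vectors over a toy 5-word vocabulary:
# their cosine similarity is exactly 0 because they share no nonzero entry.
hot_a = F.one_hot(torch.tensor(1), num_classes=5).float()
hot_b = F.one_hot(torch.tensor(3), num_classes=5).float()
print(F.cosine_similarity(hot_a, hot_b, dim=0))  # tensor(0.)

# The same two words looked up in a (randomly initialized) embedding table
# can land anywhere, so their cosine similarity is generally nonzero.
emb = torch.nn.Embedding(num_embeddings=5, embedding_dim=3)
vec_a = emb(torch.tensor(1))
vec_b = emb(torch.tensor(3))
print(F.cosine_similarity(vec_a, vec_b, dim=0))  # some value in (-1, 1)
```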

The embeddings are trainable, so if two dissimilar words start out with similar embeddings, the embedding layer should learn to push them apart during training.
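
A quick sketch of what "trainable" means here (toy sizes, and the loss is just a stand-in): the embedding table is an ordinary parameter, so the same backward pass that updates the LSTM also moves the embedding vectors.

```python
import torch
import torch.nn as nn

# Toy model: embedding lookup followed by an LSTM (sizes are arbitrary).
emb = nn.Embedding(num_embeddings=100, embedding_dim=16)
lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
opt = torch.optim.SGD(list(emb.parameters()) + list(lstm.parameters()), lr=0.1)

tokens = torch.randint(0, 100, (4, 7))  # batch of 4 sequences, length 7
out, _ = lstm(emb(tokens))              # embeddings flow into the LSTM
loss = out.pow(2).mean()                # stand-in loss, just to backprop
loss.backward()

# The embedding rows for the tokens we used received gradients, so an
# optimizer step moves those vectors -- dissimilar words can drift apart.
print(emb.weight.grad.abs().sum(dim=1).nonzero().numel() > 0)  # True
opt.step()
```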

Thanks. Won't it prolong training, though, since the number of learnable parameters is much larger in this case?

Maybe, but then again, better embeddings may pay off by giving the rest of the model a more useful input representation.
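
To put rough numbers on the extra parameters (the vocabulary size and embedding dimension here are purely illustrative):

```python
# Extra trainable parameters introduced by the embedding layer:
# one vector of size embedding_dim per vocabulary word.
vocab_size, embedding_dim = 10_000, 256
embedding_params = vocab_size * embedding_dim
print(embedding_params)  # 2560000 extra weights to learn
```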
