nn.Embedding and one-hot nn.Linear produce different results

I have a seq2seq task and a model that starts with an embedding layer to process the tokenized input. I noticed that performance is not great when using nn.Embedding, but it is when using nn.Linear instead.
Shouldn’t this:

self.embedding_layer = nn.Embedding(self.vocab_size, 512)
embedding_out = self.embedding_layer(input)

Be the same as this?:

self.embedding_layer = nn.Sequential(
   nn.Linear(vocab_size, 256),
   nn.Linear(256, 512)
)
embedding_out = self.embedding_layer(F.one_hot(input, num_classes=vocab_size).float())

Yet, the latter produces much better results. Does anyone have any insights into why this is? I would much prefer to use the Embedding layer as it is more readable, but I cannot figure out why it does not perform well.

For a fair comparison, shouldn’t it look like this:

self.embedding_layer = nn.Sequential(
   nn.Linear(vocab_size, 512),
)
embedding_out = self.embedding_layer(F.one_hot(input, num_classes=vocab_size).float())

or simply

self.embedding_layer = nn.Linear(vocab_size, 512)

embedding_out = self.embedding_layer(F.one_hot(input, num_classes=vocab_size).float())

nn.Embedding is essentially just a linear layer applied to one-hot inputs, but your current “replacement” contains two Linear layers stacked on top of each other.
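
For illustration, here is a minimal sketch (with made-up sizes, not your actual model) showing that an nn.Embedding lookup and a bias-free nn.Linear applied to one-hot input compute the same thing once the weights are tied:

import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, emb_dim = 1000, 512
tokens = torch.randint(0, vocab_size, (4, 20))  # (batch, seq_len) of token indices

embedding = nn.Embedding(vocab_size, emb_dim)
linear = nn.Linear(vocab_size, emb_dim, bias=False)
with torch.no_grad():
    linear.weight.copy_(embedding.weight.t())  # tie the weights: (emb_dim, vocab_size)

out_embedding = embedding(tokens)
out_linear = linear(F.one_hot(tokens, num_classes=vocab_size).float())
print(torch.allclose(out_embedding, out_linear))  # True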

Yes, you are right, it is not a fair comparison. Yet, I do not understand how one can produce good results while the other does not. There should not be a significant difference between them.
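
Just to illustrate that point, a minimal sketch (again with made-up sizes): two stacked nn.Linear layers with no nonlinearity in between collapse into a single linear layer, since W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2), so the two-layer version only differs in parameterization (and a rank limit of 256):

import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size = 1000
stack = nn.Sequential(nn.Linear(vocab_size, 256), nn.Linear(256, 512))

# Build the single linear layer equivalent to the two stacked ones.
collapsed = nn.Linear(vocab_size, 512)
with torch.no_grad():
    collapsed.weight.copy_(stack[1].weight @ stack[0].weight)              # (512, vocab_size)
    collapsed.bias.copy_(stack[1].weight @ stack[0].bias + stack[1].bias)  # (512,)

x = F.one_hot(torch.randint(0, vocab_size, (4, 20)), num_classes=vocab_size).float()
print(torch.allclose(stack(x), collapsed(x), atol=1e-6))  # True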