Hi. I’m currently working on a personal reimplementation of the Transformer paper and had a question.
On page 5 in section “3.4 Embeddings and Softmax,” it states:
In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation.
I’ve implemented my model to use a single embedding layer for both the source and target tensors, but I’m wondering whether there is a way to reuse that embedding layer’s weights as the pre-softmax linear layer as well. What I’ve currently done is something like:
```python
output = previous_layer(previous_input)
final_output = torch.matmul(output, embedding_layer.embedding.weight.transpose(1, 0))
```
I’ve transposed the weight matrix before the matrix multiplication because it’s of shape (vocab_size, embedding_dim), while the decoder output is of shape (batch_size, seq_len, embedding_dim). Is this the proper way to use an embedding layer as a linear layer? If not, I’d like some tips on what I should be doing.
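For reference, here is a minimal, self-contained sketch of what I believe the two equivalent formulations look like (the sizes, tensor names, and `nn.Embedding` usage here are my own illustration, not code from the paper): multiplying by the transposed embedding weight, versus passing the weight directly to `F.linear`, which already expects a weight of shape (out_features, in_features) and so needs no transpose.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embedding_dim = 1000, 512  # hypothetical sizes for illustration
embedding = nn.Embedding(vocab_size, embedding_dim)  # weight: (vocab_size, embedding_dim)

# Stand-in for the decoder output: (batch_size, seq_len, embedding_dim)
decoder_output = torch.randn(2, 7, embedding_dim)

# Option 1: explicit matmul with the transposed embedding weight
# (batch, seq, emb) @ (emb, vocab) -> (batch, seq, vocab)
logits_a = torch.matmul(decoder_output, embedding.weight.transpose(0, 1))

# Option 2: F.linear applies x @ W.T internally, so the (vocab, emb)
# embedding weight can be passed as-is
logits_b = F.linear(decoder_output, embedding.weight)

print(logits_a.shape)  # logits over the vocabulary: (2, 7, 1000)
```

If I understand correctly, both produce the same logits, so the transpose-then-matmul approach and `F.linear` should be interchangeable; tying the weights this way also means gradients from the output projection flow back into the shared embedding.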