Hi. I’m currently working on a personal reimplementation of the Transformer paper and had a question.
On page 5 in section “3.4 Embeddings and Softmax,” it states:
In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation.
I’ve implemented my model to use a single embedding layer for both the source and target tensors, but I’m wondering whether there is a way to reuse that embedding layer’s weights as the pre-softmax linear layer as well. What I’ve currently done is something like:
```python
output = previous_layer(previous_input)
final_output = torch.matmul(output, embedding_layer.embedding.weight.transpose(1, 0))
```
I’ve transposed the weight matrix before the matrix multiplication because it’s of shape (vocab_size, embedding_dim), while the decoder output is of shape (batch_size, seq_len, embedding_dim). Is this the proper way to use an embedding layer as a linear layer? If not, I’d like some tips on what I should be doing.
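For reference, here is a minimal, self-contained sketch of what I believe the two equivalent formulations look like (the sizes, tensor names, and `nn.Embedding` usage here are my own illustration, not code from the paper): multiplying by the transposed embedding weight, versus passing the weight directly to `F.linear`, which already expects a weight of shape (out_features, in_features) and so needs no transpose.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embedding_dim = 1000, 512  # hypothetical sizes for illustration
embedding = nn.Embedding(vocab_size, embedding_dim)  # weight: (vocab_size, embedding_dim)

# Stand-in for the decoder output: (batch_size, seq_len, embedding_dim)
decoder_output = torch.randn(2, 7, embedding_dim)

# Option 1: explicit matmul with the transposed embedding weight
# (batch, seq, emb) @ (emb, vocab) -> (batch, seq, vocab)
logits_a = torch.matmul(decoder_output, embedding.weight.transpose(0, 1))

# Option 2: F.linear applies x @ W.T internally, so the (vocab, emb)
# embedding weight can be passed as-is
logits_b = F.linear(decoder_output, embedding.weight)

print(logits_a.shape)  # logits over the vocabulary: (2, 7, 1000)
```

If I understand correctly, both produce the same logits, so the transpose-then-matmul approach and `F.linear` should be interchangeable; tying the weights this way also means gradients from the output projection flow back into the shared embedding.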