Standard approach for sharing embedding matrix across input/output?

I want to share a single matrix variable across the input and output embeddings, i.e. as in “Using the Output Embedding to Improve Language Models” by Press and Wolf.

It seems like a clean-ish way to do this would be something like:

W = autograd.Variable(torch.rand(dim1, dim2), requires_grad=True)
input_embedding = nn.functional.embedding(input_ids, W)
output_scores = nn.functional.linear(hidden, W)

However, it looks like embedding is not in torch.nn.functional? So I could create a standard, non-functional nn.Embedding, grab the weight variable from it, and re-use that in a simple matrix multiplication for the output, but that seems a little ‘hacky’ somehow. So I’m wondering: what are the cleanest option(s) for this?
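
For concreteness, here is a minimal sketch of that ‘hacky’ route; the names vocab_size, hidden_dim, input_ids and hidden are just placeholders I made up for illustration:

import torch
import torch.nn as nn
from torch.autograd import Variable

vocab_size, hidden_dim = 10000, 300
emb = nn.Embedding(vocab_size, hidden_dim)         # one module holds the shared weight

input_ids = Variable(torch.LongTensor([1, 2, 3]))  # dummy token indices
hidden = Variable(torch.rand(3, hidden_dim))       # dummy decoder states

input_vectors = emb(input_ids)                     # ordinary lookup on the input side
output_scores = hidden.mm(emb.weight.t())          # same weight, transposed, as the output projection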


Why don’t you use the module nn.Embedding?

input_embedding = nn.Embedding(dim1, dim2)
output_embedding = nn.Embedding(dim2, dim1)
output_embedding.weight.data = input_embedding.weight.data.transpose(0, 1)

I think that way both the input and output embeddings share the same storage: if you modify input_embedding, output_embedding will change as well.
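
A quick sanity check of that storage sharing (a minimal sketch; the sizes are made up):

import torch
import torch.nn as nn

dim1, dim2 = 5, 3
input_embedding = nn.Embedding(dim1, dim2)
output_embedding = nn.Embedding(dim2, dim1)
output_embedding.weight.data = input_embedding.weight.data.transpose(0, 1)

input_embedding.weight.data.zero_()   # modify the input weight in place...
print(output_embedding.weight.data)   # ...and the output weight is all zeros too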


Ah, good idea. Thanks! 🙂

Hello, I wanted to know: is sharing the input embedding with the output in this way equivalent to predicting the output whose embedding is the nearest neighbor of the input embedding?