Vec2word, or something similar?

Hi,

how do you train a vec2word model, i.e. something like a reverse nn.Embedding, which goes from a vector representation to single words / a one-hot representation?

So if I understand correctly, a cluster of points in embedding space represents similar words. Thus if you sample from that cluster and feed the sample into vec2word, the output should be a mapping to similar words?

I guess I could do something similar to an encoder-decoder, but does it have to be that complicated/use so many parameters?

I think I understand how to train word2vec using this TensorFlow tutorial, but how do you do the reverse in PyTorch?

Thanks a lot,

Ajay


Hey @AjayTalati,

there is no reverse of the embedding per se.

The famous king - man + woman demo, for example, just outputs the nearest vectors according to some distance function (possibly after scaling to unit norm, as in “cosine similarity”).

So if your embedding weights are unit-length vectors, you can do something similar to gensim’s most_similar: compute the analogy word vector and then the distances to all embedding rows. Note that the matrix-vector multiplication below is a “batch dot product”.

import torch

# have an embedding
t = torch.nn.Embedding(100,10)

# normalize rows (each row is one word vector)
normalized_embedding = t.weight / (t.weight**2).sum(1, keepdim=True)**0.5

# make up some vector
v = 1*t.weight[1] + 0.1*t.weight[5]
# normalize it to unit length
v = v / (v**2).sum()**0.5

# similarity score (-1..1) and vocabulary indices
similarity, words = torch.topk(torch.mv(normalized_embedding,v),5)
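
As a quick usage example, the analogy lookup itself could then look like this (the vocabulary indices for “king”, “man” and “woman” are made up here, in practice they come from your vocabulary):

# hypothetical vocabulary indices
king, man, woman = 3, 17, 42

# king - man + woman, renormalized to unit length
q = normalized_embedding[king] - normalized_embedding[man] + normalized_embedding[woman]
q = q / (q**2).sum()**0.5

# the top hit is often one of the query words themselves, so take a few
similarity, words = torch.topk(torch.mv(normalized_embedding, q), 4)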

(Maximal dot product is minimal Euclidean distance for unit vectors, since $|x-y|^2 = |x|^2 - 2\, x \cdot y + |y|^2$.)

For non-unit vectors you would need to compute the distances, e.g. by storing the squared lengths of the word vectors and computing $\operatorname{argmin}_i |x(i)|^2 - 2\, x(i) \cdot y$.
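
A minimal sketch of that non-unit-vector lookup, reusing t and v from above and dropping the constant $|y|^2$ term (it is the same for every word), could be:

# squared lengths of all word vectors, shape (100,)
squared_lengths = (t.weight**2).sum(1)

# |x(i)|^2 - 2 x(i).y, i.e. the squared distance up to the constant |y|^2
scores = squared_lengths - 2*torch.mv(t.weight, v)

# the smallest scores correspond to the nearest word vectors
distances, words = torch.topk(scores, 5, largest=False)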

Of course, if you scroll down the linked gensim file, you’ll find lots of better ideas, with references. :slight_smile:

That said, you might also just skip the word vectors in the output. In OpenNMT’s train.py L293-295, the following single linear layer + softmax transforms the decoder output (not a “word vector”, but some learnt hidden representation) into log-probabilities over the target vocabulary:

generator = nn.Sequential(
        nn.Linear(opt.rnn_size, dicts['tgt'].size()),
        nn.LogSoftmax())
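
A rough usage sketch, with made-up sizes standing in for opt.rnn_size and dicts['tgt'].size() (those come from the OpenNMT options and data), might look like this:

import torch
import torch.nn as nn

rnn_size, vocab_size = 500, 10000  # hypothetical sizes

generator = nn.Sequential(
    nn.Linear(rnn_size, vocab_size),
    nn.LogSoftmax(dim=-1))

# a batch of decoder hidden states -> log-probabilities over the target vocabulary
decoder_output = torch.randn(8, rnn_size)
log_probs = generator(decoder_output)
predicted_word_idx = log_probs.argmax(dim=-1)  # shape (8,)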

Best regards

Thomas


Hey Thomas @tom ,

great to hear from you, and awesome advice as usual :smile:

Following your guide, I think we can use a word embedding together with the Wasserstein divergence as a criterion?

Here’s a rough sketch of how it might work:

import torch
import torch.nn as nn
from torch.autograd import Variable
from random import randint

num_words = 10
embedding_dim = 5

# have an embedding  
# can initialize to -1 or +1
# or copy pretrained weights, see- https://discuss.pytorch.org/t/can-we-use-pre-trained-word-embeddings-for-weight-initialization-in-nn-embedding/1222
t = torch.nn.Embedding(num_words,embedding_dim)

# map from embedding to probability space
vec_to_prob = nn.Softmax(dim=-1)

# sample a random word to train on 
word_idx = randint(0,num_words-1)

# a batch of 1 sample of 1 index/word
word_idx = Variable(torch.LongTensor([[word_idx]]))

# vector representation
word_vec = t(word_idx)
word_vec = word_vec.squeeze(0) # drop the batch dimension

# sanity check !!! 
_ , closest_word_idx = torch.topk( torch.mv( t.weight , word_vec.squeeze(0) ) , 1 )
closest_word_idx == word_idx #true

# map to probability space;
# this histogram could be compared, via the Wasserstein divergence, to a histogram produced by a decoder
histogram_target = vec_to_prob(word_vec)
# histogram_model = ...  (output of the decoder, omitted here)
# wasserstein_loss = divergence(histogram_target, histogram_model)

# after training, histogram from decoder should be "close" to target histogram in Wasserstein space
histogram_model = histogram_target

_ , closest_word_idx = torch.topk( torch.mv( t.weight , histogram_model.squeeze(0) ) , 1 )
closest_word_idx == word_idx #true 
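
For the divergence itself, one concrete possibility (just a sketch, and it assumes the embedding dimensions can be treated as equally spaced bins on a line, which is a big assumption) would be the closed-form 1-D Wasserstein-1 distance between the two histograms:

def wasserstein_1d(p, q):
    # W1 between two histograms over the same ordered, equally spaced support:
    # the sum of absolute differences of their cumulative distributions
    return (torch.cumsum(p, dim=-1) - torch.cumsum(q, dim=-1)).abs().sum(dim=-1)

# during training this would be called with the decoder's histogram
wasserstein_loss = wasserstein_1d(histogram_target, histogram_model)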

It seems reasonable and simple, so I’d like to try it :smile: I would really appreciate your opinion!

Best regards,

Ajay