How do you train a vec2word model, i.e. something like a reverse nn.Embedding that goes from a vector representation to single words / a one-hot representation?
So if I understand correctly, a cluster of points in embedding space represents similar words. Thus if you sample from that cluster and use the sample as the input to vec2word, the output should be a mapping to similar words?
I guess I could do something similar to an encoder-decoder, but does it have to be that complicated/use so many parameters?
I think I understand how to train word2vec using this TensorFlow tutorial, but how do you do the reverse in PyTorch?
The famous king − man + woman demo then outputs the nearest vectors according to some distance function (possibly after scaling to unit norm, as in “cosine similarity”).
So if your embedding weights are unit-length vectors, you can do the following (similar to gensim’s most_similar: calculate the analogy word vector, then compute the distances). Note that the matrix-vector multiplication is a “batch dot product”.
import torch

# have an embedding (100 words, 10 dimensions)
t = torch.nn.Embedding(100, 10)
# normalize each row (word vector) to unit length
normalized_embedding = t.weight / (t.weight**2).sum(1, keepdim=True)**0.5
# make up some vector
v = 1 * t.weight[1] + 0.1 * t.weight[5]
# normalize it to unit length as well
v = v / (v**2).sum()**0.5
# similarity scores (-1..1) and vocabulary indices of the 5 nearest words
similarity, words = torch.topk(torch.mv(normalized_embedding, v), 5)
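As a concrete instance, the king − man + woman lookup is the same topk call, just on the analogy vector; a minimal sketch assuming you have a word-to-index mapping (the indices below are made up):

# hypothetical vocabulary indices for "king", "man", "woman"
king, man, woman = 1, 2, 3
v = normalized_embedding[king] - normalized_embedding[man] + normalized_embedding[woman]
v = v / (v**2).sum()**0.5
similarity, words = torch.topk(torch.mv(normalized_embedding, v), 5)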
(Maximal dot product is minimal Euclidean distance for unit vectors, per $|x-y|^2 = |x|^2 - 2 x \cdot y + |y|^2$.)
For non-unit vectors you would need to compute the distances, e.g. by storing the squared lengths of the word vectors to compute $\mathrm{argmin}_i\, |x(i)|^2 - 2 x(i) \cdot y$.
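A minimal sketch of that non-unit case, reusing t and v from the snippet above (the $|y|^2$ term is constant in $i$ and can be dropped):

# squared lengths |x(i)|^2 of all word vectors (rows of the embedding)
squared_norms = (t.weight**2).sum(1)
# |x(i)|^2 - 2 x(i).y, i.e. squared Euclidean distance up to the constant |y|^2
scores = squared_norms - 2 * torch.mv(t.weight, v)
# smallest scores = nearest words, so take topk of the negated scores
neg_dist, words = torch.topk(-scores, 5)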
Of course, if you scroll down the linked gensim file, you’ll find lots of better ideas, with references.
That said, you might also just skip word vectors in the output. In OpenNMT’s train.py L293-295, a single linear layer + softmax transforms the output of the decoder (not a “word vector”, but some learnt hidden representation) into a distribution over the vocabulary:
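Roughly, the pattern there looks like this (a paraphrase, not the verbatim OpenNMT code; rnn_size and vocab_size stand in for the model’s actual options):

rnn_size, vocab_size = 500, 50000  # hypothetical sizes
generator = torch.nn.Sequential(
    torch.nn.Linear(rnn_size, vocab_size),  # decoder hidden state -> one score per vocabulary word
    torch.nn.LogSoftmax(dim=-1),            # scores -> log-probabilities over the vocabulary
)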
Great to hear from you, and awesome advice as usual!
Following your guide, I think we can use a word embedding together with the Wasserstein divergence as a criterion?
Here’s a rough sketch of how it might work:
import torch
import torch.nn as nn
from random import randint

num_words = 10
embedding_dim = 5
# have an embedding
# can initialize to -1 or +1,
# or copy pretrained weights, see https://discuss.pytorch.org/t/can-we-use-pre-trained-word-embeddings-for-weight-initialization-in-nn-embedding/1222
t = nn.Embedding(num_words, embedding_dim)
# map from embedding space to probability space
vec_to_prob = nn.Softmax(dim=1)
# sample a random word to train on
word_idx = randint(0, num_words - 1)
# a batch of 1 sample of 1 index/word
word_idx = torch.LongTensor([[word_idx]])
# vector representation, shape (1, 1, embedding_dim)
word_vec = t(word_idx)
word_vec = word_vec.squeeze(0)  # drop the batch dimension
# sanity check: the word's own vector should be its nearest neighbour
_, closest_word_idx = torch.topk(torch.mv(t.weight, word_vec.squeeze(0)), 1)
closest_word_idx.item() == word_idx.item()  # True
# map to probability space; this could be used to calculate the Wasserstein
# divergence as the training objective, against a histogram from a decoder
histogram_target = vec_to_prob(word_vec)
# histogram_model = ...  (placeholder: histogram output of some decoder)
# wasserstein_loss = divergence(histogram_target, histogram_model)
# after training, the histogram from the decoder should be "close" to the
# target histogram in Wasserstein space; pretend it matches exactly:
histogram_model = histogram_target
_, closest_word_idx = torch.topk(torch.mv(t.weight, histogram_model.squeeze(0)), 1)
closest_word_idx.item() == word_idx.item()  # True
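For the divergence placeholder, a minimal sketch of one possible choice (my assumption, not settled above): the closed-form 1-D Wasserstein-1 distance, i.e. the L1 distance between the cumulative sums of the two histograms, which treats the embedding dimensions as an ordered 1-D support:

def wasserstein_1d(p, q):
    # W1 between two histograms over the same ordered bins:
    # the L1 distance between their cumulative distribution functions
    return (torch.cumsum(p, dim=-1) - torch.cumsum(q, dim=-1)).abs().sum(dim=-1).mean()

wasserstein_loss = wasserstein_1d(histogram_model, histogram_target)  # 0 here, since they are equal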
It seems reasonable, and simple, so I’d like to try it. Would really appreciate your opinion!