How do you train a vec2word model, i.e. something like a reverse nn.Embedding that goes from a vector representation back to single words / a one-hot representation?

So if I understand correctly, a cluster of points in embedding space represents similar words. Thus if you sample from that cluster and use the sample as the input to vec2word, the output should map to similar words?
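If vec2word is just a nearest-neighbour lookup in the embedding matrix, this is easy to sketch. A minimal example with a randomly initialized embedding (the word index and noise scale are made up):

```python
import torch

torch.manual_seed(0)
emb = torch.nn.Embedding(1000, 50)
# unit-normalize the rows so dot products are cosine similarities
W = torch.nn.functional.normalize(emb.weight.detach(), dim=1)
# sample a point near word 42's vector, i.e. from its "cluster"
v = W[42] + 0.05 * torch.randn(50)
v = v / v.norm()
# vec2word: cosine similarity against every row, take the top hits
sims, idxs = torch.topk(torch.mv(W, v), 5)
# idxs[0] is word 42 itself; the rest are its nearest neighbours
```

With small enough noise the sampled point decodes back to the word it was sampled around, and the remaining top-k entries are that word's neighbours.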

I guess I could do something similar to an encoder-decoder, but does it have to be that complicated/use so many parameters?

I think I understand how to train word2vec using this TensorFlow tutorial, but how do you do the reverse in PyTorch?

The famous king - man + woman demo then outputs the nearest vectors according to some distance function (possibly after scaling to unit norm, as in "cosine similarity").

So if your embedding weights are unit-length vectors you can do the following (similar to gensim's most_similar: calculate the analogy word vector and then compute the distances). Note that the matrix-vector multiplication is a "batch dot product".

import torch

# have an embedding
t = torch.nn.Embedding(100, 10)
# normalize the rows to unit length
normalized_embedding = t.weight / (t.weight**2).sum(1, keepdim=True)**0.5
# make up some vector
v = 1 * t.weight[1] + 0.1 * t.weight[5]
# normalize it as well
v = v / (v**2).sum()**0.5
# similarity scores (-1..1) and vocabulary indices of the 5 nearest words
similarity, words = torch.topk(torch.mv(normalized_embedding, v), 5)

(Maximal dot product is minimal Euclidean distance for unit vectors, per $|x-y|^2 = |x|^2 - 2\, x \cdot y + |y|^2$.)
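The expansion $|x-y|^2 = |x|^2 - 2\, x \cdot y + |y|^2$ behind this can be checked numerically:

```python
import torch

x = torch.randn(8)
y = torch.randn(8)
lhs = ((x - y)**2).sum()                                  # |x - y|^2
rhs = (x**2).sum() - 2 * torch.dot(x, y) + (y**2).sum()  # |x|^2 - 2 x.y + |y|^2
assert torch.allclose(lhs, rhs)
```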

For non-unit vectors you would need to compute the distances, e.g. by storing the squared lengths of the word vectors and computing $\operatorname{argmin}_i |x(i)|^2 - 2\, x(i) \cdot y$ (the $|y|^2$ term is the same for every $i$, so it can be dropped).
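A sketch of that non-unit-vector lookup (the embedding here is randomly initialized and the names are my own):

```python
import torch

emb = torch.nn.Embedding(100, 10)
W = emb.weight.detach()
# precompute the squared lengths |x(i)|^2 once
sq_norms = (W**2).sum(1)
# some query vector y
y = torch.randn(10)
# argmin_i |x(i)|^2 - 2 x(i).y ; the |y|^2 term is constant in i
scores = sq_norms - 2 * torch.mv(W, y)
nearest = scores.argmin()
```

This gives the same index as computing the full Euclidean distances, but only needs one matrix-vector product per query once the squared norms are cached.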

Of course, if you scroll down the linked gensim file, you'll find lots of better ideas, with references.

That said, you might also just skip the word vectors at the output. In OpenNMT's train.py L293-295, the following single layer + softmax transforms the output of the decoder (not a "word vector", but some learnt hidden representation) into word probabilities:
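The idea is essentially a single nn.Linear projecting the decoder's hidden state to vocabulary size, followed by a (log-)softmax. A minimal sketch with made-up sizes (not OpenNMT's actual code):

```python
import torch
import torch.nn as nn

hidden_size, vocab_size = 256, 10000
# hidden representation -> distribution over the vocabulary
generator = nn.Sequential(
    nn.Linear(hidden_size, vocab_size),
    nn.LogSoftmax(dim=-1),
)
dec_out = torch.randn(32, hidden_size)    # decoder hidden states for a batch of 32
log_probs = generator(dec_out)            # (32, vocab_size) log-probabilities
pred_word_idx = log_probs.argmax(dim=-1)  # most likely word for each position
```

Trained with a likelihood loss (e.g. NLLLoss on the log-probabilities), this layer learns the vector-to-word mapping directly, with no nearest-neighbour search needed.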

Great to hear from you, and awesome advice as usual!

Following your guide, I think we could use a word embedding together with the Wasserstein divergence as a criterion?

Here's a rough guide to how it might work:

import torch
import torch.nn as nn
from random import randint

num_words = 10
embedding_dim = 5
# have an embedding
# can initialize to -1 or +1
# or copy pretrained weights, see https://discuss.pytorch.org/t/can-we-use-pre-trained-word-embeddings-for-weight-initialization-in-nn-embedding/1222
t = nn.Embedding(num_words, embedding_dim)
# map from embedding space to probability space
vec_to_prob = nn.Softmax(dim=-1)
# sample a random word to train on
word_idx = randint(0, num_words - 1)
# a batch of 1 sample of 1 index/word
word_idx = torch.LongTensor([[word_idx]])
# vector representation
word_vec = t(word_idx)
word_vec = word_vec.squeeze(0)  # drop the batch dimension
# sanity check: the nearest embedding row should be the word itself
_, closest_word_idx = torch.topk(torch.mv(t.weight, word_vec.squeeze(0)), 1)
closest_word_idx == word_idx  # usually true (guaranteed if the rows are unit-normalized)
# map to probability space;
# this could be used to calculate the Wasserstein divergence as the training
# objective, together with a histogram from a decoder
histogram_target = vec_to_prob(word_vec)
# histogram_model = ...  (output of the decoder)
# wasserstein_loss = divergence(histogram_target, histogram_model)
# after training, the histogram from the decoder should be "close" to the
# target histogram in Wasserstein space
histogram_model = histogram_target
_, closest_word_idx = torch.topk(torch.mv(t.weight, histogram_model.squeeze(0)), 1)
closest_word_idx == word_idx  # hopefully true
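For the `divergence` placeholder above: on 1-D histograms over the same fixed, evenly spaced bins, the Wasserstein-1 distance reduces to the L1 distance between the cumulative distributions. A minimal sketch (my own helper, not a library function):

```python
import torch

def wasserstein_1d(p, q):
    # W1 between two histograms over the same 1-D grid (bin spacing 1):
    # sum of absolute differences of the cumulative distributions
    return (torch.cumsum(p, dim=-1) - torch.cumsum(q, dim=-1)).abs().sum()

p = torch.softmax(torch.randn(5), dim=-1)
q = torch.softmax(torch.randn(5), dim=-1)
loss = wasserstein_1d(p, q)
# identical histograms have zero distance: wasserstein_1d(p, p) == 0
```

Since it is built from differentiable ops, this loss could be backpropagated through a decoder; whether the embedding dimensions form a meaningful 1-D "ground metric" for Wasserstein is an open question in the scheme above.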

It seems reasonable and simple, so I'd like to try it. Would really appreciate your opinion!