Using word embedding with loss function


(Cooganb) #1

Hi there friends!

I’m building an RNN that uses nn.Embedding for its input and target. Unfortunately, I’m having trouble with MSELoss, NLLLoss, and other loss functions, which give me this error message when I try to run loss.backward():

AssertionError: nn criterions don't compute the gradient w.r.t. targets - please mark these variables as volatile or not requiring gradients

However, nn.Embedding seems to require something similar to a gradient. What loss function can I use when my targets are word embeddings?

Thanks in advance!


(colesbury) #2

I don’t understand. Are you trying to learn the targets? Why do the targets come from an embedding?

If you really want to backpropagate to the targets, you can compute the MSE loss directly:

loss = ((input - target)**2).mean()

(Cooganb) #3

Hey! Thanks so much for your response. I super appreciate it. I’m not sure I explained my model well, so I apologize for any confusion.

I’m building a word-level LSTM RNN for text generation (I’ve already built the char-level and am hoping to compare the convergence rate of the two).

The corpus is so large, though, that it’s computationally impossible to run the word-level model on my GPU using one-hot vector encoding (the GPU crashes). I’d like to use nn.Embedding to create dense vectors with a fixed, reasonable dimension (I arbitrarily picked 1x100), which makes it possible to run the model.

I’m having problems when I get to back-prop, though. That’s where I’m getting stuck in my understanding of what’s going on.

It’s my understanding that if I’m feeding in a sequence of word-embedded vectors, one per word, the loss function will compare the model’s output prediction with the word-embedded target, and then I’ll call loss.backward() for back-prop. Does that make it clearer why I’m using word embeddings for the target? Eventually, when sampling, I’ll convert the model’s predicted embedding vector back to the word it represents.

Please let me know if this clears anything up. Again, I’m super grateful for your help with this.


(colesbury) #4

I think one normally passes the predicted embeddings through an nn.Linear(n_embed, n_vocab) (and then Softmax or LogSoftmax). That should give you relative “probabilities” of the next word.
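A minimal sketch of this setup (all names and dimensions here are illustrative, not from cooganb’s actual model): project the model’s output to vocabulary size, apply LogSoftmax, and train with NLLLoss against word *indices* rather than target embeddings.

```python
import torch
import torch.nn as nn

n_embed, n_vocab = 100, 5000  # assumed embedding and vocabulary sizes

decoder = nn.Linear(n_embed, n_vocab)  # embedding space -> vocab logits
log_softmax = nn.LogSoftmax(dim=1)
criterion = nn.NLLLoss()

hidden = torch.randn(32, n_embed)           # stand-in for the LSTM output
targets = torch.randint(0, n_vocab, (32,))  # next-word indices, not embeddings

log_probs = log_softmax(decoder(hidden))    # relative log-"probabilities"
loss = criterion(log_probs, targets)
loss.backward()  # gradients flow into the decoder and upstream layers
```

With integer targets the criterion never needs gradients w.r.t. the targets, so the original AssertionError doesn’t arise.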


(Cooganb) #5

Thanks for the response!

So I already have the predicted embeddings; I need to back-propagate the results. But when I pass the two embeddings (one being the predicted embedding, the other being the target embedding) to the loss function, I get this:

AssertionError: nn criterions don't compute the gradient w.r.t. targets - please mark these variables as volatile or not requiring gradients

I’d just like to know the best way to compute loss then back-propagate AFTER receiving the predicted output, which is also an embedded vector. Is it clear what I’m asking? Thanks again for your help.


(Cooganb) #6

(Let me know if seeing any of my code would be helpful and what in particular I should share.)


(colesbury) #7

If your targets are fixed (i.e. you are not optimizing the targets), do loss_fn(input, target.detach()).

If you are also optimizing the targets, use the expression above.
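In code, the fixed-target case looks like this (shapes are illustrative only):

```python
import torch
import torch.nn as nn

loss_fn = nn.MSELoss()
pred = torch.randn(8, 100, requires_grad=True)    # predicted embeddings
target = torch.randn(8, 100, requires_grad=True)  # e.g. an nn.Embedding lookup

# detach() removes the target from the graph, so the criterion no longer
# tries to compute gradients w.r.t. it and the assertion goes away
loss = loss_fn(pred, target.detach())
loss.backward()  # only `pred` (and anything upstream of it) gets gradients
```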


(Cooganb) #8

That works! Thank you so much


(Sean) #9

Hi cooganb, when your model was done training and you had it generate some text, what did you do when the predicted embedding wasn’t an actual word? It seems unreasonable to expect all 100 dimensions of the prediction to perfectly match a real word, so did you find the Euclidean-closest embedding to the prediction? Or maybe the most cosine-similar embedding? Neither of those is getting good results for me, so I’d really love to know how far you got with this.
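For reference, the cosine-similarity lookup I mean is something like this (vocabulary size and dimensions are made up):

```python
import torch
import torch.nn.functional as F

embedding_matrix = torch.randn(5000, 100)  # e.g. the nn.Embedding weight
predicted = torch.randn(100)               # the model's predicted vector

# cosine similarity of the prediction against every vocabulary embedding
sims = F.cosine_similarity(embedding_matrix, predicted.unsqueeze(0), dim=1)
best_idx = sims.argmax().item()  # index of the closest vocabulary word
```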


(Thomas V) #10

As far as I understand, the usual thing to do is take a softmax over the inner products of the model’s output with each embedding. That puts you in the same place as a classification problem at the softmax layer.
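Concretely, the inner-product trick looks something like this (dimensions assumed for illustration):

```python
import torch
import torch.nn.functional as F

embedding_matrix = torch.randn(5000, 100)  # e.g. the nn.Embedding weight
output = torch.randn(100)                  # the model's predicted vector

logits = embedding_matrix @ output         # (5000,) inner products, one per word
probs = F.softmax(logits, dim=0)           # a proper distribution over the vocab
next_word = probs.argmax().item()          # or sample from `probs` instead
```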

Best regards

Thomas


(Sean) #11

Thanks, I’ll try that next. If it works I’ll name my first son after you.


(Sairam Pillai) #12

I’m curious which loss function you used in the end. Which gave the best performance? Also, which optimizer worked best?