I’m building an RNN that uses nn.Embedding for its input and target. Unfortunately, I’m having trouble with MSELoss, NLLLoss, and other loss functions, which give me this error message when I try to run loss.backward():
AssertionError: nn criterions don't compute the gradient w.r.t. targets - please mark these variables as volatile or not requiring gradients
However, nn.Embedding seems to require something similar to a gradient. What loss function can I use with the word embeddings I have chosen for the target?
Hey! Thanks so much for your response. I super appreciate it. I’m not sure I explained my model well, so I apologize for any confusion.
I’m building a word-level LSTM RNN for text generation (I’ve already built the char-level and am hoping to compare the convergence rate of the two).
The corpus is so large, though, that it’s computationally impossible to run the word-level model on my GPU using one-hot vector encoding (the GPU crashes). I’d like to use nn.Embedding to create dense vectors with a fixed, reasonable dimension (I randomly picked 1x100), which makes it possible to run the model.
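For anyone following along, here is a minimal sketch of how nn.Embedding replaces one-hot encoding with dense vectors; the vocabulary size and the word indices below are made up for illustration:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: a 10,000-word vocabulary mapped to
# 100-dimensional dense vectors (the "1x100" mentioned above).
vocab_size, embed_dim = 10_000, 100
embedding = nn.Embedding(vocab_size, embed_dim)

# Inputs are plain integer word indices; no one-hot vectors are built.
word_ids = torch.tensor([[12, 305, 7, 42]])  # shape: (batch=1, seq_len=4)
dense = embedding(word_ids)                  # shape: (1, 4, 100)
print(dense.shape)  # torch.Size([1, 4, 100])
```

The memory savings come from the lookup: each word costs 100 floats instead of a 10,000-dimensional one-hot vector.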
I’m having problems when I get to back-prop, though. That’s where I’m getting stuck in my understanding of what’s going on.
It’s my understanding that if I’m feeding in a sequence of word-embedded vectors, one for each word, the loss function will compare the model’s output prediction with the word-embedded target, and then I’ll call loss.backward() for back-prop. Does that make it clearer why I’m using word embeddings for the target? Eventually, when sampling, I’ll convert the model’s predicted word-embedding vector back to the word it represents.
Please let me know if this clears anything up. Again, I’m super grateful for your help with this.
I think one normally passes the predicted embeddings through an nn.Linear(n_embed, n_vocab) (and then Softmax or LogSoftmax). That should give you relative “probabilities” of the next word.
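To make that concrete, here is a rough sketch of the setup I mean, with made-up sizes and random tensors standing in for the LSTM output. The key point is that the target is a tensor of integer word indices, not an embedded vector, so the criterion never needs a gradient with respect to the target:

```python
import torch
import torch.nn as nn

# Illustrative sizes (not from the original model).
n_vocab, hidden = 10_000, 128
batch, seq_len = 1, 4

decoder = nn.Linear(hidden, n_vocab)  # project LSTM output to vocabulary scores
log_softmax = nn.LogSoftmax(dim=-1)
criterion = nn.NLLLoss()

# Stand-in for the LSTM's output at each time step.
lstm_out = torch.randn(batch, seq_len, hidden, requires_grad=True)
log_probs = log_softmax(decoder(lstm_out))   # (batch, seq_len, n_vocab)

# Targets are plain word indices, NOT embedded vectors.
targets = torch.randint(0, n_vocab, (batch, seq_len))
loss = criterion(log_probs.view(-1, n_vocab), targets.view(-1))
loss.backward()  # gradients flow into the model, not the targets
```

At sampling time you can take the argmax (or sample) over the vocabulary dimension of log_probs, which sidesteps the nearest-embedding problem entirely.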
So I already have the predicted embeddings; I need to back-propagate the results. But when I pass the two embeddings (one being the predicted embedding, the other being the target embedding) to the loss function, I get this:
AssertionError: nn criterions don't compute the gradient w.r.t. targets - please mark these variables as volatile or not requiring gradients
I’d just like to know the best way to compute the loss and then back-propagate AFTER receiving the predicted output, which is also an embedded vector. Is it clear what I’m asking? Thanks again for your help.
Hi cooganb, when your model was done training and you had it generate some text for you, what did you do when the predicted embedding wasn’t an actual word? It seems unreasonable to expect all 100 dimensions of the prediction to perfectly match a real word, so did you find the Euclidean-closest embedding to the predicted one? Or maybe the most cosine-similar embedding? Neither of those is getting good results for me, so I would really love to know how far you got with this.
As far as I understand, the usual thing to do is take a softmax over the inner products of the embedding vectors with the model’s output. This puts you in the same place you would be at the softmax layer of a classification problem.
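A minimal sketch of that inner-product trick, with illustrative sizes; this is essentially weight tying, reusing the embedding matrix as the output projection so the "nearest embedding" search becomes an ordinary classification:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes (not from the original model).
n_vocab, n_embed = 10_000, 100
embedding = nn.Embedding(n_vocab, n_embed)

# Stand-in for the model's predicted 100-dimensional output vector.
output = torch.randn(1, n_embed)

# Inner product of the output with every word's embedding,
# then a log-softmax over the vocabulary.
scores = output @ embedding.weight.t()      # (1, n_vocab)
log_probs = F.log_softmax(scores, dim=-1)

# Training can then use NLLLoss with an integer word index as the
# target, and sampling can pick the argmax of log_probs.
target = torch.tensor([42])
loss = F.nll_loss(log_probs, target)
```

No separate cosine- or Euclidean-nearest lookup is needed at generation time; the softmax over inner products already ranks every word in the vocabulary.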