Loss of performance after removing nn.Embedding

Hi!

So I have a rather strange problem. I want to load a large embedding file into my network, but when I use nn.Embedding my GPU runs out of memory. So instead of loading the whole file, assigning it to the weight parameter of nn.Embedding and setting requires_grad to False, I pre-compute my input vectors and feed them to the model directly as embeddings.

I am able to do this because I have a server with more RAM (but no GPU) at my disposal, so I can convert my input vectors to embeddings before training the model, save them as a pickle file and load that file while training the model on the GPU. I thought this would improve my performance, since I am now using a larger embeddings file.
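Roughly, the two setups I am comparing look like this (just a minimal sketch with made-up shapes, indices and variable names, not my real code):

```python
import torch
import torch.nn as nn

# Setup 1: load the full pretrained matrix into a frozen nn.Embedding.
# This is what runs out of GPU memory for me; the shape here is a small placeholder.
pretrained_matrix = torch.randn(50_000, 300)
word_emb = nn.Embedding.from_pretrained(pretrained_matrix, freeze=True)  # freeze=True -> requires_grad=False

# Setup 2: precompute the vectors once (on the CPU/high-RAM server), pickle them,
# and feed them to the model directly, with no word nn.Embedding on the GPU at all.
word_indices = torch.tensor([77, 12, 5])
precomputed = word_emb(word_indices)  # done offline, then saved and loaded during training
```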

However, it results in an almost 3% decrease in precision and a 2% decrease in recall! I rechecked by following the same procedure with the smaller embeddings file. This time there’s a 3% decrease in both precision and recall, with the earlier (larger) file giving marginally better results.

My question is:

Does torch.nn.Embedding make any changes to the tensors when requires_grad is set to False? Or does it optimize the network in any way?

In other words, what could explain the decrease in performance when nn.Embedding is replaced with precomputed vectors fed directly as input?

The performance of the model will depend on how well the pretrained embedding suits your data.
Assuming you are using a standard pretrained embedding matrix, your inputs might not be well represented, and the embedding might need some fine-tuning to achieve reasonable results.

Could you explain your use case a bit more, i.e. which embedding are you using and what do your input tensors represent?

Hey @ptrblck thanks for replying!

So my use case involves predicting relationships between words in text. My input is a dict where the key is a 4-tuple (word, pos_tag, dep_tag, dir) and the value is the frequency of the corresponding path. I feed these inputs into an LSTM, which uses the tuple and the count to predict the type of relation.

Regarding the 4-tuple: all of these are fed as indices, and inside the network an nn.Embedding layer is used for each. Since I want the pos_tag, dep_tag and dir embeddings to update as the model trains, I leave the nn.Embedding layers for these terms as they are. But for the word embeddings, I precompute the embeddings using a pretrained Wiki2Vec model. The embedding dimension is 300. Earlier I was using GloVe word embeddings, but since they weren’t very well suited to my task, I decided to switch to Wiki2Vec.

Hope that explains my use case enough.

Also, could you please explain what you mean by inputs not being “well represented”?

An embedding layer contains a trainable weight matrix, which is used as a “lookup table” to transform sparse indices into dense vector representations.
E.g. pretrained embeddings might yield outputs which are meaningful for particular words, such that distances between the vectors correspond to differences in meaning. A well-known example is the vector arithmetic “king - man + woman ≈ queen”, where the arithmetic on the output vectors yields (approximately) the “queen” vector.

If you don’t train these embedding layers (and don’t load pretrained weights into them), the vector representations will stay random.
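As a small illustration of the lookup (a toy example, not tied to your model):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=10, embedding_dim=4)  # randomly initialized weight matrix
indices = torch.tensor([2, 7, 2])
vectors = emb(indices)                                  # shape (3, 4): rows 2, 7 and 2 of emb.weight

print(torch.equal(vectors[0], emb.weight[2]))           # True: the output is literally a row lookup
```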

Ah, I think I get you. Just one question though: this trainable weight matrix that you mention, does it train even when I set requires_grad to False?

No, you would have to leave the requires_grad attribute as True, so that gradients will be calculated and the optimizer can update the weight parameter.
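In code, the difference looks roughly like this (a generic sketch, not specific to your model):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(1000, 300)

# Frozen: no gradients are computed for the weight, so the optimizer never changes it.
emb.weight.requires_grad_(False)

# Trainable again: gradients flow and the weight gets updated during training.
emb.weight.requires_grad_(True)

# With partially frozen models it is common to pass only the trainable
# parameters to the optimizer:
model = nn.Sequential(emb, nn.Linear(300, 10))
optimizer = torch.optim.Adam(p for p in model.parameters() if p.requires_grad)
```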

But that’s the thing. I set requires_grad to False when using pretrained embeddings, even before I removed the nn.Embedding layer, because I didn’t want the pretrained Wiki2Vec embeddings to update. Now that I have removed that layer, and I include the Wiki2Vec embeddings in the input only, I observe a substantial decrease in performance.

And I can’t figure out why.

I’m apparently misunderstanding the use case.
How did you remove the layer but include the Wiki2Vec embeddings?

So remember the 4-tuple of indices (word, pos_tag, dep_tag and dir) that I mentioned? I change this 4-tuple so that instead of the word index, I pass the embedding vector of the word, and keep the other 3 params in the tuple (pos_tag, dep_tag and dir) exactly the same. This way, I don’t have to load the entire embeddings file into the PyTorch nn.Embedding layer, and only have to look up embeddings for the words I am training my model on.

Hope that clarifies my use case now. Sorry for not explaining it well earlier!
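To make that concrete, the preprocessing script looks roughly like this (a sketch; I’m assuming here that the Wiki2Vec vectors load via gensim’s word2vec text format, and the file names and sample data are placeholders):

```python
# Runs on the high-RAM server (no GPU needed).
import pickle
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("wiki2vec_300d.txt")   # 300-dim pretrained vectors

samples = [("permit", "NN", "obj", "pos")]                    # (word, pos_tag, dep_tag, dir)

def convert(sample):
    word, pos_tag, dep_tag, direction = sample
    word_vec = wv[word]                                       # 300-dim numpy vector for the word
    return (tuple(word_vec), pos_tag, dep_tag, direction)     # word replaced by its embedding

converted = [convert(s) for s in samples]

with open("precomputed_inputs.pkl", "wb") as f:               # loaded later by the training script
    pickle.dump(converted, f)
```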

OK, so only word is passed to the embedding layer, right?
What happens to the other tags? Do you pass them directly to your model and what are they representing? Are they encoded as indices as well?

No, word is not passed to the embedding layer. I don’t need to, since I already pass the word embedding as input to my network. In other words, I load the Wiki2Vec model using gensim in another script, convert the words to embeddings, keep the other three tags (pos_tag, dep_tag and dir) exactly the same, convert them to input tensors in the required format, and then pass them as input to the network.

I need to do this because the embeddings file is large and doesn’t fit into my GPU memory. But I do have a high-RAM server at my disposal, which I use for loading the model with gensim and converting each word to its embedding. Hence my reason for running 2 scripts: one that converts the text into all these tags and converts the words into embeddings, and another that contains the actual LSTM model.

The other tags represent the following: pos_tag, as the name suggests, represents the POS tag of a word; dep_tag represents the dependency label of the word; dir represents the direction of the path and can take two values, pos or neg.

These tags are passed as indices to the model. In other words, I construct a set of all dependency/POS tags and give each an index. There is an nn.Embedding layer in my network which takes these indices and converts them into embeddings of a specified dimension. Obviously, since they aren’t actual English words, I don’t load pretrained embeddings for these tags. I don’t set requires_grad to False here; they get updated as the model trains.

The only place where requires_grad was set to False is for the word embeddings, since there I was loading a pretrained embeddings file. But now I am removing that nn.Embedding layer for the words and passing the embeddings directly as input to the model.
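Concretely, the tag indexing is nothing fancy, roughly like this (a toy sketch with a single made-up sample):

```python
# Build index vocabularies for the tags (toy sample only).
all_samples = [("permit", "NN", "obj", "pos")]        # (word, pos_tag, dep_tag, dir)

pos2idx = {t: i for i, t in enumerate(sorted({s[1] for s in all_samples}))}
dep2idx = {t: i for i, t in enumerate(sorted({s[2] for s in all_samples}))}
dir2idx = {d: i for i, d in enumerate(["neg", "pos"])}

print(pos2idx, dep2idx, dir2idx)
```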

Let me just give an example as well, to make my point clear.

My input is of the form (‘permit’, ‘NN’, ‘obj’, ‘pos’). Here “permit” is the word, “NN” the POS tag, “obj” the dependency tag and “pos” the direction (“pos” is short for positive).

This tuple is then converted to indices like, say, (77, 3, 12, 1). So, here “permit” is the 77th word in my vocabulary, “NN” is the 3rd POS tag, “obj” is the 12th dependency tag and “pos” is 1st direction tag.

This was the earlier format of the input.

Now instead of (77, 3, 12, 1) I have: (tuple(embedding), 3, 12, 1) where embedding is the 300-dimensional word embedding of “permit”. This is input to the network.

For pos_tag, dep_tag and dir, there are 3 separate nn.Embedding layers that are initialized randomly, with requires_grad set to True. These layers take the indices (3, 12 and 1) and output the corresponding embeddings.
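So the model side looks roughly like this (a very rough sketch; the tag embedding dimension, hidden size, number of classes and how the path frequency is used are placeholders, not my actual values):

```python
import torch
import torch.nn as nn

class PathLSTM(nn.Module):
    """Sketch of the described setup: the 300-dim word vector is precomputed
    and fed in directly, while the three tag embeddings stay trainable."""
    def __init__(self, n_pos, n_dep, n_dir, tag_dim=25, hidden=128, n_classes=5):
        super().__init__()
        self.pos_emb = nn.Embedding(n_pos, tag_dim)   # randomly initialized, trainable
        self.dep_emb = nn.Embedding(n_dep, tag_dim)
        self.dir_emb = nn.Embedding(n_dir, tag_dim)
        self.lstm = nn.LSTM(300 + 3 * tag_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, word_vecs, pos_idx, dep_idx, dir_idx):
        # word_vecs: (batch, seq_len, 300) precomputed Wiki2Vec embeddings
        # *_idx:     (batch, seq_len)      integer tag indices
        x = torch.cat(
            [word_vecs, self.pos_emb(pos_idx), self.dep_emb(dep_idx), self.dir_emb(dir_idx)],
            dim=-1,
        )
        output, (h_n, c_n) = self.lstm(x)
        return self.out(h_n[-1])           # (batch, n_classes)

# Toy usage with random data, just to show the shapes:
model = PathLSTM(n_pos=20, n_dep=40, n_dir=2)
logits = model(torch.randn(8, 5, 300),
               torch.zeros(8, 5, dtype=torch.long),
               torch.zeros(8, 5, dtype=torch.long),
               torch.zeros(8, 5, dtype=torch.long))
```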

So to reiterate: since I am basically changing nothing except the way I supply the word embeddings, why am I observing a decrease in performance? Does PyTorch make any optimizations on embeddings even when requires_grad is set to False?

Any help would be greatly appreciated! Thanks.

Thanks for the information.

No, that shouldn’t be the case.

As you’ve described, you are only changing the way the word embedding is calculated and passed to the model. So for debugging you could compare the embedding vectors for a fixed batch and check whether both approaches yield the same output.
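E.g. something along these lines (with made-up placeholder tensors standing in for your frozen embedding and the pickled vectors):

```python
import torch
import torch.nn as nn

# Stand-ins: use your real frozen word nn.Embedding and the pickled,
# precomputed vectors for the same fixed batch of words.
pretrained_matrix = torch.randn(1000, 300)
word_emb = nn.Embedding.from_pretrained(pretrained_matrix, freeze=True)

word_indices = torch.tensor([77, 12, 5])
from_layer = word_emb(word_indices)
precomputed_batch = pretrained_matrix[word_indices]   # what the offline script would produce

print(torch.allclose(from_layer, precomputed_batch))  # should print True
# If this is False for your real data, the two pipelines are not feeding the
# model the same vectors (e.g. an index/vocabulary mismatch or a different file).
```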