Updating part of an embedding matrix (only for out of vocab words)

Hello all,

TLDR: I would like to update only the rows of an embedding matrix that correspond to out-of-vocab words, and keep the rows that already have pre-trained embeddings frozen.

I’ve seen some solutions (e.g. here, which I got working), but from what I can see they mainly rely on maintaining a second embedding matrix of the same size as the pre-trained/frozen one, which is too slow for my use case (speed is crucial, and this doubles the time per epoch). What is the best (and fastest) way to do this?

I hope this is clear, please let me know if not (first time posting here).

Many thanks in advance,

Mark


That seems odd - can’t you just split out the rows you need?

Are you sure? Even if you double the time spent in the embedding (forward + backward), the rest of the model would still be the same and run at the same speed.

The other solution, besides keeping two instances, is to zero out part of the gradient - either after the embedding backward, in embedding.weight.grad, or before it, by using a gradient hook on the embedding output. This makes the assumption (true for typical household optimizers) that the optimizer will leave parameter entries alone that consistently have zero gradient.

Best regards

Thomas

Hi Thomas,

Thank you for your reply. I’m not sure I follow what you mean by “just split out the rows you need”. I need some rows of the embedding matrix to have requires_grad = True and the rest set to False, but it doesn’t seem possible to do this by slicing the embedding weight with a mask (which just marks the out-of-vocab words) like:

self.embedding.weight[mask, :].requires_grad = True

Speed doubles because the model is very light, so almost all the complexity/parameters reside in the embedding layers.

Apologies if this doesn’t make sense - I’m a little new around here so it’s likely my shortcoming.

Mark

P.S. The latter part of your reply was over my head I’m afraid.

For the first, how about doing it similar to this:

import torch

class PartiallyFixedEmbedding(torch.nn.Module):
    def __init__(self, fixed_weights, num_to_learn):
        super().__init__()
        self.num_fixed = fixed_weights.size(0)
        self.num_to_learn = num_to_learn
        # full lookup table: fixed (pre-trained) rows first, trainable rows appended
        weight = torch.empty(self.num_fixed + num_to_learn, fixed_weights.size(1))
        weight[:self.num_fixed] = fixed_weights
        # only the appended rows are a Parameter, so only they receive gradients
        self.trainable_weight = torch.nn.Parameter(torch.empty(num_to_learn, fixed_weights.size(1)))
        torch.nn.init.kaiming_uniform_(self.trainable_weight)
        with torch.no_grad():  # keep the buffer out of the autograd graph here
            weight[self.num_fixed:] = self.trainable_weight
        self.register_buffer('weight', weight)
    def forward(self, inp):
        # drop the graph history from the previous forward, then re-insert the
        # trainable rows so that gradients flow back to trainable_weight
        self.weight.detach_()
        self.weight[self.num_fixed:] = self.trainable_weight
        return torch.nn.functional.embedding(
            inp, self.weight, None, None, 2.0, False, False)

Now the fixed_weights part of weight won’t be trained, but trainable_weight will be.
You do have two copies of the trainable weights and copy them on every forward, but the fixed weights just sit there.
This could be elaborated to take more of the standard embedding layer’s parameters, but I’ve left that out for now. It will run into trouble if you use the embedding twice in a single forward (because of the in-place detach).
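
For what it’s worth, here is a quick usage sketch (the sizes and the random “pre-trained” matrix are just placeholders):

pretrained = torch.randn(50000, 300)        # stand-in for real pre-trained vectors
emb = PartiallyFixedEmbedding(pretrained, 25000)

inp = torch.randint(0, 75000, (8, 20))      # batch of word ids
out = emb(inp)                              # shape (8, 20, 300)
out.sum().backward()

print([name for name, _ in emb.named_parameters()])  # ['trainable_weight'] - the only trainable tensor
print(emb.trainable_weight.grad.shape)               # torch.Size([25000, 300])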

The second option I (all too tersely) tried to describe is something like

import torch

class PartiallyFixedEmbedding2(torch.nn.Module):
    def __init__(self, fixed_weights, num_to_learn):
        super().__init__()
        self.num_fixed = fixed_weights.size(0)
        self.num_to_learn = num_to_learn
        weight = torch.empty(self.num_fixed + num_to_learn, fixed_weights.size(1))
        weight[:self.num_fixed] = fixed_weights
        torch.nn.init.kaiming_uniform_(weight[self.num_fixed:])
        self.weight = torch.nn.Parameter(weight)
        # 1.0 for trainable rows, 0.0 for fixed rows; could be made more flexible
        self.register_buffer('learnable_mask',
                             (torch.arange(self.num_fixed + num_to_learn).unsqueeze(1) >= self.num_fixed).float())
        # register the hook once here; registering it inside forward would add
        # another copy of the hook on every call
        def zero_grad_fixed(gr):
            return gr * self.learnable_mask
        self.weight.register_hook(zero_grad_fixed)
    def forward(self, inp):
        return torch.nn.functional.embedding(
            inp, self.weight, None, None, 2.0, False, False)

Here zero_grad_fixed zeros the gradient of the fixed rows. It does so on the gradient of the full weight matrix which, for large embeddings, isn’t efficient. In that case you can, however, use the same technique to compute the mask from inp and add the hook on the output of the embedding call. That way you’d multiply the (much smaller) gradient of the embedding’s output instead of the gradient of the weights.
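
For concreteness, that output-hook variant would be something like this (untested, and PartiallyFixedEmbedding3 is just a name I’m making up here):

import torch

class PartiallyFixedEmbedding3(torch.nn.Module):
    def __init__(self, fixed_weights, num_to_learn):
        super().__init__()
        self.num_fixed = fixed_weights.size(0)
        weight = torch.empty(self.num_fixed + num_to_learn, fixed_weights.size(1))
        weight[:self.num_fixed] = fixed_weights
        torch.nn.init.kaiming_uniform_(weight[self.num_fixed:])
        self.weight = torch.nn.Parameter(weight)
    def forward(self, inp):
        out = torch.nn.functional.embedding(
            inp, self.weight, None, None, 2.0, False, False)
        if out.requires_grad:  # skip the hook under torch.no_grad()
            # mask shaped like the batch: 1.0 where the looked-up word is trainable
            mask = (inp >= self.num_fixed).float().unsqueeze(-1)
            out.register_hook(lambda gr: gr * mask)
        return out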

Best regards

Thomas


Hi Thomas,

Firstly, thank you for taking the time to provide this (and apologies for the slow reply, I’ve been away) - it’s incredibly helpful.

I haven’t implemented it yet, but if I’ve understood correctly you are relying on the word-to-number mapping being done in such a way that all the out-of-vocab words are at the bottom of the embedding matrix? If that is the case, is it not possible to set things up with two embedding matrices?

For example, if we end up constructing the word-to-number mapping for a vocab of 75k with 25k words that have no pre-trained word vectors, could we do something like:

  • All words with pre-trained embeddings have numbers less than 50,000.
  • All out-of-vocab words have numbers 50,000 and above.
  • Have 2 embedding matrices, one with 50,001 rows and one with 25,001 rows.
  • On the forward pass: clamp any numbers of 50,000 or above to 50,000 (i.e. they’ll get the spare 50,001st row) and use these ids to grab vectors from the pre-trained embedding.
  • Also on the forward pass: alongside this (with a copy of the data?) subtract 50,000 from the out-of-vocab numbers and set any numbers that were below 50k to 25,000 (i.e. they’ll get the spare 25,001st row). Use these ids with the embedding matrix I wish to train.

That said, ideally I wouldn’t have to mess about with the way the word-to-number mapping is done.

Apologies if I’ve missed the boat and thanks again for your help.

Mark

That strategy would probably work. If you can afford to do two embeddings, you can use .clamp and then torch.where to get the right embeddings.

I’d venture that sorting the entries (so the fixed embeddings sit in one index range and the trainable ones in another) is likely more efficient than having a mask, but I think the mask idea alluded to above might also be a good choice.
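
As a rough (untested) sketch of the .clamp / torch.where idea, with the 50k split from your example and made-up names - clamping into each table’s valid range also means you don’t need the extra dummy rows:

import torch

class TwoPartEmbedding(torch.nn.Module):
    def __init__(self, fixed_weights, num_oov):
        super().__init__()
        self.num_fixed = fixed_weights.size(0)            # e.g. 50,000 pre-trained words
        self.fixed = torch.nn.Embedding.from_pretrained(fixed_weights, freeze=True)
        self.trained = torch.nn.Embedding(num_oov, fixed_weights.size(1))  # e.g. 25,000 rows
    def forward(self, inp):
        is_oov = inp >= self.num_fixed
        # clamp ids into each table's range, look up both, then pick per position
        fixed_emb = self.fixed(inp.clamp(max=self.num_fixed - 1))
        oov_emb = self.trained((inp - self.num_fixed).clamp(min=0))
        return torch.where(is_oov.unsqueeze(-1), oov_emb, fixed_emb)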

Best regards

Thomas

Thank you again Thomas.

Just to confirm I haven’t misunderstood: your PartiallyFixedEmbedding solution relies on the word-to-id mapping being done in such a way that the out-of-vocab words get the higher id values? This is also the case for my proposed solution… unfortunately it’s not clear to me how to do this efficiently without digging into the underlying tokenization, i.e. as we map a word to an id (usually based on count) we need to check whether it has a pre-trained embedding or not and then adjust the id. I have a few thoughts about how to do this and will report back if I get a solution.

I do find it a little puzzling that there is no facility to only partially train an embedding matrix (which is generally huge)…it’s basically all or nothing.

Thanks,

Mark

Personally, I think it’s most efficient to do the reindexing beforehand, similar to:

import torch

NUM_WORDS = 1000
is_fixed = torch.bernoulli(torch.full((NUM_WORDS,), 0.5))  # dummy: 1.0 where the word has a pre-trained (fixed) vector
fixed_to_original = [i for i in range(NUM_WORDS) if is_fixed[i]]
num_fixed = len(fixed_to_original)
trained_to_original = [i for i in range(NUM_WORDS) if not is_fixed[i]]
sorted_to_original = fixed_to_original + trained_to_original  # fixed ids first, then trainable
original_to_sorted = {orig: new for new, orig in enumerate(sorted_to_original)}

but if you want to use an arbitrary mask and recreate it for every batch you feed, by looking up whether the input words are fixed-vector ones, that’s not too hard either.
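
For example (reusing is_fixed from above, untested), the per-batch mask for the hook-on-the-output approach could be built like:

# inp: (batch, seq_len) tensor of word ids
batch_mask = 1.0 - is_fixed[inp]        # 1.0 where the looked-up word is trainable
batch_mask = batch_mask.unsqueeze(-1)   # broadcast over the embedding dimension
# then: out.register_hook(lambda gr: gr * batch_mask) on the embedding output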

Best regards

Thomas


Thank you Thomas. I might be testing your patience now but I have data like:

max_words, data_sz, max_len = 75000, 1000000, 75
train_X = torch.randint(0, max_words, (data_sz, max_len), dtype=torch.long)

The value of each id in train_X directs it to a corresponding row of an embedding matrix. We wish to use 2 embedding matrices, and of the 75,000 ids in train_X, say, 15,000 of them are out of vocab. We thus wish to efficiently direct these ids to another embedding somehow, i.e. re-index them so they are numbered between 0 and 15,000 - I’m still not clear on this.

Hope this makes sense - I have another thing to try if this fails so no worries either way.

Mark

It’s not efficient (I’d do it as a one-off, when 30 seconds don’t matter), but

import torch

max_words, data_sz, max_len = 75000, 1000000, 75
train_X = torch.randint(0, max_words, (data_sz, max_len), dtype=torch.long)
is_fixed = torch.bernoulli(torch.full((max_words,), 0.5))  # dummy: 1.0 where the word has a fixed (pre-trained) vector
fixed_to_original = [i for i in range(max_words) if is_fixed[i]]
num_fixed = len(fixed_to_original)
trained_to_original = [i for i in range(max_words) if not is_fixed[i]]
original_to_sorted = {orig: new for new, orig in enumerate(fixed_to_original + trained_to_original)}
train_X_mapped = train_X.clone().apply_(original_to_sorted.get)  # apply_ is CPU-only and loops in Python

would seem to do that.
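
As an aside, if the one-off loop ever does get annoying, the same remapping can be done with a lookup tensor instead of apply_ (untested sketch):

# lookup[original_id] = sorted_id, using the same fixed-first ordering as above
lookup = torch.empty(max_words, dtype=torch.long)
lookup[torch.tensor(fixed_to_original + trained_to_original)] = torch.arange(max_words)
train_X_mapped = lookup[train_X]   # plain tensor indexing, no Python loop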

Best regards

Thomas


30 seconds is fine! I had previously tried a numpy way to reindex, which was horrible.

Just want to say a huge thank you for your help and patience Thomas - truly appreciated.

Mark

Hi, I was wondering if it makes sense to just reset relevant gradients between loss.backward() and optimizer.step() to achieve this goal. So basically do the following: word_embed.weight.grad[1000:] = 0.

I think you’d need something that iterates over model.parameters() and zeros out the gradients of the ones you want frozen (if you can identify them):

E.g. something like this adapted from the main tutorial.

with torch.no_grad():
    for param in model.parameters():
        # in practice you'd only zero the entries you want frozen,
        # e.g. param.grad[rows_to_freeze] = 0, rather than the whole gradient
        param.grad = param.grad * 0

The problem with this is that you’re still calculating all the gradients, which can be expensive.
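
Putting the two posts together, the step order would look roughly like this (model, criterion and the batch tensors are placeholders, and the 1000 cutoff is just the example from above):

optimizer.zero_grad()
loss = criterion(model(batch_X), batch_y)
loss.backward()
with torch.no_grad():
    # wipe the gradient of the rows you want to keep frozen before the update
    model.word_embed.weight.grad[1000:] = 0
optimizer.step()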

Good idea! It works on my model when running on a single gpu. But when I use DataParallel to run on multiple gpus, an error occurred: RuntimeError: Can't detach views in-place. Use detach() instead. Do you have any advice to solve it? Thanks~

You could sidestep the error message by using weight = self.weight.detach() to get a local tensor (sharing the same storage), and then using weight in place of self.weight.
I don’t have advice on whether the in-place modification plays well with DataParallel.
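
Applied to the forward of the PartiallyFixedEmbedding above, that would look something like (untested):

    def forward(self, inp):
        # detach() returns a new tensor sharing storage with the buffer,
        # so the in-place detach_() is no longer needed
        weight = self.weight.detach()
        weight[self.num_fixed:] = self.trainable_weight
        return torch.nn.functional.embedding(
            inp, weight, None, None, 2.0, False, False)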

Best regards

Thomas