Updating part of an embedding matrix (only for out-of-vocab words)

Does that really seem wrong? You would just split out the rows you need.

Are you sure? Even if you doubled the time spent in the embedding (forward + backward), the rest of the model would still be the same and run at the same speed.
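A minimal sketch of what the two-instance split could look like, assuming the OOV indices are appended after the pretrained vocabulary (the class and names below are made up for illustration, not from your code):

```python
import torch
import torch.nn as nn

class SplitEmbedding(nn.Module):
    """Pretrained rows stay frozen; only the OOV rows receive gradients."""
    def __init__(self, pretrained_weight, num_oov):
        super().__init__()
        embedding_dim = pretrained_weight.size(1)
        # Frozen part: the pretrained vocabulary.
        self.fixed = nn.Embedding.from_pretrained(pretrained_weight, freeze=True)
        # Trainable part: rows for out-of-vocab words.
        self.oov = nn.Embedding(num_oov, embedding_dim)
        self.num_pretrained = pretrained_weight.size(0)

    def forward(self, idx):
        # Indices >= num_pretrained refer to OOV rows.
        is_oov = idx >= self.num_pretrained
        fixed_idx = idx.clamp(max=self.num_pretrained - 1)
        oov_idx = (idx - self.num_pretrained).clamp(min=0)
        return torch.where(
            is_oov.unsqueeze(-1),
            self.oov(oov_idx),
            self.fixed(fixed_idx),
        )
```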

The other solution, besides keeping two instances, is to mask (zero out) the gradient for the rows you don't want updated: either after the embedding backward, directly in embedding.weight.grad, or before that by registering a gradient hook on the embedding output. This relies on the assumption (true for typical household optimizers) that the optimizer will leave parameter entries alone when they consistently have zero gradient.
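A minimal sketch of the masking variant, assuming the pretrained rows come first in the weight matrix (the sizes are made-up numbers):

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim, num_pretrained = 1000, 32, 900
emb = nn.Embedding(vocab_size, embedding_dim)
optimizer = torch.optim.SGD(emb.parameters(), lr=0.1)

idx = torch.randint(0, vocab_size, (8, 5))
loss = emb(idx).sum()
loss.backward()

# Zero the gradient of the pretrained rows so the optimizer leaves them alone.
# (The hook variant would do this masking before the gradient reaches the weight.)
emb.weight.grad[:num_pretrained] = 0
optimizer.step()  # only the OOV rows move
```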

Best regards

Thomas