Normalizing Embeddings

I’m trying to manually normalize my embeddings with their L2 norms instead of using PyTorch’s max_norm (as max_norm seems to have some bugs). I’m following this link, and below is my code:

emb = torch.nn.Embedding(4, 2)
norms = torch.norm(emb.weight, p=2, dim=1).detach()
emb.weight = emb.weight.div(norms.expand_as(emb.weight))

But I’m getting the following error:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/torch/autograd/", line 725, in expand_as
return Expand.apply(self, (tensor.size(),))
File "/usr/local/lib/python2.7/site-packages/torch/autograd/_functions/", line 111, in forward
result = i.expand(*new_size)
RuntimeError: The expanded size of the tensor (2) must match the existing size (4) at non-singleton dimension 1. at /Users/soumith/code/builder/wheel/pytorch-src/torch/lib/TH/generic/THTensor.c:308

When I look at the size of norms, it’s (4L,)
Any idea where I’m going wrong? Thanks!
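(For reference: the error comes from the shape of norms. torch.norm with dim=1 drops the reduced dimension, giving a 1-D tensor of size (4,), which cannot be expanded to (4, 2). In recent PyTorch versions, keepdim=True keeps that dimension. A minimal sketch:)

```python
import torch

emb = torch.nn.Embedding(4, 2)

# dim=1 reduces away the second dimension, so norms is 1-D with shape (4,);
# expand_as on the (4, 2) weight then fails at non-singleton dimension 1
norms = torch.norm(emb.weight, p=2, dim=1)
assert norms.shape == torch.Size([4])

# keepdim=True keeps the reduced dimension as size 1, so the (4, 1) result
# broadcasts cleanly against the (4, 2) weight
norms = torch.norm(emb.weight, p=2, dim=1, keepdim=True)
normalized = emb.weight / norms
```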


Shouldn’t it be:

emb.weight = emb.weight.div(norms.expand_as(emb.weight))

Yes, that was a typo. I edited the code, and also edited the error. Any idea why I’m getting that error?

emb = torch.nn.Embedding(4, 2)
norms = torch.norm(emb.weight, p=2, dim=1).data
emb.weight.data = emb.weight.data.div(norms.view(4, 1).expand_as(emb.weight))

Thanks Chen. It worked.
I realized that you have removed the “detach()” from the second line. Is it because “norms” is not a “Variable” anymore?

Yes, you’re right.
I prefer


What is the use of detach()? I do not think it is necessary here. I think we can also use:

emb = torch.nn.Embedding(4,2)
norm = emb.weight.norm(p=2, dim=1, keepdim=True)
emb.weight = emb.weight.div(norm.expand_as(emb.weight))

Are there any problems with the above snippet?
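(One likely problem, for what it’s worth: assigning the result of div() — a plain tensor — to emb.weight raises a TypeError, because nn.Module only accepts an nn.Parameter (or None) as a parameter attribute. A sketch of two in-place alternatives in current PyTorch; treat them as suggestions rather than the canonical fix:)

```python
import torch
import torch.nn.functional as F

emb = torch.nn.Embedding(4, 2)

# Update the parameter in place, outside of autograd, so emb.weight
# stays an nn.Parameter and no graph is recorded for the update
with torch.no_grad():
    emb.weight.div_(emb.weight.norm(p=2, dim=1, keepdim=True))

# Equivalently, F.normalize performs the row-wise L2 normalization in one call
with torch.no_grad():
    emb.weight.copy_(F.normalize(emb.weight, p=2, dim=1))
```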


My understanding is that if we don’t detach it, then norm will be a variable and PyTorch will aim at optimizing its values in the backward phase when calculating the gradients. But we don’t really want to optimize the norm values.

The type of norm is torch Variable. PyTorch will only calculate the gradient of the loss w.r.t. the leaf nodes. Since norm is not a leaf node, I do not think it will be updated when we do optimizer.step(). Only emb.weight will be updated, since it is of type torch.nn.Parameter and is the learnable parameter of the module.
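(In current PyTorch, where Variable has been merged into Tensor, this is easy to check with is_leaf; a small sketch:)

```python
import torch

emb = torch.nn.Embedding(4, 2)
norm = emb.weight.norm(p=2, dim=1, keepdim=True)

# emb.weight is a learnable nn.Parameter, hence a leaf of the autograd graph;
# norm is computed from it, so it is an intermediate node, not a leaf.
# Gradients flow *through* norm, but an optimizer only steps the parameters
# it was given (e.g. emb.parameters()).
assert emb.weight.is_leaf
assert not norm.is_leaf
assert norm.requires_grad
```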


@jdhao I was seriously searching the web for this kind of answer, sorry for taking this discussion up again. Could you please explain more about why norm is not a leaf node in the gradient computation?

So this doesn’t precisely replicate the max-norm functionality, because it doesn’t check whether the norm of a vector is already less than max-norm, right? (In that case, this function would actually increase the norm of those vectors.)
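(Right: plain division rescales every row, including rows whose norm is already below the bound, which increases their norm. A conditional version can be sketched with torch.renorm, which only shrinks rows whose p-norm exceeds maxnorm and leaves the rest untouched:)

```python
import torch

max_norm = 1.0
emb = torch.nn.Embedding(4, 2)
before = emb.weight.detach().clone()

with torch.no_grad():
    # dim=0: each row (each embedding vector) is renormalized independently;
    # rows already within max_norm are left as they are
    emb.weight.copy_(torch.renorm(emb.weight, p=2, dim=0, maxnorm=max_norm))
```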