Many loss functions in face recognition normalize the features or the weights before computing the softmax loss, for example NormFace (https://arxiv.org/abs/1704.06369) and L2-softmax (https://arxiv.org/abs/1703.09507).
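As I understand these papers, the normalization turns the logits into (scaled) cosine similarities. A minimal sketch of the idea as I understand it, with made-up shapes, not code from either paper:

```python
import torch
import torch.nn.functional as F

# With unit-norm features and unit-norm class weights, each logit is the
# cosine similarity between a feature and a class weight vector (the papers
# additionally rescale by a constant s before the softmax).
x = torch.randn(8, 2)   # batch of 2-d features (made-up shapes)
W = torch.randn(10, 2)  # classification weights, one row per class
logits = F.linear(F.normalize(x, dim=1), F.normalize(W, dim=1))
loss = F.cross_entropy(logits, torch.randint(0, 10, (8,)))
```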
I’d like to know how to normalize the weights of the last classification layer. The relevant layers in my model are:

```python
self.feature = torch.nn.Linear(7*7*64, 2)        # feature extraction layer
self.pred = torch.nn.Linear(2, 10, bias=False)   # classification layer
```
I want to replace the weight parameter of the `self.pred` module with a normalized one. In other words, I want to replace the weight in place, like this:

```python
self.pred.weight = self.pred.weight / torch.norm(self.pred.weight, dim=1, keepdim=True)
```
When I try to do this, I get the following error:

```
TypeError: cannot assign 'torch.FloatTensor' as parameter 'weight' (torch.nn.Parameter or None expected)
```

I am a newcomer to PyTorch and don’t know the standard way to handle this.
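From the error message, my guess is that I either have to wrap the result in `torch.nn.Parameter` or mutate the existing parameter in place outside of autograd. Something like the following seems to avoid the `TypeError` for a one-time normalization at init, but I’m not sure either option is the right pattern:

```python
import torch

pred = torch.nn.Linear(2, 10, bias=False)

# Option 1: modify the parameter's values in place, outside of autograd.
with torch.no_grad():
    pred.weight.div_(pred.weight.norm(dim=1, keepdim=True))

# Option 2: replace the parameter with a new one built from a detached copy
# (nn.Parameter expects a leaf tensor, hence the .detach()).
pred.weight = torch.nn.Parameter(
    (pred.weight / pred.weight.norm(dim=1, keepdim=True)).detach()
)
```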
Here is the whole code:
```python
import torch

class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.backbone = torch.nn.Sequential(
            torch.nn.Conv2d(1, 8, kernel_size=7, stride=1, padding=3),
            # 28 * 28
            torch.nn.Conv2d(8, 16, kernel_size=5, stride=1, padding=2),
            torch.nn.ReLU(),
            torch.nn.MaxPool2d(stride=2, kernel_size=2),
            # 14 * 14
            torch.nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1),
            torch.nn.ReLU(),
            torch.nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=1),
            torch.nn.ReLU(),
            torch.nn.MaxPool2d(stride=2, kernel_size=2),
            # 7 * 7
            torch.nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1),
            torch.nn.ReLU(),
            torch.nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),
            torch.nn.ReLU(),
        )
        self.feature = torch.nn.Linear(7*7*64, 2)        # feature extraction layer
        self.pred = torch.nn.Linear(2, 10, bias=False)   # classification layer

        ## something wrong here
        self.pred.weight = self.pred.weight / torch.norm(self.pred.weight, dim=1, keepdim=True)

        for m in self.modules():
            if isinstance(m, torch.nn.Conv2d):
                torch.nn.init.xavier_normal_(m.weight)
            elif isinstance(m, torch.nn.BatchNorm2d):
                torch.nn.init.constant_(m.weight, 1)
                torch.nn.init.constant_(m.bias, 0)

    def forward(self, x):
        x = self.backbone(x)
        x = x.view(-1, 7 * 7 * 64)
        x = self.feature(x)
        x = self.pred(x)
        return x
```
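If normalizing once in `__init__` is the wrong place (I suspect the weight would drift away from unit norm as soon as the optimizer updates it), my guess is to keep `self.pred.weight` as the raw parameter and re-normalize it on every forward pass. Is something like this replacement for the `forward` above (assuming `import torch.nn.functional as F` at the top of the file) the standard way?

```python
def forward(self, x):
    x = self.backbone(x)
    x = x.view(-1, 7 * 7 * 64)
    x = self.feature(x)
    # Guess: normalize the raw weight on the fly instead of storing a
    # normalized copy, so gradients still flow to self.pred.weight.
    w = F.normalize(self.pred.weight, dim=1)
    return F.linear(x, w)
```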
Besides, a few more questions:

- If I don’t want to use the `view` function in `forward`, how can I handle the flattening in `__init__` instead? (My guess is sketched below.)
- How can I extract the weight both before normalization and after normalization?
- If I want to normalize the feature, i.e. the output of `self.feature` between the last two fc layers, in `__init__`, how can I do it?
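For the first point, my guess is that `torch.nn.Flatten`, if my PyTorch version has it, could be appended to the `Sequential` backbone so that `forward` no longer needs `view`:

```python
import torch

# Guess: append Flatten to the Sequential so view() becomes unnecessary.
backbone = torch.nn.Sequential(
    torch.nn.Conv2d(1, 64, kernel_size=3, stride=1, padding=1),
    torch.nn.Flatten(),  # (N, 64, 28, 28) -> (N, 64*28*28)
)
print(backbone(torch.randn(1, 1, 28, 28)).shape)  # torch.Size([1, 50176])
```

And for the second point, I assume keeping the raw parameter and computing the normalized copy on demand (`w_raw = model.pred.weight.detach()`; `w_norm = F.normalize(w_raw, dim=1)`) would give me both versions, but please correct me if that is wrong.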
Thanks a lot!