Allowing weight sharing in the last layer -- ValueError: can't optimize a non-leaf Tensor

rasbt · December 17, 2018, 4:07am

Hi all,

I need a bit of help with my PyTorch code. What I am trying to do is to share the weights in the last layer that connects to the output layer while the bias should still be independent.

I.e., what I am trying to do is to duplicate the weights in one row of the weight matrix of the last fully connected layer over the number of output units. E.g., suppose the last hidden layer and the output layer look like this:

[image: the last hidden layer and the output layer]

What I want to achieve is that these weights are the same:

[image: the weights that should be identical]

Suppose I have a convolutional neural network like this:

import torch
import torch.nn.functional as F


class ConvNet(torch.nn.Module):

    def __init__(self, num_classes):
        super(ConvNet, self).__init__()
        

        self.conv_1 = torch.nn.Conv2d(in_channels=3,
                                      out_channels=20,
                                      kernel_size=(5, 5),
                                      stride=(1, 1))

        self.conv_2 = torch.nn.Conv2d(in_channels=20,
                                      out_channels=40,
                                      kernel_size=(7, 7),
                                      stride=(1, 1),
                                      padding=1)                                  

        self.conv_3 = torch.nn.Conv2d(in_channels=40,
                                      out_channels=80,
                                      kernel_size=(11, 11),
                                      stride=(1, 1),
                                      padding=0)                                 
        ###############################################
        
        self.linear_1 = torch.nn.Linear(1*1*80, num_classes)
        
        # Weight sharing
        self.linear_1.weight[1:] = self.linear_1.weight[0]

        
    def forward(self, x):
        out = self.conv_1(x)
        out = F.relu(out)
        out = F.max_pool2d(out, kernel_size=(2, 2), stride=(2, 2))
        
        out = self.conv_2(out)
        out = F.relu(out)
        out = F.max_pool2d(out, kernel_size=(2, 2), stride=(2, 2))

        out = self.conv_3(out)
        out = F.relu(out)
        out = F.max_pool2d(out, kernel_size=(2, 2), stride=(2, 2))
        
        logits = self.linear_1(out.view(-1, 1*1*80))
        probas = F.softmax(logits, dim=1)

        return logits, probas

I thought I could maybe achieve this by setting

        # Weight sharing
        self.linear_1.weight[1:].requires_grad = False
        self.linear_1.weight[1:] = self.linear_1.weight[0]

or

        # Weight sharing
        self.linear_1.weight[1:] = self.linear_1.weight[0]

as shown in the code example above. Unfortunately, this throws a ValueError: can't optimize a non-leaf Tensor.

Another thing I tried was

        # Weight sharing
        self.linear_1.weight[1:] = self.linear_1.weight[1:].detach()
        self.linear_1.weight[1:] = self.linear_1.weight[0].detach()

But this yields the same error.

Does anyone have an idea how I could achieve this weight sharing in the last layer? I would really appreciate it!
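For context on the error message: PyTorch optimizers only accept leaf tensors, i.e. tensors that were created directly rather than produced by an operation. The snippet below is a general illustration of that rule, not a reproduction of the exact code path in the model above; the names `w` and `shared` are made up for the example.

```python
import torch

# A leaf tensor: created directly and tracked by autograd.
w = torch.randn(3, 4, requires_grad=True)

# A non-leaf tensor: the result of an operation on `w`
# (here, row 0 repeated over three rows).
shared = w[0].expand(3, 4)

print(w.is_leaf, shared.is_leaf)

# Handing a non-leaf tensor to an optimizer raises the ValueError.
try:
    torch.optim.SGD([shared], lr=0.1)
    raised = False
except ValueError as err:
    raised = True
    print(err)
```

So whatever ends up registered as the layer's weight has to stay a leaf; the sharing itself has to happen somewhere else, e.g. in the forward pass.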

vmirly1 · December 17, 2018, 4:21am

Would it work to define a single linear layer with an output of size 1 and the bias disabled, using self.linear = torch.nn.Linear(1*1*80, 1, bias=False)? Then you run the FC layer only once and add the biases manually. You can define the bias tensors b1 and b2 as nn.Parameter and add them to the output of the final FC layer to get the two different outputs. That way, the weights of the FC layer are shared, while the biases are defined separately and added independently to each output unit.

rasbt · December 17, 2018, 4:38am

Thanks, but I don't think this would be an ideal workaround: with ~100 output units, I would have to do 100 separate (matrix * sharedvector + bias_i) operations. Instead, I'd like a weight matrix with sharedvector as columns so that the GPU can do a regular matrix-matrix multiplication.
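A weight matrix built from one shared vector, as described above, could be sketched by keeping only a single row as the trainable parameter and expanding it inside forward. This is an alternative sketch, not what the thread ends up using; the class name `SharedRowLinear` and the initialization scale are assumptions. Note that torch.nn.Linear stores its weight as (out_features, in_features), so the shared vector shows up as rows here.

```python
import torch
import torch.nn.functional as F


class SharedRowLinear(torch.nn.Module):  # hypothetical name
    def __init__(self, in_features, num_classes):
        super().__init__()
        self.num_classes = num_classes
        # One trainable row (a leaf tensor) plus an independent bias per unit.
        # The 0.01 scale is an arbitrary choice for this sketch.
        self.weight = torch.nn.Parameter(torch.randn(1, in_features) * 0.01)
        self.bias = torch.nn.Parameter(torch.zeros(num_classes))

    def forward(self, x):
        # expand() creates a view that repeats the row without copying memory.
        full_weight = self.weight.expand(self.num_classes, -1)
        return F.linear(x, full_weight, self.bias)


layer = SharedRowLinear(in_features=80, num_classes=100)
out = layer(torch.randn(4, 80))
out.sum().backward()
print(out.shape, layer.weight.grad.shape)
```

The optimizer only ever sees the (1, 80) parameter and the bias, so every row of the expanded matrix stays identical by construction.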

vmirly1 · December 17, 2018, 4:45am

So, for example, if the input to the FC layer has 200 units and we want an output of size 100, a single linear layer is called that takes an input of size 200 and produces an output of size 1. Then the bias vector of size 100 is added to the output of the linear layer via broadcasting.
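The shapes in this 200-to-100 example can be checked with a few lines. The batch size of 4 and the bias values are arbitrary choices for illustration.

```python
import torch

batch_size, in_features, num_outputs = 4, 200, 100

linear = torch.nn.Linear(in_features, 1, bias=False)
# Arbitrary, distinct bias values so the outputs visibly differ per unit.
bias = torch.arange(num_outputs, dtype=torch.float32)

x = torch.randn(batch_size, in_features)
shared_out = linear(x)        # shape: (batch_size, 1)
logits = shared_out + bias    # broadcast to (batch_size, num_outputs)

print(shared_out.shape, logits.shape)
```

Each column of `logits` is the same shared output shifted by that unit's own bias.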

rasbt · December 17, 2018, 4:53am

Thanks a lot Vahid. That’s a good point. Doing the matrix-matrix multiplication with the shared weights is wasteful. It’s much more efficient to only have a weight vector and then duplicate the outputs, and then add the bias to it. For future reference, the modification (which seems to work) is:

Have only a weight vector and define the bias manually:

        self.linear_1 = torch.nn.Linear(1*1*80, 1, bias=False)
        # Independent bias per output unit, created directly with the weight's dtype
        self.linear_1_bias = torch.nn.Parameter(
            torch.zeros(num_classes, dtype=self.linear_1.weight.dtype))

Then duplicate the outputs over all output units and add the bias vector:

        logits = self.linear_1(out.view(-1, 1*1*80))
        # logits has shape (batch_size, 1); adding the bias of shape (num_classes,)
        # broadcasts it to (batch_size, num_classes)
        logits = logits + self.linear_1_bias
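Put together, the two pieces above might look like the following stand-alone head. This is a minimal sketch: the class name `SharedWeightHead`, the random inputs, and the placeholder loss are assumptions made only to show that both parameters are accepted by an optimizer and receive gradients.

```python
import torch
import torch.nn.functional as F


class SharedWeightHead(torch.nn.Module):  # hypothetical name
    def __init__(self, in_features, num_classes):
        super().__init__()
        self.linear_1 = torch.nn.Linear(in_features, 1, bias=False)
        self.linear_1_bias = torch.nn.Parameter(torch.zeros(num_classes))

    def forward(self, features):
        # (batch_size, 1) + (num_classes,) -> (batch_size, num_classes)
        logits = self.linear_1(features) + self.linear_1_bias
        probas = F.softmax(logits, dim=1)
        return logits, probas


head = SharedWeightHead(in_features=80, num_classes=10)
optimizer = torch.optim.SGD(head.parameters(), lr=0.1)

features = torch.randn(8, 80)      # stand-in for the flattened conv output
logits, probas = head(features)

loss = logits.pow(2).mean()        # placeholder loss, for illustration only
optimizer.zero_grad()
loss.backward()
optimizer.step()

print(logits.shape, head.linear_1.weight.grad.shape, head.linear_1_bias.grad.shape)
```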