Initialize weights using the matrix multiplication result of two nn.Parameters

I have two matrices, $A \in R^{n \times m}$ and $B \in R^{m \times 1}$:

a = nn.Parameter(A, requires_grad=True)
b = nn.Parameter(B, requires_grad=True)

Is it possible to use the matrix multiplication result $AB = C \in R^{n \times 1}$ as the initial weights for an nn.Linear layer?

linear_weights = nn.Parameter(torch.matmul(a,b), requires_grad=True)
linear_layer.weight = linear_weights

Two more questions about the gradient updates for parameters $a$ and $b$:

  1. Will the gradient updates reach the parameters $a$ and $b$ with the above code?
  2. If I set requires_grad=False on parameter $a$ at the beginning, and then create linear_weights with requires_grad=True, will parameter $a$ still be frozen?

For your questions, the easiest way is to run a small test. Here is the code that supports the answers below.

import torch
class Model(torch.nn.Module) :
    def __init__(self) :
        super().__init__()
        torch.manual_seed(0)
        n, m = 1, 1
        A = torch.rand((n, m))
        B = torch.rand((m, 1))
        # AB ~ n x 1
        linear_layer = torch.nn.Linear(n, 1)

        flag = False
        if flag :
            # will raise this exception
            # TypeError: cannot assign 'torch.FloatTensor' as parameter 'weight' (torch.nn.Parameter or None expected)
            linear_layer.weight = torch.matmul(A,B)
        else :
            self.a = torch.nn.Parameter(A, requires_grad=True) # or A if you don't want to track the evolution of its value
            self.b = torch.nn.Parameter(B, requires_grad=True) # or B if you don't want to track the evolution of its value
            self.linear_weights = torch.nn.Parameter(torch.matmul(self.a,self.b), requires_grad=True)
            linear_layer.weight = self.linear_weights

        self.linear_layer = linear_layer
        
    def forward(self, x) :
        return self.linear_layer(x)  

def mae(y, y_pred) :
    """mean absolute error"""
    return (y - y_pred).abs().mean()


# data 
x = torch.tensor([1., 3, 3]).unsqueeze(dim=1) # unsqueeze to set batch_size to 3 ~~ [[1.], [3.], [3.]]
y = 2 * x
# model
model = Model()
# optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

model.state_dict()
"""
OrderedDict([('a', tensor([[0.4963]])),
             ('b', tensor([[0.7682]])),
             ('linear_weights', tensor([[0.3812]])),
             ('linear_layer.weight', tensor([[0.3812]])),
             ('linear_layer.bias', tensor([-0.7359]))])
"""

optimizer.zero_grad()
y_pred = model(x) # tensor([[-0.3547], [ 0.4078], [ 0.4078]], grad_fn=<AddmmBackward>)
loss = mae(y_pred, y) # tensor(4.5131, grad_fn=<MeanBackward0>)

loss.backward() 

optimizer.step()

model.state_dict()
"""
OrderedDict([('a', tensor([[0.4963]])), # no change
             ('b', tensor([[0.7682]])), # no change
             ('linear_weights', tensor([[0.3912]])), # change, because is the same variable as linear_layer.weight 
             ('linear_layer.weight', tensor([[0.3912]])), # change
             ('linear_layer.bias', tensor([-0.7259]))]) # change
"""
  • Is it possible to use the matrix multiplication result of A*B as the initial weights for the nn.Linear layer?
    No, AxB is still a plain tensor; if you assign it directly you will get an error like TypeError: cannot assign 'torch.FloatTensor' as parameter 'weight' (torch.nn.Parameter or None expected). You have to wrap it in nn.Parameter first, as in the else branch above.

  • Will the gradient updates reach the parameters a and b with the above code?
    No, they are not involved in the computation graph.
    linear_weights, the result of their multiplication, is a separate, independent variable; it enters the graph through linear_layer.weight, with which it shares the same memory (I prefer to say that it is the same variable as linear_layer.weight).
    If you want a and b to change, do the multiplication in forward instead: return torch.matmul(self.a, self.b) * x (that way they take part in the computation graph directly, unlike above; see the sketch at the end of this answer).
    Then, after the forward and backward pass, you will have:

model.state_dict()
"""
OrderedDict([('a', tensor([[0.5063]])), # change
             ('b', tensor([[0.7782]])), # change
             ('linear_weights', tensor([[0.3812]])), # no change
             ('linear_layer.weight', tensor([[0.3812]])), # no change
             ('linear_layer.bias', tensor([-0.7359]))]) # no change
"""
  • If I set requires_grad=False on parameter a at the beginning, and then create linear_weights with requires_grad=True, will parameter a still be frozen?
    Yes (see the code below).
n, m = 1, 1
A = torch.rand((n, m))
B = torch.rand((m, 1))
a = torch.nn.Parameter(A, requires_grad=False) # or A
b = torch.nn.Parameter(B, requires_grad=True) # or B
linear_weights = torch.nn.Parameter(torch.matmul(a,b), requires_grad=True)

a
"""
Parameter containing:
tensor([[0.3074]])
"""

linear_weights
"""
Parameter containing:
tensor([[0.1949]], requires_grad=True)
"""

What if I want to assign to a Conv2d layer’s weights instead of an nn.Linear layer? Is it still possible to update a and b?

As you said, in the linear layer we can apply torch.matmul in forward().
What about the conv2d operator: if I initialize its weight with torch.matmul(a, b), can I do something similar?

If you want a parameter to be updated, make sure that it is of type nn.Parameter, that its requires_grad attribute is True, and that it is part of the computation graph.
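
As a quick, generic sanity check (just a sketch; model stands for whichever module you are training, and loss for your loss tensor):

for name, p in model.named_parameters():
    print(name, p.requires_grad)  # must be True for the parameter to be trainable

loss.backward()
for name, p in model.named_parameters():
    # p.grad stays None if the parameter never took part in the computation graph
    print(name, p.grad is not None)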


Thank you @pascal_notsawo,

Is it possible for me to initialize the weight of a 1x1 pointwise nn.Conv2d kernel using the torch.matmul(a, b) result? I want nn.Parameter(a) and nn.Parameter(b) to be updated too.

So after loss.backward() and optimizer.step(), I expect torch.matmul(a’, b’), where a’ and b’ are the updated nn.Parameters, to have the same value as the updated 1x1 pointwise nn.Conv2d.weight.

You are welcome @YJHuang

It gets a little complicated.
If you look at the forward method of nn.Conv2d, you will notice this:

return self._conv_forward(input, self.weight, self.bias)

So try to inherit from the nn.Conv2d class and override the forward method, replacing self.weight with torch.matmul(self.a, self.b), then observe the parameter values as I did above.

I expect this will work, because going any further down we run into C++: PyTorch’s basic convolution operations are implemented in C++ (pytorch/Convolution.cpp at 1465970a343e61f2f2b104859ca7f5d7e03f5d02 · pytorch/pytorch · GitHub). So if it doesn’t work, one of the only alternatives is to redo the forward method from scratch in Python; by googling you can find stable implementations.
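
A minimal sketch of that idea (the class name and the intermediate dimension r are placeholders of mine; a full, tested example follows in the next reply):

class FactorizedConv2d(torch.nn.Conv2d):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        out_c, in_c, kh, kw = self.weight.shape
        r = 4  # arbitrary intermediate dimension
        self.a = torch.nn.Parameter(torch.rand(out_c, in_c, kh, r))
        self.b = torch.nn.Parameter(torch.rand(r, kw))

    def forward(self, input):
        # build the kernel on the fly so that a and b stay in the computation graph
        return self._conv_forward(input, torch.matmul(self.a, self.b), self.bias)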


The example you provided is very easy to follow, and it makes it easy to observe the parameter values. I will also check these links and try to inherit from the nn.Conv2d class.
Thank you!!

It seems that torch.matmul() returns a plain tensor. So after I compute torch.matmul(nn.Parameter(a), nn.Parameter(b)) and assign the result to nn.Conv2d.weight, it ends up as a copy. Is there any way to perform an operation like torch.matmul that also returns an nn.Parameter?

In my case it works.
Look at the following code, in particular the values of a and b before and after the forward/backward pass plus the optimizer step (I extracted some of these snippets directly from one of my text classification problems).

import torch
from torch import Tensor
import torch.nn.functional as F
from torch.nn.common_types import _size_2_t

class Conv2d(torch.nn.Conv2d) :
    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        kernel_size: _size_2_t,
        stride: _size_2_t = 1,
        padding: _size_2_t = 0,
        dilation: _size_2_t = 1,
        groups: int = 1,
        bias: bool = True,
        padding_mode: str = 'zeros' 
    ) :
        super(Conv2d, self).__init__(
                    in_channels = in_channels,
                    out_channels = out_channels,
                    kernel_size = kernel_size,
                    stride = stride,
                    padding = padding,
                    dilation = dilation,
                    groups = groups,
                    bias = bias,
                    padding_mode = padding_mode
        )
        # self.weight.shape ~ n_filters x in_channels x filter_size x emb_dim
        filter_size, emb_dim = kernel_size
        assert self.weight.shape == torch.Size([n_filters, in_channels, filter_size, emb_dim]) # n_filters x in_channels x filter_size x emb_dim (n_filters is read from the script scope below, == out_channels)
        
        # Here I just want to make sure that AxB has the same dimensions as self.weight (It's up to you to make sure that it is the same on your side)
        torch.manual_seed(0) 
        intermediate_dim = 7
        A = torch.rand((n_filters, in_channels, filter_size, intermediate_dim))
        B = torch.rand((intermediate_dim, emb_dim))
        # AxB ~ n_filters x in_channels x filter_size x emb_dim
        self.a = torch.nn.Parameter(A, requires_grad=True) 
        self.b = torch.nn.Parameter(B, requires_grad=True)
        # self.weight = torch.matmul(self.a, self.b)
    def forward(self, input: Tensor) -> Tensor:
        #return self._conv_forward(input, self.weight, self.bias)
        return self._conv_forward(input, torch.matmul(self.a, self.b), self.bias) 


## Models
in_channels = 1
n_filters = 2
emb_dim = 7
filter_size = 3
model = Conv2d(in_channels = in_channels, out_channels = n_filters, kernel_size = (filter_size, emb_dim)) 

n_labels = 3
pred_layer = torch.nn.Linear(n_filters, n_labels)

## Data
torch.manual_seed(0) 
bs, slen = 5, 6
x = torch.rand((bs, slen, emb_dim))
y = torch.empty(bs, dtype=torch.long).random_(n_labels)

# optimizer
#optimizer = torch.optim.Adam(list(model.parameters()) + list(pred_layer.parameters()), lr=1e-2)
# The update of the parameters of the classification layer is not of interest to us here
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2) 

model.state_dict()
"""
OrderedDict([('weight',
              tensor([[[[-0.0464,  0.0466, -0.1422, -0.0112,  0.1562, -0.0224,  0.0061],
                        [-0.0188,  0.0442,  0.1388,  0.2067,  0.1386,  0.2072, -0.0158],
                        [-0.1960, -0.1035,  0.1486, -0.0014, -0.1085, -0.1672, -0.2042]]],
              
              
                      [[[-0.1842, -0.0443,  0.1197,  0.1180, -0.2105,  0.1361, -0.1708],
                        [-0.0461, -0.0885, -0.0420, -0.0428, -0.1958, -0.1884, -0.0341],
                        [ 0.0028, -0.0991,  0.0822, -0.1964, -0.0147,  0.1919, -0.0890]]]])),
             ('bias', tensor([0.1971, 0.0790])),
             ('a',
              tensor([[[[0.4963, 0.7682, 0.0885, 0.1320, 0.3074, 0.6341, 0.4901],
                        [0.8964, 0.4556, 0.6323, 0.3489, 0.4017, 0.0223, 0.1689],
                        [0.2939, 0.5185, 0.6977, 0.8000, 0.1610, 0.2823, 0.6816]]],
              
              
                      [[[0.9152, 0.3971, 0.8742, 0.4194, 0.5529, 0.9527, 0.0362],
                        [0.1852, 0.3734, 0.3051, 0.9320, 0.1759, 0.2698, 0.1507],
                        [0.0317, 0.2081, 0.9298, 0.7231, 0.7423, 0.5263, 0.2437]]]])),
             ('b',
              tensor([[0.5846, 0.0332, 0.1387, 0.2422, 0.8155, 0.7932, 0.2783],
                      [0.4820, 0.8198, 0.9971, 0.6984, 0.5675, 0.8352, 0.2056],
                      [0.5932, 0.1123, 0.1535, 0.2417, 0.7262, 0.7011, 0.2038],
                      [0.6511, 0.7745, 0.4369, 0.5191, 0.6159, 0.8102, 0.9801],
                      [0.1147, 0.3168, 0.6965, 0.9143, 0.9351, 0.9412, 0.5995],
                      [0.0652, 0.5460, 0.1872, 0.0340, 0.9442, 0.8802, 0.0012],
                      [0.5936, 0.4158, 0.4177, 0.2711, 0.6923, 0.2038, 0.6833]]))])
"""

## Zero all the gradients
optimizer.zero_grad()


## Forward pass
x = x.unsqueeze(dim=1) # bs x 1 x slen x emb_dim
conved = model(x) # bs x n_filters x (slen - filter_size + 1) x 1, see https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html
conved = F.relu(conved).squeeze(3) # bs x n_filters x (slen - filter_size + 1)
pooled = F.max_pool1d(conved, conved.shape[2]).squeeze(2) # bs x n_filters
pooled = F.dropout(pooled, p = 0.1) # bs x n_filters
y_pred = pred_layer(pooled)

## Loss and backward pass
loss = F.cross_entropy(input=y_pred, target=y) # tensor(13.4064, grad_fn=<NllLossBackward>)
loss.backward() 


## optimizer gradient step
optimizer.step()

model.state_dict()

"""
OrderedDict([('weight',
              tensor([[[[-0.0464,  0.0466, -0.1422, -0.0112,  0.1562, -0.0224,  0.0061],
                        [-0.0188,  0.0442,  0.1388,  0.2067,  0.1386,  0.2072, -0.0158],
                        [-0.1960, -0.1035,  0.1486, -0.0014, -0.1085, -0.1672, -0.2042]]],
              
              
                      [[[-0.1842, -0.0443,  0.1197,  0.1180, -0.2105,  0.1361, -0.1708],
                        [-0.0461, -0.0885, -0.0420, -0.0428, -0.1958, -0.1884, -0.0341],
                        [ 0.0028, -0.0991,  0.0822, -0.1964, -0.0147,  0.1919, -0.0890]]]])),
             ('bias', tensor([0.1871, 0.0690])),
             ('a',
              tensor([[[[0.4863, 0.7582, 0.0785, 0.1220, 0.2974, 0.6241, 0.4801],
                        [0.8864, 0.4456, 0.6223, 0.3389, 0.3917, 0.0123, 0.1589],
                        [0.2839, 0.5085, 0.6877, 0.7900, 0.1510, 0.2723, 0.6716]]],
              
              
                      [[[0.9052, 0.3871, 0.8642, 0.4094, 0.5429, 0.9427, 0.0262],
                        [0.1752, 0.3634, 0.2951, 0.9220, 0.1659, 0.2598, 0.1407],
                        [0.0217, 0.1981, 0.9198, 0.7131, 0.7323, 0.5163, 0.2337]]]])),
             ('b',
              tensor([[ 0.5746,  0.0232,  0.1287,  0.2322,  0.8055,  0.7832,  0.2683],
                      [ 0.4720,  0.8098,  0.9871,  0.6884,  0.5575,  0.8252,  0.1956],
                      [ 0.5832,  0.1023,  0.1435,  0.2317,  0.7162,  0.6911,  0.1938],
                      [ 0.6411,  0.7645,  0.4269,  0.5091,  0.6059,  0.8002,  0.9701],
                      [ 0.1047,  0.3068,  0.6865,  0.9043,  0.9251,  0.9312,  0.5895],
                      [ 0.0552,  0.5360,  0.1772,  0.0240,  0.9342,  0.8702, -0.0088],
                      [ 0.5836,  0.4058,  0.4077,  0.2611,  0.6823,  0.1938,  0.6733]]))])
"""

a and b have indeed changed.
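
And coming back to the expectation in your earlier post: with this forward, the effective convolution kernel is simply the product of the updated factors, while the registered weight parameter stays at its random initial value because forward never reads it.

with torch.no_grad():
    effective_kernel = torch.matmul(model.a, model.b)  # the kernel the convolution actually uses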


Thanks, it works.
I realize that my forward() function was defined as:

def forward(self, input: Tensor) -> Tensor:
        self.weight = torch.matmul(self.a, self.b)
        return self._conv_forward(input, self.weight, self.bias) 
        #return self._conv_forward(input, torch.matmul(self.a, self.b), self.bias) 

and it raised this error message:

TypeError: cannot assign 'torch.FloatTensor' as parameter 'weight' (torch.nn.Parameter or None expected)

I don’t know what the difference is between the two forward functions. Only yours works! I really appreciate your help!!

I explained it above when answering your second question in my first post.
self.weight = torch.matmul(self.a, self.b) creates an entirely separate variable, and it is that variable which takes part in the computation graph, not a and b. If you still want to know how a and b would be updated, you can compute it yourself after the backward pass:

a.grad = dloss/da = dloss/dweight * dweight/da = weight.grad * f(b), with f(b) = b or b^T if I am not mistaken
a = optimizer(a, learning_rate, a.grad), for example a = a - learning_rate * a.grad
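
For instance, a quick way to check that chain rule with autograd on toy shapes (the shapes here are my own choice):

import torch

a = torch.rand(3, 4, requires_grad=True)
b = torch.rand(4, 2, requires_grad=True)
w = torch.matmul(a, b)   # w is what enters the graph
loss = w.sum()
loss.backward()

# dloss/dw is a matrix of ones here, so:
print(torch.allclose(a.grad, torch.ones_like(w) @ b.T))  # dloss/da = dloss/dw @ b^T -> True
print(torch.allclose(b.grad, a.T @ torch.ones_like(w)))  # dloss/db = a^T @ dloss/dw -> True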

With self.weight = torch.matmul(self.a, self.b), it’s a bit like saying to the computation graph: hello dear graph, don’t worry about a and b when backpropagating the gradient, they are now represented by self.weight. This fails because at that moment torch.matmul(self.a, self.b) is only a plain tensor, hence:

TypeError: cannot assign 'torch.FloatTensor' as parameter 'weight' (torch.nn.Parameter or None expected)

You can instead do as below in the forward function so that the error disappears, but then a and b won’t change any more; self.weight will change instead (again, because it is self.weight that takes part in the computation graph, and a and b no longer do):

self.weight = torch.nn.Parameter(torch.matmul(self.a, self.b), requires_grad=True)
return self._conv_forward(input, self.weight, self.bias)

Do it as below (without self) and you will see that a and b will change:

weight = torch.matmul(self.a, self.b)
return self._conv_forward(input, weight, self.bias) # equivalent to return self._conv_forward(input, torch.matmul(self.a, self.b), self.bias)

Here it works because weight (without self) is just a temporary local variable, nothing more; the graph does not even track it as a named parameter (if I’m not mistaken), but rather a and b. It does not prevent a and b from taking part in the graph.
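
You can see this directly on the Conv2d example above: after the backward pass, the gradients land on a and b, while the registered (but unused) weight parameter gets none.

print(model.a.grad is not None)   # True: a is a leaf of the computation graph
print(model.b.grad is not None)   # True: b is a leaf of the computation graph
print(model.weight.grad is None)  # True: weight never entered the forward pass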

:innocent: :innocent: This famous computation graph…
