Initialize weights using the matrix multiplication result from two nn.Parameter

I have two matrices (tensors), $A \in R^{n \times m}$ and $B \in R^{m \times 1}$.

a = nn.Parameter(A, requires_grad=True)
b = nn.Parameter(B, requires_grad=True)

Is it possible to use the result of the matrix multiplication A*B = C $\in R^{n \times 1}$ as the initial weights of an nn.Linear layer?

linear_weights = nn.Parameter(torch.matmul(a,b), requires_grad=True)
linear_layer.weight = linear_weights

I have two more questions about the gradient updates for parameters $a$ and $b$.

  1. Will the gradient update the parameters $a$ and $b$ with the code above?
  2. If I set requires_grad=False on Parameter $a$ at the beginning, and linear_weights is then created with requires_grad=True, will Parameter $a$ still be frozen?

For your questions, just run a few tests to get the answers. Here is the code that supports the answers below.

import torch
class Model(torch.nn.Module) :
    def __init__(self) :
        super().__init__() # required before assigning parameters or submodules
        n, m = 1, 1
        A = torch.rand((n, m))
        B = torch.rand((m, 1))
        # AB ~ n x 1
        linear_layer = torch.nn.Linear(n, 1)

        flag = False
        if flag :
            # will raise this exception
            # TypeError: cannot assign 'torch.FloatTensor' as parameter 'weight' (torch.nn.Parameter or None expected)
            linear_layer.weight = torch.matmul(A,B)
        else :
            self.a = torch.nn.Parameter(A, requires_grad=True) # or A if you don't want to track the evolution of its value
            self.b = torch.nn.Parameter(B, requires_grad=True) # or B if you don't want to track the evolution of its value
            self.linear_weights = torch.nn.Parameter(torch.matmul(self.a,self.b), requires_grad=True)
            linear_layer.weight = self.linear_weights

        self.linear_layer = linear_layer
    def forward(self, x) :
        return self.linear_layer(x)  

def mae(y, y_pred) :
    """mean absolute error"""
    return (y - y_pred).abs().mean()

# data 
x = torch.tensor([1., 3, 3]).unsqueeze(dim=1) # unsqueeze to set batch_size to 3 ~~ [[1.], [3.], [3.]]
y = 2 * x
# model
model = Model()
# optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
model.state_dict() # parameter values before the update

OrderedDict([('a', tensor([[0.4963]])),
             ('b', tensor([[0.7682]])),
             ('linear_weights', tensor([[0.3812]])),
             ('linear_layer.weight', tensor([[0.3812]])),
             ('linear_layer.bias', tensor([-0.7359]))])

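y_pred = model(x) # tensor([[-0.3547], [ 0.4078], [ 0.4078]], grad_fn=<AddmmBackward>)
loss = mae(y_pred, y) # tensor(4.5131, grad_fn=<MeanBackward0>)

optimizer.zero_grad()
loss.backward()
optimizer.step()
model.state_dict() # parameter values after one optimizer step
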
OrderedDict([('a', tensor([[0.4963]])), # no change
             ('b', tensor([[0.7682]])), # no change
             ('linear_weights', tensor([[0.3912]])), # change, because is the same variable as linear_layer.weight 
             ('linear_layer.weight', tensor([[0.3912]])), # change
             ('linear_layer.bias', tensor([-0.7259]))]) # change
  • Is it possible to use the matrix multiplication result from A*B as the initial weights for the nn.Linear layer?
    Not directly: A*B is still a plain tensor, so assigning it to linear_layer.weight raises TypeError: cannot assign 'torch.FloatTensor' as parameter 'weight' (torch.nn.Parameter or None expected). Wrapping the product in torch.nn.Parameter, as in the else branch of the code above, makes the assignment work.

  • Will the gradient update the parameters a and b with the code above?
    No, they are not involved in the computation graph.
    linear_weights, the result of their multiplication, is a separate variable in its own right. It enters the graph through linear_layer.weight, with which it shares the same memory address (I prefer to say that it is the same variable as linear_layer.weight).
    If you want a and b to change, do this in forward: return torch.matmul(self.a,self.b)*x (a and b then take part directly in the computation graph, unlike above).
    Then, after the forward and backward pass, you will have :

OrderedDict([('a', tensor([[0.5063]])), # change
             ('b', tensor([[0.7782]])), # change
             ('linear_weights', tensor([[0.3812]])), # no change
             ('linear_layer.weight', tensor([[0.3812]])), # no change
             ('linear_layer.bias', tensor([-0.7359]))]) # no change
  • If I set requires_grad=False on Parameter a at the beginning, and linear_weights is then created with requires_grad=True, will Parameter a still be frozen?
    Yes (see the code below)
n, m = 1, 1
A = torch.rand((n, m))
B = torch.rand((m, 1))
a = torch.nn.Parameter(A, requires_grad=False) # or A
b = torch.nn.Parameter(B, requires_grad=True) # or B
linear_weights = torch.nn.Parameter(torch.matmul(a,b), requires_grad=True)

Parameter containing:

Parameter containing:
tensor([[0.1949]], requires_grad=True)
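
The three answers above can be checked with a small sketch. It assumes a hypothetical module named FactoredLinear (not part of PyTorch, and not the code used earlier in this thread) that stores a and b and rebuilds the weight inside forward with F.linear, so the product stays in the computation graph. Here a is frozen to illustrate the last answer; the data and sizes are arbitrary.

```python
import torch
import torch.nn.functional as F

class FactoredLinear(torch.nn.Module):
    """Hypothetical layer: the weight is recomputed as (a @ b)^T on every call."""
    def __init__(self, n, m, freeze_a=False):
        super().__init__()
        self.a = torch.nn.Parameter(torch.rand(n, m), requires_grad=not freeze_a)
        self.b = torch.nn.Parameter(torch.rand(m, 1), requires_grad=True)
        self.bias = torch.nn.Parameter(torch.zeros(1))

    def forward(self, x):
        # a @ b is n x 1; F.linear expects out_features x in_features = 1 x n
        weight = torch.matmul(self.a, self.b).t()
        return F.linear(x, weight, self.bias)

def one_step(model, x, y):
    """Run one forward / backward / optimizer step."""
    # only hand the trainable parameters to the optimizer
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(params, lr=1e-2)
    optimizer.zero_grad()
    loss = (model(x) - y).abs().mean()
    loss.backward()
    optimizer.step()

torch.manual_seed(0)
x = torch.rand(3, 4)
y = 2 * x.sum(dim=1, keepdim=True)

model = FactoredLinear(n=4, m=2, freeze_a=True)
a_before = model.a.detach().clone()
b_before = model.b.detach().clone()
one_step(model, x, y)

a_frozen = torch.equal(model.a, a_before)       # a has requires_grad=False
b_updated = not torch.equal(model.b, b_before)  # b takes part in the graph
print(a_frozen, b_updated)
```

Building the model with freeze_a=False instead should let both a and b move, since both then take part in the graph.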

What if I want to assign the result to the weights of a Conv2d layer instead of an nn.Linear layer? Is it still possible to update a and b?

As you said, for a linear layer we can apply torch.matmul in forward().
What about a conv2d operator whose weight is initialized with torch.matmul(a,b)? Can I do something similar?

If you want a parameter to be updated, make sure that it is of type nn.Parameter, that its attribute requires_grad is equal to True and that it is part of the computation graph.
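
A quick way to check these conditions is to look at which parameters actually receive a gradient after backward. The sketch below uses a hypothetical helper, report_trainable, and a toy module invented for illustration; neither comes from PyTorch or from the code in this thread.

```python
import torch

def report_trainable(module, loss):
    """Hypothetical helper: backpropagate `loss` and report, per parameter,
    (requires_grad, received a gradient)."""
    module.zero_grad()
    loss.backward()
    return {name: (p.requires_grad, p.grad is not None)
            for name, p in module.named_parameters()}

class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.used = torch.nn.Parameter(torch.rand(2, 2))
        self.unused = torch.nn.Parameter(torch.rand(2, 2))  # never used in forward
        self.frozen = torch.nn.Parameter(torch.rand(2, 2), requires_grad=False)

    def forward(self, x):
        return torch.matmul(x, self.used) + torch.matmul(x, self.frozen)

toy = Toy()
status = report_trainable(toy, toy(torch.rand(3, 2)).sum())
for name, (requires_grad, got_grad) in status.items():
    print(name, requires_grad, got_grad)
```

Only a parameter that requires grad and is used in forward ends up with a gradient, and only such a parameter can be moved by the optimizer.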


Thank you @pascal_notsawo,

Is it possible to initialize the weight of a 1x1 pointwise nn.Conv2d kernel with the result of torch.matmul(a,b)? I want nn.Parameter(a) and nn.Parameter(b) to be updated too.

So after loss.backward() and optimizer.step(), I expect torch.matmul(a’, b’), where a’ and b’ are the updated nn.Parameters, to have the same value as the updated 1x1 pointwise nn.Conv2d.weight.

You are welcome @YJHuang

It gets a little complicated.
If you look at the forward method of nn.Conv2d, you will notice this:

return self._conv_forward(input, self.weight, self.bias)

So try subclassing nn.Conv2d and overriding the forward method, replacing self.weight with torch.matmul(self.a,self.b), then observe the parameter values as I did above.

I expect it will work. Going any further down than that, we run into C++: the basic convolution operations of PyTorch are implemented in C++ (pytorch/Convolution.cpp at 1465970a343e61f2f2b104859ca7f5d7e03f5d02 · pytorch/pytorch · GitHub). So if it doesn’t work, one of the only alternatives is to write the whole forward method from scratch in Python; by googling you can find stable implementations.


The example you provided is very easy to follow and makes it easy to observe the parameter values. I will also check these links and try subclassing nn.Conv2d.
Thank you!!

It seems that torch.matmul() returns a plain tensor. So after I compute torch.matmul(nn.Parameter(a), nn.Parameter(b)) and assign its value to nn.Conv2d.weight, it makes a copy. Is there a way to perform an operation like torch.matmul that returns an nn.Parameter as well?

In my case it works.
Look at the following code, mainly the values of a and b before and after the forward-backward pass and the optimizer step (I extracted these snippets directly from one of my text classification problems).

import torch
from torch import Tensor
import torch.nn.functional as F
from torch.nn.common_types import _size_2_t

class Conv2d(torch.nn.Conv2d) :
    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        kernel_size: _size_2_t,
        stride: _size_2_t = 1,
        padding: _size_2_t = 0,
        dilation: _size_2_t = 1,
        groups: int = 1,
        bias: bool = True,
        padding_mode: str = 'zeros' 
    ) :
        super(Conv2d, self).__init__(
                    in_channels = in_channels,
                    out_channels = out_channels,
                    kernel_size = kernel_size,
                    stride = stride,
                    padding = padding,
                    dilation = dilation,
                    groups = groups,
                    bias = bias,
                    padding_mode = padding_mode
        )
        # self.weight.shape ~ n_filters x in_channels x filter_size x emb_dim
        filter_size, emb_dim = kernel_size
        n_filters = out_channels # one filter per output channel
        assert self.weight.shape == torch.Size([n_filters, in_channels, filter_size, emb_dim]) # n_filters x in_channels x filter_size x emb_dim
        # Here I just want to make sure that AxB has the same dimensions as self.weight (it's up to you to make sure that it is the same on your side)
        intermediate_dim = 7
        A = torch.rand((n_filters, in_channels, filter_size, intermediate_dim))
        B = torch.rand((intermediate_dim, emb_dim))
        # AxB ~ n_filters x in_channels x filter_size x emb_dim)
        self.a = torch.nn.Parameter(A, requires_grad=True) 
        self.b = torch.nn.Parameter(B, requires_grad=True)
        # self.weight = torch.matmul(self.a, self.b)
    def forward(self, input: Tensor) -> Tensor:
        #return self._conv_forward(input, self.weight, self.bias)
        return self._conv_forward(input, torch.matmul(self.a, self.b), self.bias) 

## Models
in_channels = 1
n_filters = 2
emb_dim = 7
filter_size = 3
model = Conv2d(in_channels = in_channels, out_channels = n_filters, kernel_size = (filter_size, emb_dim)) 

n_labels = 3
pred_layer = torch.nn.Linear(n_filters, n_labels)

## Data
bs, slen = 5, 6
x = torch.rand((bs, slen, emb_dim))
y = torch.empty(bs, dtype=torch.long).random_(n_labels)

# optimizer
#optimizer = torch.optim.Adam(list(model.parameters()) + list(pred_layer.parameters()), lr=1e-2)
# The update of the parameters of the classification layer is not of interest to us here
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2) 

model.state_dict() # parameter values before the update
OrderedDict([('weight',
              tensor([[[[-0.0464,  0.0466, -0.1422, -0.0112,  0.1562, -0.0224,  0.0061],
                        [-0.0188,  0.0442,  0.1388,  0.2067,  0.1386,  0.2072, -0.0158],
                        [-0.1960, -0.1035,  0.1486, -0.0014, -0.1085, -0.1672, -0.2042]]],
                      [[[-0.1842, -0.0443,  0.1197,  0.1180, -0.2105,  0.1361, -0.1708],
                        [-0.0461, -0.0885, -0.0420, -0.0428, -0.1958, -0.1884, -0.0341],
                        [ 0.0028, -0.0991,  0.0822, -0.1964, -0.0147,  0.1919, -0.0890]]]])),
             ('bias', tensor([0.1971, 0.0790])),
             ('a',
              tensor([[[[0.4963, 0.7682, 0.0885, 0.1320, 0.3074, 0.6341, 0.4901],
                        [0.8964, 0.4556, 0.6323, 0.3489, 0.4017, 0.0223, 0.1689],
                        [0.2939, 0.5185, 0.6977, 0.8000, 0.1610, 0.2823, 0.6816]]],
                      [[[0.9152, 0.3971, 0.8742, 0.4194, 0.5529, 0.9527, 0.0362],
                        [0.1852, 0.3734, 0.3051, 0.9320, 0.1759, 0.2698, 0.1507],
                        [0.0317, 0.2081, 0.9298, 0.7231, 0.7423, 0.5263, 0.2437]]]])),
             ('b',
              tensor([[0.5846, 0.0332, 0.1387, 0.2422, 0.8155, 0.7932, 0.2783],
                      [0.4820, 0.8198, 0.9971, 0.6984, 0.5675, 0.8352, 0.2056],
                      [0.5932, 0.1123, 0.1535, 0.2417, 0.7262, 0.7011, 0.2038],
                      [0.6511, 0.7745, 0.4369, 0.5191, 0.6159, 0.8102, 0.9801],
                      [0.1147, 0.3168, 0.6965, 0.9143, 0.9351, 0.9412, 0.5995],
                      [0.0652, 0.5460, 0.1872, 0.0340, 0.9442, 0.8802, 0.0012],
                      [0.5936, 0.4158, 0.4177, 0.2711, 0.6923, 0.2038, 0.6833]]))])

## Zero all the gradients
optimizer.zero_grad()

## Forward pass
x = x.unsqueeze(dim=1) # bs x 1 x slen x emb_dim
conved = model(x) # bs x n_filters x (slen - filter_size + 1) x 1
conved = F.relu(conved).squeeze(3) # bs x n_filters x (slen - filter_size + 1)
pooled = F.max_pool1d(conved, conved.shape[2]).squeeze(2) # bs x n_filters
pooled = F.dropout(pooled, p = 0.1) # bs x n_filters
y_pred = pred_layer(pooled)

## Loss and backward pass
loss = F.cross_entropy(input=y_pred, target=y) # tensor(13.4064, grad_fn=<NllLossBackward>)
loss.backward()

## optimizer gradient step
optimizer.step()


model.state_dict() # parameter values after the update
OrderedDict([('weight',
              tensor([[[[-0.0464,  0.0466, -0.1422, -0.0112,  0.1562, -0.0224,  0.0061],
                        [-0.0188,  0.0442,  0.1388,  0.2067,  0.1386,  0.2072, -0.0158],
                        [-0.1960, -0.1035,  0.1486, -0.0014, -0.1085, -0.1672, -0.2042]]],
                      [[[-0.1842, -0.0443,  0.1197,  0.1180, -0.2105,  0.1361, -0.1708],
                        [-0.0461, -0.0885, -0.0420, -0.0428, -0.1958, -0.1884, -0.0341],
                        [ 0.0028, -0.0991,  0.0822, -0.1964, -0.0147,  0.1919, -0.0890]]]])),
             ('bias', tensor([0.1871, 0.0690])),
             ('a',
              tensor([[[[0.4863, 0.7582, 0.0785, 0.1220, 0.2974, 0.6241, 0.4801],
                        [0.8864, 0.4456, 0.6223, 0.3389, 0.3917, 0.0123, 0.1589],
                        [0.2839, 0.5085, 0.6877, 0.7900, 0.1510, 0.2723, 0.6716]]],
                      [[[0.9052, 0.3871, 0.8642, 0.4094, 0.5429, 0.9427, 0.0262],
                        [0.1752, 0.3634, 0.2951, 0.9220, 0.1659, 0.2598, 0.1407],
                        [0.0217, 0.1981, 0.9198, 0.7131, 0.7323, 0.5163, 0.2337]]]])),
             ('b',
              tensor([[ 0.5746,  0.0232,  0.1287,  0.2322,  0.8055,  0.7832,  0.2683],
                      [ 0.4720,  0.8098,  0.9871,  0.6884,  0.5575,  0.8252,  0.1956],
                      [ 0.5832,  0.1023,  0.1435,  0.2317,  0.7162,  0.6911,  0.1938],
                      [ 0.6411,  0.7645,  0.4269,  0.5091,  0.6059,  0.8002,  0.9701],
                      [ 0.1047,  0.3068,  0.6865,  0.9043,  0.9251,  0.9312,  0.5895],
                      [ 0.0552,  0.5360,  0.1772,  0.0240,  0.9342,  0.8702, -0.0088],
                      [ 0.5836,  0.4058,  0.4077,  0.2611,  0.6823,  0.1938,  0.6733]]))])

a and b have indeed changed.
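
For the 1x1 pointwise case asked about earlier, the same idea can be sketched more compactly. This is only an illustration under stated assumptions: FactoredPointwiseConv, its rank argument and the effective_weight method are hypothetical names, the layer has no bias, and it calls F.conv2d directly instead of subclassing nn.Conv2d.

```python
import torch
import torch.nn.functional as F

class FactoredPointwiseConv(torch.nn.Module):
    """Hypothetical 1x1 convolution whose kernel is a @ b, rebuilt on every call."""
    def __init__(self, in_channels, out_channels, rank):
        super().__init__()
        self.a = torch.nn.Parameter(torch.rand(out_channels, rank))
        self.b = torch.nn.Parameter(torch.rand(rank, in_channels))

    def effective_weight(self):
        # out_channels x in_channels -> out_channels x in_channels x 1 x 1
        w = torch.matmul(self.a, self.b)
        return w.view(w.shape[0], w.shape[1], 1, 1)

    def forward(self, x):
        return F.conv2d(x, self.effective_weight())

torch.manual_seed(0)
conv = FactoredPointwiseConv(in_channels=3, out_channels=4, rank=2)
optimizer = torch.optim.SGD(conv.parameters(), lr=1e-2)
a_before = conv.a.detach().clone()
b_before = conv.b.detach().clone()

x = torch.rand(5, 3, 8, 8)
optimizer.zero_grad()
conv(x).mean().backward()
optimizer.step()

a_changed = not torch.equal(conv.a, a_before)
b_changed = not torch.equal(conv.b, b_before)
out_shape = tuple(conv(x).shape)
print(a_changed, b_changed, out_shape)
```

Because the kernel is recomputed from a and b on each call, the kernel used after optimizer.step() is torch.matmul(a’, b’) by construction, so there is no separate weight to keep in sync.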


Thanks, it works.
I realize that my forward() function was defined as:

def forward(self, input: Tensor) -> Tensor:
        self.weight = torch.matmul(self.a, self.b)
        return self._conv_forward(input, self.weight, self.bias) 
        #return self._conv_forward(input, torch.matmul(self.a, self.b), self.bias) 

and it returns an error message:

TypeError: cannot assign 'torch.FloatTensor' as parameter 'weight' (torch.nn.Parameter or None expected)

I don’t know what the difference between the two forward functions is. Only yours works! I really appreciate your help!!

I explained it above when answering your second question in my first post.
self.weight = torch.matmul(self.a, self.b) creates a whole other variable, and it is this variable that is involved in the computation graph, not a and b. But if you want to know how a and b would be updated, you can compute it yourself after the backward pass :

a.grad = dloss/da = dloss/dweight * dweight/da = torch.matmul(weight.grad, b^T), and similarly b.grad = torch.matmul(a^T, weight.grad) for 2-D a and b
a = optimizer(a, learning_rate, a.grad), for example a = a - learning_rate*a.grad 
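
The chain-rule expression above can be compared with what autograd computes. The sketch below assumes 2-D a and b and an arbitrary scalar loss chosen only for illustration; retain_grad() is used so that the gradient of the intermediate product is kept.

```python
import torch

torch.manual_seed(0)
a = torch.rand(4, 3, requires_grad=True)
b = torch.rand(3, 2, requires_grad=True)

# Path 1: let autograd differentiate through the product.
weight = torch.matmul(a, b)
weight.retain_grad()          # keep the gradient of this non-leaf tensor
loss = (weight ** 2).sum()    # arbitrary scalar loss, for illustration only
loss.backward()

# Path 2: apply the chain rule by hand, starting from weight.grad.
a_grad_manual = torch.matmul(weight.grad, b.detach().t())
b_grad_manual = torch.matmul(a.detach().t(), weight.grad)

a_matches = torch.allclose(a.grad, a_grad_manual)
b_matches = torch.allclose(b.grad, b_grad_manual)
print(a_matches, b_matches)
```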

With self.weight = torch.matmul(self.a, self.b), it’s a bit like saying to the computation graph: hello dear graph, don’t worry about a and b when backpropagating the gradient, they are now represented by self.weight. This fails because at this point torch.matmul(self.a, self.b) is only a normal tensor, hence :

TypeError: cannot assign 'torch.FloatTensor' as parameter 'weight' (torch.nn.Parameter or None expected)

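You can do as below in the forward function to make the error disappear, but then a and b won’t change anymore: the gradient goes to self.weight instead (I repeat again and again that this is because self.weight is what takes part in the computation graph, and a and b no longer do). Note also that this creates a new Parameter on every call, which an optimizer built beforehand from model.parameters() does not hold.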

self.weight = torch.nn.Parameter(torch.matmul(self.a, self.b), requires_grad=True)
return self._conv_forward(input, self.weight, self.bias)

Do it as below (without self) and you will see that a and b will change :

weight = torch.matmul(self.a, self.b)
return self._conv_forward(input, weight, self.bias) # equivalent to return self._conv_forward(input, torch.matmul(self.a, self.b), self.bias)

Here it works because weight (without self) is just a temporary variable, nothing more. It is not a new leaf of the graph (if I’m not mistaken); it is simply the result of an operation on a and b, so it doesn’t prevent a and b from taking part in the graph.
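
The difference between the two variants can be seen directly on the tensors. This is a minimal sketch with arbitrary shapes; the variable names are chosen only for illustration.

```python
import torch

a = torch.nn.Parameter(torch.rand(2, 3))
b = torch.nn.Parameter(torch.rand(3, 1))

product = torch.matmul(a, b)           # plain tensor that remembers a and b
wrapped = torch.nn.Parameter(product)  # new leaf: the history is dropped

product_has_history = product.grad_fn is not None
wrapped_is_leaf = wrapped.is_leaf and wrapped.grad_fn is None

# Backpropagating through the wrapped Parameter stops at the new leaf.
wrapped.sum().backward()
a_untouched = a.grad is None

# Backpropagating through the plain product reaches a and b.
product.sum().backward()
a_reached = a.grad is not None
print(product_has_history, wrapped_is_leaf, a_untouched, a_reached)
```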

:innocent: :innocent:This famous computation graph …
