How to set the diagonal entries of Linear layer weight matrix to be always negative?

I want to create a neural network with a single linear layer whose weight matrix has diagonal entries that are always negative (even during training). I have tried various approaches, but nothing seems to work. The closest I have gotten is the code below:

import torch
import torch.nn as nn

class IdentityMask(nn.Module):
    def __init__(self, n):
        super(IdentityMask, self).__init__()
        self.n = n
        # fixed mask: all ones, with -1 on the diagonal
        self.weight = nn.Parameter(torch.ones(n, n), requires_grad=False)
        torch.diagonal(self.weight).fill_(-1.0)

class LinearLayer(nn.Module):
    def __init__(self, n):
        super(LinearLayer, self).__init__()
        self.identity_mask = IdentityMask(n)
        self.linear = nn.Linear(n, n, bias=False)
        nn.init.normal_(self.linear.weight, mean=0.0, std=0.01)

    def forward(self, x):
        self.linear.weight.data *= self.identity_mask.weight
        x = self.linear(x)
        return x

What should I do to get the diagonal entries of self.linear.weight to be negative?

OK, I figured out what the problem is. The line self.linear.weight.data *= self.identity_mask.weight multiplies the diagonal entries by -1 on every forward pass, so their sign keeps flipping. What we actually want is to multiply a diagonal entry by -1 only when it turns positive. So I have modified the code as follows:

class LinearLayer(nn.Module):
    def __init__(self, n):
        super(LinearLayer, self).__init__()
        self.linear = nn.Linear(n, n, bias=False)
        nn.init.normal_(self.linear.weight, mean=0.0, std=0.01)

    def forward(self, x):
        diagonal = torch.diag(self.linear.weight)
        for i in range(diagonal.size(0)):
            if diagonal[i] > 0:
                diagonal[i] *= -1.0
                self.linear.weight.data[i][i] = diagonal[i]
        print("Diagonal entries:", diagonal)
        x = self.linear(x)
        return x

No need for a mask layer. This code works. But if there are any better ways to achieve my objective, please let me know.

Note that a more efficient, vectorized implementation of this is possible:

import torch

a = torch.randn(10, 10)

print(a.diagonal())
# flip any positive diagonal entries to negative (negative entries are unchanged)
a.diagonal().mul_(-a.diagonal().sign())
print(a.diagonal())

Also, you can use with torch.no_grad() instead of .data.
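
For example, here is a minimal sketch of the earlier LinearLayer with the flip done under torch.no_grad() instead of through .data (same idea, just without touching .data directly):

import torch
import torch.nn as nn

class LinearLayer(nn.Module):
    def __init__(self, n):
        super().__init__()
        self.linear = nn.Linear(n, n, bias=False)
        nn.init.normal_(self.linear.weight, mean=0.0, std=0.01)

    def forward(self, x):
        with torch.no_grad():
            d = self.linear.weight.diagonal()
            d.copy_(-d.abs())  # any positive diagonal entry becomes negative
        return self.linear(x)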

Finally, this looks like a good candidate for a Parametrization (torch.nn.utils.parametrize.register_parametrization — PyTorch 2.0 documentation) if you want to be able to do this without having to rewrite a Module by hand.
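
A minimal sketch of what that could look like (NegateDiagonal is a name I made up; the mapping is the same sign flip as above):

import torch
import torch.nn as nn
from torch.nn.utils import parametrize

# Hypothetical parametrization: the raw, unconstrained weight is stored by the
# module, and the weight actually used in forward() has a non-positive diagonal.
class NegateDiagonal(nn.Module):
    def forward(self, W):
        out = W.clone()
        out.diagonal().copy_(-W.diagonal().abs())
        return out

linear = nn.Linear(5, 5, bias=False)
parametrize.register_parametrization(linear, "weight", NegateDiagonal())

print(linear.weight.diagonal())  # never positive; gradients flow to the raw weight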

Hi Bala (and Alban)!

My intuition is that it is better to smoothly map an unconstrained trainable
parameter (that runs over (-inf, inf)) to a new tensor whose diagonal
is negative (and runs over (-inf, 0.0)), rather than brute-force flip the
sign of the diagonal. It is straightforward and conceptually satisfying to
train the unconstrained parameter and understand the negative-diagonal
weight as an intermediate result.

Suppose during training your optimizer moves a slightly negative diagonal
entry to a slightly positive value. You then flip it back to negative, but on
the next iteration the optimizer moves it back to a positive value. While
conceptually acceptable, it seems to me that this is likely to throw a
little bit of sand into the optimization process (and possibly confuse fancier
optimizers such as Adam).

-exp() is a well-behaved function that maps to strictly negative values.

Here is an illustration:

>>> import torch
>>> torch.__version__
'2.0.0'
>>>
>>> _ = torch.manual_seed (2023)
>>>
>>> preWeight = torch.randn (5, 5, requires_grad = True)   # unconstrained trainable parameter
>>> preWeight                                              # unconstrained diagonal -- can be positive
tensor([[ 0.4305, -0.3499,  0.4749,  0.9041, -0.7021],
        [ 1.5963,  0.4228, -0.6940,  0.9672, -0.5319],
        [ 0.8088, -0.1603,  0.8184, -0.6093,  0.8177],
        [ 0.1459, -0.9558, -1.3761,  1.3246, -0.0744],
        [ 0.5472,  1.6779,  0.8275, -1.0542, -0.7374]], requires_grad=True)
>>> weight = preWeight.clone()
>>> weight.diagonal().copy_ (-preWeight.diagonal().exp())
tensor([-1.5380, -1.5262, -2.2668, -3.7607, -0.4784],
       grad_fn=<AsStridedBackward0>)
>>> weight                                                 # derived weight tensor with negative diagonal
tensor([[-1.5380, -0.3499,  0.4749,  0.9041, -0.7021],
        [ 1.5963, -1.5262, -0.6940,  0.9672, -0.5319],
        [ 0.8088, -0.1603, -2.2668, -0.6093,  0.8177],
        [ 0.1459, -0.9558, -1.3761, -3.7607, -0.0744],
        [ 0.5472,  1.6779,  0.8275, -1.0542, -0.4784]], grad_fn=<CopySlices>)
>>> x = torch.randn (5, 5)
>>> (weight @ x).sum().backward()
>>> preWeight.grad                                         # gradients flow back to trainable parameter
tensor([[-1.2142,  0.9742,  3.3650, -1.4189,  2.5436],
        [ 0.7895, -1.4869,  3.3650, -1.4189,  2.5436],
        [ 0.7895,  0.9742, -7.6279, -1.4189,  2.5436],
        [ 0.7895,  0.9742,  3.3650,  5.3361,  2.5436],
        [ 0.7895,  0.9742,  3.3650, -1.4189, -1.2168]])

(Also, as Alban notes, you could register such a mapping as a
parametrization.)
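
As a sketch, the -exp() mapping could be registered that way along these lines (NegExpDiagonal is my own name):

import torch
import torch.nn as nn
from torch.nn.utils import parametrize

# Sketch: wrap the -exp() diagonal mapping as a parametrization.
class NegExpDiagonal(nn.Module):
    def forward(self, pre_weight):
        weight = pre_weight.clone()
        weight.diagonal().copy_(-pre_weight.diagonal().exp())  # strictly negative diagonal
        return weight

linear = nn.Linear(5, 5, bias=False)
parametrize.register_parametrization(linear, "weight", NegExpDiagonal())

# The optimizer now updates the unconstrained tensor stored at
# linear.parametrizations.weight.original, while linear.weight is recomputed
# (with a strictly negative diagonal) each time it is accessed.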

Best.

K. Frank


Hi,

I am revisiting this problem after a long time. I am essentially trying to follow the idea of using a smooth map ( -exp() ) applied to the diagonal elements, suggested by KFrank. But now I am getting a different error that seems to have to do with autograd and backpropagation. The error that I get is: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.

Here is the new code:

class NegDiagLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super(NegDiagLinear, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.pre_weight = nn.Parameter(torch.Tensor(out_features, in_features))
        nn.init.normal_(self.pre_weight, mean=0, std=0.01)
        self.weight = self.pre_weight.clone()

    def forward(self, input):
        self.weight.diagonal().copy_ (-self.pre_weight.diagonal().exp())
        return input @ self.weight

class simple_model(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = NegDiagLinear(2, 2)

    def forward(self, x):
        return self.linear(x)

model = simple_model()

define_criterion = torch.nn.MSELoss()

SGD_optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

for epoch in range(18):
    SGD_optimizer.zero_grad()
    predict_y = model(x)
    loss = define_criterion(predict_y, y)
    loss.backward()
    SGD_optimizer.step()
    print('epoch {}, loss function {}'.format(epoch, loss.item()))

My initial guess is that this is due to the clone() applied to self.pre_weight. So I dabbled around trying to see if I could get around this problem, but I can't seem to get rid of the error. Please help me solve this problem. If you could also provide an explanation of the cause of the error, that would be nice.

Hi Bala!

Your problem is that:

      self.weight = self.pre_weight.clone()

creates the part of the computation graph that connects the off-diagonal
elements of weight to pre_weight only once (in __init__()).

Then:

       self.weight.diagonal().copy_ (-self.pre_weight.diagonal().exp())

creates the computation graph for the diagonal elements every time you call
forward.

But your first call to .backward() frees the whole computation graph, including
that for the off-diagonal elements, and the off-diagonal piece is never rebuilt. So
your second call to .backward() raises the “backward through the graph a second
time” error.

Try:

    def forward(self, input):
        weight = self.pre_weight.clone()
        weight.diagonal().copy_ (-self.pre_weight.diagonal().exp())
        return input @ weight

forward() now rebuilds the whole computation graph – both the off-diagonal and
diagonal parts – every time it is called.

Note that weight is no longer a property of NegDiagLinear – it’s now just a
local variable in forward().
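
For reference, here is the corrected module as a self-contained sketch, with made-up random tensors standing in for the x and y that aren't shown in the thread:

import torch
import torch.nn as nn

class NegDiagLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.pre_weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.normal_(self.pre_weight, mean=0.0, std=0.01)

    def forward(self, input):
        weight = self.pre_weight.clone()                        # graph rebuilt every call
        weight.diagonal().copy_(-self.pre_weight.diagonal().exp())
        return input @ weight

model = NegDiagLinear(2, 2)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

x, y = torch.randn(8, 2), torch.randn(8, 2)                     # toy data (assumed)
for epoch in range(18):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()                                             # no retain_graph needed now
    optimizer.step()
    print('epoch {}, loss {}'.format(epoch, loss.item()))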

Best.

K. Frank