How to correctly impose a weight constraint

I have the following model:

X_i = max(c_1i X_1, …, c_di X_d) for i=1,…,d.

This leads to a d x d matrix C of parameters. For reasons that are tedious to explain, I estimate a d x d matrix of parameters, with

X^est_i = max(c^est_1i X_1, …, c^est_di X_d)

and choose my loss to be

max_{i=1,…,d}(X^est_i / X_i)

Observe that all values in C as well as X are non-negative. This problem should be very easy to solve, as it is convex in C (the loss is a maximum of linear functions of the entries of C) and monotone (reducing a value c_ji never increases the loss).
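To make the model and the loss concrete, here is a minimal NumPy sketch of the max-linear prediction and the max-ratio loss. It assumes an example matrix and lognormal data similar to the code further down; the helper names predict and max_ratio_loss are made up for this illustration. It also shows why some lower bound on the entries of C is needed: shrinking C lowers the loss, so without a constraint the minimizer is simply C = 0.

```python
import numpy as np

def predict(C_est, X):
    """X_est[n, i] = max_j C_est[j, i] * X[n, j] for every sample n."""
    # Broadcast to shape (n, d, d), then take the max over the source index j.
    return (X[:, :, None] * C_est[None, :, :]).max(axis=1)

def max_ratio_loss(C_est, X):
    """Largest ratio X_est / X over all samples and coordinates."""
    return float(np.max(predict(C_est, X) / X))

rng = np.random.default_rng(0)
C_true = np.array([[1.0, 0.5, 0.3],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])
Z = rng.lognormal(mean=0.0, sigma=1.0, size=(50, 3))
X = predict(C_true, Z)  # data generated from the same max-linear model

loss_true = max_ratio_loss(C_true, X)

# Reducing a single entry never increases the loss.
C_smaller = C_true.copy()
C_smaller[0, 1] = 0.25
loss_smaller = max_ratio_loss(C_smaller, X)

# The loss scales linearly with C, so an unconstrained optimizer can
# always improve it by shrinking every entry towards zero.
loss_scaled = max_ratio_loss(2.0 * C_true, X)
loss_half = max_ratio_loss(0.5 * C_true, X)

print(loss_true, loss_smaller, loss_scaled, loss_half)
```

For this particular C the true matrix gives a loss of 1, doubling C doubles the loss, and halving C halves it.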

Now I want to require the sum of all entries of C to be equal to (or larger than) a certain threshold; let us call it eps.

What is the best way to do this? I found three approaches, but all of them seem to fail: the estimator either gets stuck or goes to infinity. The three approaches I thought of:

  1. Define only d^2-1 values and assign the last value as torch.abs(eps-sum(C)), where sum(C) runs over the d^2-1 free values. The sum of all values then equals eps as long as the free values sum to at most eps.

  2. Define d^2 values and rescale them in each iteration by multiplying each weight by eps/sum(C).

  3. Add a penalty lambda*torch.abs(eps-sum(C)) to the loss; for lambda large enough, this should force the sum of the values to be eps.
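The three approaches above can be written down in a few lines. The following is only a NumPy sketch of the arithmetic, not of the optimization itself; the function names are hypothetical and eps = 3.8 is chosen to match lambda1 in the code further down. It illustrates one caveat of approach 1: the derived entry brings the total to eps only while the free values sum to at most eps.

```python
import numpy as np

eps = 3.8  # target sum (assumed value, same as lambda1 = 3 + 0.5 + 0.3)

def last_entry_from_rest(free_values, eps):
    """Approach 1: derive the remaining entry from the d^2 - 1 free ones."""
    return float(abs(eps - np.sum(free_values)))

def rescale_to_sum(C, eps):
    """Approach 2: multiply every entry by eps / sum(C)."""
    return C * (eps / np.sum(C))

def sum_penalty(C, eps, lam):
    """Approach 3: penalty term that would be added to the loss."""
    return float(lam * abs(eps - np.sum(C)))

# Approach 1 when the free values sum to less than eps: total equals eps.
free_ok = np.array([1.0, 0.5, 0.3, 0.2, 1.0, 0.0, 0.0, 0.0])  # sums to 3.0
total_ok = free_ok.sum() + last_entry_from_rest(free_ok, eps)

# Approach 1 when the free values sum to more than eps: total overshoots,
# because the absolute value adds the excess instead of removing it.
free_big = np.full(8, 0.625)                                   # sums to 5.0
total_big = free_big.sum() + last_entry_from_rest(free_big, eps)

# Approach 2 always lands exactly on eps (for a positive sum).
C_ones = np.ones((3, 3))
total_rescaled = rescale_to_sum(C_ones, eps).sum()

print(total_ok, total_big, total_rescaled, sum_penalty(C_ones, eps, 10.0))
```

In the second case the total is 2*5.0 - 3.8 = 6.2 rather than 3.8, so the constraint is not enforced once the free weights grow past eps.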

I tried all three approaches, and each of them either gets stuck or diverges to infinity.

Below I implemented the first approach:

import copy

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

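class Network(nn.Module):
    def __init__(self, dim):
        super().__init__()

        d=dim
        # d-1 bias-free layers with d weights each hold rows 0..d-2 of C.
        self.linears = nn.ModuleList([nn.Linear(1, d, bias=False) for i in range(d-1)])
        # The last row has only d-1 free weights; its first entry is derived in forward().
        self.final_layer = nn.Linear(1, (d-1), bias=False)
        self.dim = dim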

    def forward(self, x): 
        d = self.dim
        y=torch.zeros((d,d,x.size()[0]))


        for i, l in enumerate(self.linears):
            y[i,:,:] = torch.transpose(l(x[:,i].view(-1,1)),0,1)
        
        y[d-1,1:d,:] = torch.transpose(self.final_layer(x[:,d-1].view(-1,1)),0,1)
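        reg=self.weight_constraint()

        # Derived entry C[d-1,0] = |sum of free weights - lambda1|. lambda1 is the
        # module-level target sum (eps in the text) and must be defined before forward() runs.
        y[d-1,0,:]=torch.abs(reg-lambda1)*x[:,d-1]
        # The maximum over the source index gives X^est.
        y=torch.max(y, dim=0).values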
        
        return torch.transpose(y,0,1)
    
    
        
    def weight_constraint(self):
        # Sum of all d^2-1 free weights.
        reg=0

        for l in self.linears:
            reg+=torch.sum(l.weight)
        reg+=torch.sum(self.final_layer.weight)

        return reg

        
    
    
def custom_loss(output, target):
    loss = torch.max(output/target)
    return loss


np.random.seed(seed=1)
torch.manual_seed(1)

d=3
n=100
model = Network(dim=d)

C=np.array([[1,0.5,0.3],[0,1,0],[0,0,1]])


Z=np.random.lognormal( 0, 3, size=(n,d))

X=np.zeros((n,d))

for i in range(n):
    for j in range(d):
        X[i,j]=np.max(C[:,j]*Z[i,:])
        


lambda1=3+0.5+0.3

optimizer = optim.LBFGS(model.parameters(), lr=0.06)

for t in range(100000):

    
    def closure():
        optimizer.zero_grad()
        x_pred = model(torch.Tensor(X))
        loss  = custom_loss(x_pred, torch.Tensor(X))
        loss.backward()

        # Clamp the weights back to non-negative values after the backward pass.
        with torch.no_grad():
            for layer in model.linears:
                layer.weight.clamp_(min=0)
            model.final_layer.weight.clamp_(min=0)

        print(loss)
        return loss

    optimizer.step(closure)
    
#Testing if true C matrix indeed gives lower penalty
model2=copy.deepcopy(model)

for j,l in enumerate(model2.linears):
    for i in range(d):
        with torch.no_grad():
            l.weight[i]=C[j,i]
for i in range(d-1):           
     with torch.no_grad():
         model2.final_layer.weight[i]=C[d-1,i+1]

x_pred=model2(torch.Tensor(X))
loss  = custom_loss(x_pred, torch.Tensor(X))
print("Loss Value for the true C Matrix: ", loss)

You can see that the loss for the true C matrix is just 1, but the minimum the optimizer finds is far away from 1. Let me quickly explain what I am doing.

I define (d-1) layers of size d and one layer of size (d-1), so I have exactly d^2-1 variables.

The forward function then calculates X^est from these layers, and weight_constraint() calculates the sum of all the layer weights.

The rest should be basic: I generate data and run the PyTorch optimizer. At the end, I check that I set up the forward function and the network correctly by plugging in the true C values, which indeed give a loss of 1.

Any idea how I can properly set up this weight constraint, or is it impossible?