Linear constraints on trainable network parameters

Mathieu · February 6, 2025, 9:23pm

I am defining additional trainable parameters (besides the weights and biases) as nn.ParameterDict(...). Lets denote these additional parameters as X.

During training, the linear inequality constraint A*X > 0 must be satisfied at every epoch. Otherwise, I may not be able to calculate a loss.

Are there built-in solution approaches to handle this? I am using the Adam optimizer, but could also switch to a different one.
(Note: I would like to avoid introducing a loss term for the constraint vialation since I already have some loss terms which I have to weigh appropriately)

KFrank · February 7, 2025, 6:18pm

Hi Mathieu!

Do you really mean “epoch” here? If you really need your constraint to be satisfied
in order to compute (and then presumably to backpropagate) your loss, I would think
that you would need your constraint to be satisfied for each forward / backward pass
which would typically mean for every batch (not only for an entire epoch).

I am not aware of anything built in for your use case.

I would suggest “projecting” X back to where it satisfies the constraints after
every optimizer step.

I assume that this means A @ X > 0 (that is, matrix multiplication) and that A has
shape [m, n] with m < n (that is, there are fewer individual constraints than there
are elements of X).

The constraint violation is V = (A @ X).min (torch.zeros (1)). Then:

X_new = X - torch.linalg.lstsq (A, V.unsqueeze (-1)).solution.squeeze()

will be the closest X_new to X that satisfies A @ X_new >= 0.

Note that if you require that A @ X be strictly greater than zero, not just greater
than or equal to zero (up to some round-off error), you will have to replace your
constraint with A @ X > eps, for some small, positive eps.

If I understand your use case, adding a loss term to impose the constraint – whether
as a violation penalty or a Lagrange multiplier – won’t work in any event, as the loss
term will drive X towards satisfying the constraint as the optimization proceeds, but
will not cause the constraint to be satisfied step-by-step during the optimization. So
if you really need the constraint to be satisfied in order to calculate your loss, you must
reimpose the constraint before each forward / backward pass, which is to say, after
each optimization step.

Best.

K. Frank

Mathieu · February 8, 2025, 8:54am

Thanks for the detailed answer Frank. Let me go through your statements.

Do you really mean “epoch” here? If you really need your constraint to be satisfied
in order to compute (and then presumably to backpropagate) your loss, I would think
that you would need your constraint to be satisfied for each forward / backward pass
which would typically mean for every batch (not only for an entire epoch).

for epoch in range(epochs):
    optimizer.zero_grad()

    # Forward pass with all samples
    output = model(inputs)

    # Compute loss and backpropagate
    loss = Loss_function(output,...)
    loss.backward()
    optimizer.step()

What I am currently doing is to pass all samples in one batch to the model. Here, inputs has shape [N, dim] where N denotes the number of samples and dim the sample dimension (input dimension of the NN). Both N and dim are small for my problems, so I am passing all samples in one go to the model. As a consequence – correct me if I am wrong – in that scenario “batch” and “epoch” mean the same thing, right?

If I understand your use case, adding a loss term to impose the constraint – whether
as a violation penalty or a Lagrange multiplier – won’t work in any event, as the loss
term will drive X towards satisfying the constraint as the optimization proceeds, but
will not cause the constraint to be satisfied step-by-step during the optimization. So
if you really need the constraint to be satisfied in order to calculate your loss, you must
reimpose the constraint before each forward / backward pass, which is to say, after
each optimization step.

Everything what you said before regarding A and its shape is correct. Sorry for my sloppy notation! Also you are mentioning a good point here. The optimizer will satisfy the constraint while converging to the optimum, but this means that the constraints can be violated at intermediate epochs and I may run into trouble. So projecting onto the feasible set is most likely the way to go. But I am not sure how to include the projection in the epoch loop correctly.

for epoch in range(epochs):
    optimizer.zero_grad()

    # Forward pass with all samples (X denote the add. params s.t A @ x > 0)
    output, X = model(inputs)
    
    # --> projection should happen here?
    X_new = X - torch.linalg.lstsq (A, V.unsqueeze (-1)).solution.squeeze() 

    # Compute loss and backpropagate
    loss = Loss_function(output, X)
    loss.backward()
    optimizer.step()

Should the projection on the feasible set (using torch.linalg.lstsq) happen after the call to the model, like I sketched it?
If so, does this projection cause problems for the optimizer when calculating the gradients during loss.backward()? Specifically, do we have to “inform” the optimizer about the projection step to make sure the gradients are correct?

Best,

KFrank · February 8, 2025, 6:43pm

Hi Mathieu!

Yes, if you have only one batch per epoch (and more relevant to the larger
discussion, one optimizer step per epoch), then “batch” and “epoch” become
the same.

You can put the projection in a number of different, but equivalent places.

First, let me assume that X is a trainable parameter (a leaf variable with
requires_grad = True) that is being optimized by optimizer.

So output, X = model (inputs) is highly suspicious. X should not be
computed (and therefore presumably not be returned) by model. Think
through your logic here. If X has to be returned by model, you’re probably
doing something wrong

I would to it like this:

...
loss = Loss_function (output, X)
loss.backward()
optimizer.step()
with torch.no_grad():
   V = (A @ X).min (torch.zeros (1))
   X.sub_ (torch.linalg.lstsq (A, V.unsqueeze (-1)).solution.squeeze())

Note, it is important that X is modified in place, and is modified under the protection
of a with torch.no_grad(): block.

The logic is as follows: optimizer.step() modifies X, possibly causing it to violate
the constraint. Stylistically, I prefer to reimpose the constraint immediately after
modifying X because I think it makes it clearer what’s going on. But logically, you
can reimpose the constraint any time before X is used again.

Note, if you call .lstsq() many times without A changing (and if this part of the
processing matters in terms of performance), it may be cheaper to pre-compute the
pseudoinverse of A once, A_inv = A.pinverse(), and reuse it:

   X.sub_ (A_inv @ V)

(However, lstsq() will be cheaper unless you reuse A_inv multiple times.)

No, you don’t need to tell the optimizer – you’re overriding the optimizer by brute
force. If the optimizer moves X out of compliance with the constraint and you
reimpose is, it is likely that on the next optimization step, the optimizer will similarly
move X out of compliance. But that’s okay – you just reimpose the constraint
again. In some sense the optimizer and constraint are working at cross purposes,
but this is generally not a problem.

Best.

K. Frank

Mathieu · February 9, 2025, 8:50am

So output, X = model (inputs) is highly suspicious. X should not be
computed (and therefore presumably not be returned) by model . Think
through your logic here. If X has to be returned by model , you’re probably
doing something wrong

You are absolutely correct. X is a trainable parameter which gets updated by the optimizer and not computed from within the model.

...
loss = Loss_function (output, X)
loss.backward()
optimizer.step()
with torch.no_grad():
   V = (A @ X).min (torch.zeros (1))
   X.sub_ (torch.linalg.lstsq (A, V.unsqueeze (-1)).solution.squeeze())

Note, it is important that X is modified in place, and is modified under the protection
of a with torch.no_grad(): block.

It is not clear to me why the projection should happen within a with torch.no_grad(): block. Consider the case without projection where our loss function computes f(X). During backpropagation, the optimizer (e.g., Adam) computes the gradient of the loss w.r.t X, that is, df(X) / dX =: g(X).
Lets now include the projection on the feasible set where P(X) is my notation for the projection. Hence, our loss function now computes f(P(X)) and the analytical gradient reads grad(P) * g(P(X)) where grad(P) is related to the pseudoinverse of A.

My understanding of with torch.no_grad(): is that everything what gets calculated in this block is not accounted for in the gradient calculation. In other words, we created a situation computing a loss f(P(X)) and a gradient g(P(X)) but the gradient is missing the term grad(P), hence deviates from the analytical gradient.
Is this correct?

Related to that …

No, you don’t need to tell the optimizer – you’re overriding the optimizer by brute
force. If the optimizer moves X out of compliance with the constraint and you
reimpose is, it is likely that on the next optimization step, the optimizer will similarly
move X out of compliance. But that’s okay – you just reimpose the constraint
again. In some sense the optimizer and constraint are working at cross purposes,
but this is generally not a problem.

… should the optimizer and the constraints not work together (rater than at cross purposes)? For instance: The optimizer moves X away from the feasible set, then we reimpose the constraint via brute force. So the optimizer thinks it moved X somewhere, but in reality, it has not. So, similar to what I asked above, why not including the term grad(P) in the backpropagation, thereby letting the optimizer know that we modified X?

I am new to pytorch and stochastic optimizers such as Adam. My background is more on gradient-based optimizers (e.g trust-region solvers) providing analytical gradients along with the loss value. But maybe the pytorch optimizers do not rely on analytical gradients.

Best,

KFrank · February 9, 2025, 7:20pm

Hi Mathieu!

Any modification of the trainable parameter X has to occur under .no_grad()
protection. So if you set X itself to its projected value, you need the block.

Yes, you could do this instead. Writing out some of the steps:

X_proj = P (X)
loss = Loss_function (output, X_proj)
loss.backward()

(Note, you could implement X_proj = P (X) as a parametrization. This is really
just a convenience layer that doesn’t differ substantively from imposing the constraint
“by hand.”)

This is correct.

There are use cases where not including the grad (P) term empirically works.
There are also use cases where including grad (P) empirically works. I am not
aware of any head-to-head comparisons showing that one approach in generally
better than the other,

You could try both on your problem and see whether you get better training with one
or the other.

(Note that when you impose a constraint, rather than add a constraint penalty to
your loss function, you open up the possibility that certain “modes” of the unconstrained
parameter could start drifting off to infinity, potentially leading to excessive round-off
error or overflow. In such a case it would be prudent to add some regularization such
as weight decay to your training. Mild weight decay is generally innocuous and is
often helpful for other reasons.)

Just a semantic quibble: Pytorch optimizers “rely” on mathematically correct, up to
round-off error, gradients computed numerically by autograd, but not technically on
analytical gradients.

But, depending on the details of the use case, those gradients don’t always need to
be the actual (numerically correct) gradients. They can be approximate or sometimes
willfully modified. For example, the gradients computed for one batch will differ from
the gradients computed from the next batch in the training loop. Depending on your
point of view, you can consider this to be a tolerable approximation or to be a useful
source of stochasticity that improves training and the performance of your final model.

As an aside, Adam is a good optimizer that I often use. But when starting with a
new problem, or experimenting with things like including grad (P) or not, I generally
start with plain-vanilla SGD (with no momentum), tune the learning rate a little, and
try adding things like momentum and weight decay. Only after I get a general sense
of how training progresses, might I switch to Adam. At a minimum, SGD is easier to
reason about.

Best.

K. Frank

Mathieu · February 10, 2025, 8:01am

Thanks for your detailed and valuable answer again! Just two small questions:

There are use cases where not including the grad (P) term empirically works.
There are also use cases where including grad (P) empirically works. I am not
aware of any head-to-head comparisons showing that one approach in generally
better than the other,

You could try both on your problem and see whether you get better training with one
or the other.

If I understood correctly, then there is no chance to include grad(P) as I have to write the projection in a with torch.no_grad():. Hence, the projection will not be tracked and not be accounted for in the grad calculation.

(Note that when you impose a constraint, rather than add a constraint penalty to
your loss function, you open up the possibility that certain “modes” of the unconstrained
parameter could start drifting off to infinity, potentially leading to excessive round-off
error or overflow. In such a case it would be prudent to add some regularization such
as weight decay to your training. Mild weight decay is generally innocuous and is
often helpful for other reasons.)

Good point. To prevent this, is it possible to impose bounds on X without in-place modifications? Clearly, brute forcing them (like the projection logic) could be realized after every optimizer step via

with torch.no_grad():
     X_new[:] = X.clamp(lb,ub)

There is also the sigmoid trick:

X = torch.tensor(value)
X = torch.nn.Parameter(lb + torch.sigmoid(X) * (ub - lb),  requires_grad=True)

But is there a direct way to tell the optimizer: “when optimizing X, make sure that optimizer.step() respects the bounds”. If that were possible, then we would not open up the possibility that certain modes of the unconstrained problem drift to infinity.

Best,

Mathieu · June 21, 2025, 8:19am

@KFrank Not sure if you are still following this discussion here. If so, I appreciated your advise on the following:

Based on what you stated, we can implement the projection as

    with torch.no_grad():
         V = (A @ X).min (torch.zeros (1)) 
         param.sub_ (A.pinverse() @ V)

This means we are building the pseudo-inverse of the entire A matrix. To project onto the set of active constraints, would it not be required to build the pseudo-inverse of A_sub only, where A_sub contains only the rows of A associated with active constraints ?

KFrank · June 23, 2025, 12:49am

Hi Mathieu!

Mathieu:

    with torch.no_grad():
         V = (A @ X).min (torch.zeros (1)) 
         param.sub_ (A.pinverse() @ V)
This means we are building the pseudo-inverse of the entire A matrix. To project onto the set of active constraints, would it not be required to build the pseudo-inverse of A_sub only, where A_sub contains only the rows of A associated with active constraints ?

Yes, this is correct.

That is why I suggested in my earlier post using torch.linalg.lstsq (A, V) (with
appropriate squeeze()s) instead of A.pinverse() @ V. Very crudely, linalg.lstsq()
is sort of like performing pinverse(), but just for the active constraints.

(I don’t think this is your use case, but if you were to call torch.linalg.lstsq (A, V)
for multiple values of V, but for the same value of A, then it could be cheaper to call
A.pinverse() once and reuse that result for the multiple values of V.)

Best.

K. Frank

Mathieu · June 23, 2025, 8:42am

@KFrank Thank you!

The projection costs are indeed negligible in my case. X has not more than 10-50 elements.

You also mentioned a subtle detail:

What I am observing is that the optimizer moves the parameter far away from the constrained region because it does not know about their existence since we are overriding the parameters by brute force. To address this, I believe it is prudent to add a penalty term

L_p= ||X - X_proj ||^2

where X and X_proj are the parameters before the projection and the projected parameters, respectively. But the calculation of X_proj occurs within with torch.no_grad():, essentially translating into dX_proj / dX = 0. But still, by means of L_p, we are defining a “target” we want the optimizer to stick close to and dX_proj / dX = 0 may not be a problem.

What do you think about the argumentation and idea of having the projection loss term?
Can you think of easier approaches (having less loss terms is in general a good idea) to guide the optimizer to move not too far away from the constrained space?

KFrank · June 23, 2025, 5:52pm

Hi Mathieu!

Do you have any evidence that having the optimizer move the parameter outside of
the constraint causes any actual problem? I would typically advise against adding the
complication of a penalty term unless you have evidence that it is necessary.

From what you said above, you need in any event the brute-force projection – the penalty
term in isolation won’t be enough. Unless you can show that just using the projection
doesn’t work, why try to fix something that isn’t broken?

As an aside (along the lines of don’t let your “fix” break something else), let’s say that
prior to projection, your X satisfies the constraint, that is, X = X_proj. Let’s also say
the optimizer step, without the penalty, would give some X_new as the updated X. And
just to make the discussion simpler, let’s say that X_new itself satisfies the constraint
(without projection). If you turn on the penalty, the penalty will cause X_newto be closer
to X than it “should” have been, giving you a sub-optimal update. Would the presumed
benefit of the penalty outweigh this disadvantage?

Best.

K. Frank