The tool `torch.nn.utils.parametrize` allows us to register parametrizations for the weights of a layer. For instance, it lets us build a layer's weight from a triangular matrix so that the weight is symmetric. Does autograd compute gradients for the parametrized factors? That is, do optimization steps keep the layer symmetric?

Hey!

The gradients will indeed “flow” all the way back to the new `Parameter`s (the ones that exist before the parametrization is applied), and the optimization step is then applied pre-parametrization (since these are the new parameters).

This means that your parametrization will re-run after the update and thus the property provided by it will be preserved indeed.
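A minimal sketch of this, using a hypothetical `Symmetric` parametrization module (names are illustrative, not from the question):

```python
import torch
import torch.nn as nn
import torch.nn.utils.parametrize as parametrize

class Symmetric(nn.Module):
    def forward(self, X):
        # derive a symmetric weight from the unconstrained parameter:
        # upper triangle (incl. diagonal) mirrored into the lower triangle
        return X.triu() + X.triu(1).transpose(-1, -2)

layer = nn.Linear(3, 3)
parametrize.register_parametrization(layer, "weight", Symmetric())

opt = torch.optim.SGD(layer.parameters(), lr=0.1)
loss = layer(torch.randn(2, 3)).sum()
loss.backward()

# the gradient lands on the unconstrained "original" parameter
assert layer.parametrizations.weight.original.grad is not None

opt.step()

# the effective weight is recomputed through the parametrization,
# so it is still symmetric after the update
assert torch.allclose(layer.weight, layer.weight.T)
```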

But in that way, the optimization step by itself does not seem to preserve the restriction imposed by the parametrization; the restriction is instead imposed by applying the parametrization again after the optimization step. In this case, autograd does not calculate the gradient for the factors of the parametrization.

Hi André!

This is correct. But I do believe that pytorch’s scheme for optimizing `model.parametrizations.weight.original`, that is, your model’s “original” unconstrained `Parameter`, works correctly.*

Yes, this is also correct.

No, not in the sense that an actual layer weight stored in memory is kept symmetric (or otherwise respects the constraint that your parametrization imposes).

But what would it take to do this?

Consider a parameter that is a single two-dimensional point, `[x, y]`, that is constrained to have unit length. That is to say, the parameter is a single point on a circle of radius one.

(You can consider this to be a baby version of a matrix that is constrained to be orthonormal, for which similar, but more nuanced, issues arise.)

When autograd computes the gradient of some loss function with respect to `[x, y]`, there is no reason for that gradient to lie tangent to the unit circle to which `[x, y]` is constrained. You could add a feature so that when you write your parametrization you also write code to project the gradient to be tangent to the constraint (and autograd could automatically use such code), but for many reasonable constraints this could be difficult to do, and it seems unlikely that it could be automated (other than purely numerically) for general constraints.

Even if you were to project the gradient to be tangent to the constraint, a *finite* optimization step (which could include momentum or weight decay) will almost certainly move you off the constraint surface.

All in all, the only practical approach for general constraints seems to be to let the gradient do its thing, let the finite optimization step do its thing, have your “original” parameter not satisfy the constraint, but then derive a tensor from the “original” parameter that does satisfy the constraint and that is then used in the forward-pass computation.
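A small sketch of the unit-circle case shows both effects (the `UnitCircle` and `Point` modules are hypothetical, made up for this illustration): after one finite SGD step the unconstrained “original” parameter has left the circle, but the tensor derived from it and used in the forward pass is back on it.

```python
import torch
import torch.nn as nn
import torch.nn.utils.parametrize as parametrize

class UnitCircle(nn.Module):
    def forward(self, v):
        # project the unconstrained 2-vector onto the unit circle
        return v / v.norm()

class Point(nn.Module):
    def __init__(self):
        super().__init__()
        self.p = nn.Parameter(torch.tensor([0.6, 0.8]))  # on the circle

m = Point()
parametrize.register_parametrization(m, "p", UnitCircle())
opt = torch.optim.SGD(m.parameters(), lr=0.5)

loss = (m.p * torch.tensor([1.0, -1.0])).sum()  # an arbitrary loss
loss.backward()
opt.step()

original = m.parametrizations.p.original
# the finite step has moved the unconstrained original off the circle ...
assert not torch.allclose(original.norm(), torch.tensor(1.0))
# ... but the derived, parametrized tensor still satisfies the constraint
assert torch.allclose(m.p.norm(), torch.tensor(1.0))
```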

*) It is possible for the “original” unconstrained parameter, for example in the unit-circle case, to drift off to infinity or down to zero over the course of many optimization steps, potentially leading to `inf`s or `nan`s. In such a case, it would be logically consistent to reimpose the constraint on the “original” parameter from time to time.
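One way to reimpose the constraint (a sketch, again with hypothetical `UnitCircle` and `Point` modules) is to assign the constrained value back to the parametrized tensor; assignment goes through the parametrization’s `right_inverse`, which here is the identity, so the “original” parameter is overwritten with a point on the circle:

```python
import torch
import torch.nn as nn
import torch.nn.utils.parametrize as parametrize

class UnitCircle(nn.Module):
    def forward(self, v):
        return v / v.norm()

    def right_inverse(self, v):
        # identity: a value assigned to the parametrized tensor is
        # stored as-is in the "original" parameter
        return v

class Point(nn.Module):
    def __init__(self):
        super().__init__()
        self.p = nn.Parameter(torch.tensor([3.0, 4.0]))  # norm 5, off the circle

m = Point()
parametrize.register_parametrization(m, "p", UnitCircle())

# reimpose the constraint on the original parameter from time to time:
# reading m.p gives the constrained value, and assigning it back
# overwrites the drifted original with it
with torch.no_grad():
    m.p = m.p

assert torch.allclose(m.parametrizations.p.original.norm(), torch.tensor(1.0))
```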

Best.

K. Frank