The tool `torch.nn.utils.parametrize` lets us register parametrizations for the weights of a layer. For instance, it lets us factor the weights of a layer into triangular matrices so that the layer is symmetric. Does autograd compute gradients for the parametrized factors? That is, do optimization steps keep the layer symmetric?

Hey!

The gradients will indeed “flow” all the way to the new Parameters (the ones that exist before the parametrization is applied). The optimizer then updates these pre-parametrization Parameters, since they are now the module's actual parameters.
This means that your parametrization is re-run after the update, so the property it provides is preserved.
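
To make this concrete, here is a minimal sketch. It assumes a hypothetical `Symmetric` parametrization (built from the upper triangle, which is not necessarily the factorization the question has in mind) on a small `nn.Linear`. It checks that the gradient lands on `parametrizations.weight.original` and that `layer.weight` is still symmetric after one optimizer step.

```python
import torch
import torch.nn as nn
import torch.nn.utils.parametrize as parametrize


class Symmetric(nn.Module):
    # Hypothetical parametrization: mirror the upper triangle of X
    # onto the lower triangle so the result is symmetric.
    def forward(self, X):
        return X.triu() + X.triu(1).transpose(-1, -2)


torch.manual_seed(0)
layer = nn.Linear(3, 3)
parametrize.register_parametrization(layer, "weight", Symmetric())

# The unconstrained tensor that the optimizer actually updates.
original = layer.parametrizations.weight.original
opt = torch.optim.SGD(layer.parameters(), lr=0.1)

loss = layer(torch.randn(4, 3)).pow(2).sum()
loss.backward()
grad_is_populated = original.grad is not None

opt.step()

# Accessing layer.weight re-runs the parametrization on the updated original.
W = layer.weight
is_symmetric = torch.allclose(W, W.T)
print(grad_is_populated, is_symmetric)
```

Note that `original` itself is a plain unconstrained matrix; only the derived `layer.weight` is symmetric.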

But then the optimization step by itself does not seem to preserve the restriction imposed by the parametrization. The restriction is only imposed by applying the parametrization again after the optimization step. In this case, autograd does not calculate the gradient for the factors of the parametrization.

Hi André!

This is correct. But I do believe that PyTorch’s scheme for optimizing
`model.parametrizations.weight.original`, that is, your model’s
“original” unconstrained `Parameter`, works correctly.*

Yes, this is also correct.

No, not in the sense that an actual layer weight stored in memory is kept
symmetric (or otherwise respects the constraint that your parametrization
imposes).

But what would it take to do this?

Consider a parameter that is a single two-dimensional point, `[x, y]`,
that is constrained to have unit length. That is to say, the parameter
is a single point on a circle of radius one.

(You can consider this to be a baby version of a matrix that is constrained
to be orthonormal, for which similar but more nuanced issues arise.)

If you compute the gradient of your loss with respect
to `[x, y]`, there is no reason for that gradient to lie tangent to the unit
circle to which `[x, y]` is constrained. You could add a feature so that
when you write your parametrization you also write code to project the
gradient so that it lies tangent to the constraint (and have autograd
use such code), but for many reasonable constraints, this could be difficult
to do, and it seems unlikely that this could be automated (other than purely
numerically) for general constraints.

Even if you were to project the gradient to be tangent to the constraint,
a finite optimization step (which could include momentum or weight
decay) will almost certainly move you off the constraint surface.
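
A small numerical sketch of this point, in plain Python, for the unit-circle case. The gradient `g`, the learning rate `lr`, and the helper `project_to_tangent` are made-up illustrations, not part of any PyTorch API.

```python
import math


def project_to_tangent(p, g):
    # Remove the radial component of g; assumes p has unit length.
    dot = p[0] * g[0] + p[1] * g[1]
    return (g[0] - dot * p[0], g[1] - dot * p[1])


p = (1.0, 0.0)    # a point on the unit circle
g = (0.3, -2.0)   # hypothetical gradient of some loss at p
lr = 0.1          # hypothetical learning rate

t = project_to_tangent(p, g)                  # tangent to the circle at p
p_new = (p[0] - lr * t[0], p[1] - lr * t[1])  # plain gradient-descent step
norm_new = math.hypot(*p_new)

print(t, p_new, norm_new)
```

Even though `t` is exactly tangent to the circle at `p`, the straight-line step of finite length ends at a point whose norm is greater than one, so the constraint would still have to be reimposed afterwards.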

All in all, the only practical approach for general constraints seems to
be to let the gradient do its thing, let the finite optimization step do its
thing, have your “original” parameter not satisfy the constraint, but
then derive a tensor from the “original” parameter that does satisfy the
constraint and that is then used in the forward-pass computation.
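
For the unit-circle example, that approach might look like the sketch below. The `UnitNorm` parametrization, the `Point` module, the target, and the learning rate are all hypothetical choices made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.utils.parametrize as parametrize


class UnitNorm(nn.Module):
    # Derive a unit-length point from the unconstrained parameter.
    def forward(self, v):
        return v / v.norm()


class Point(nn.Module):
    def __init__(self):
        super().__init__()
        self.p = nn.Parameter(torch.tensor([3.0, 4.0]))


m = Point()
parametrize.register_parametrization(m, "p", UnitNorm())
opt = torch.optim.SGD(m.parameters(), lr=0.5)
target = torch.tensor([0.0, 1.0])

for _ in range(20):
    opt.zero_grad()
    # The forward pass only ever sees the constrained tensor m.p.
    loss = (m.p - target).pow(2).sum()
    loss.backward()
    opt.step()

original_norm = m.parametrizations.p.original.norm().item()
constrained_norm = m.p.norm().item()
print(original_norm, constrained_norm)
```

The “original” parameter is free to have any norm, while `m.p`, the tensor used in the loss, has unit length on every access.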

*) It is possible for the “original” unconstrained parameter, for example
in the unit-circle case, to drift off to infinity or down to zero over the
course of many optimization steps, potentially leading to `inf`s or `nan`s.
In such a case, it would be logically consistent to reimpose the constraint
on the “original” parameter from time to time.
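
One way to do this for the unit-circle case is sketched below, reusing the hypothetical `UnitNorm` and `Point` from the illustration of the constraint. It rescales the “original” parameter in place under `torch.no_grad()`. This is an assumption about how one might do it, not an established PyTorch recipe, and it does not touch any optimizer state such as momentum buffers.

```python
import torch
import torch.nn as nn
import torch.nn.utils.parametrize as parametrize


class UnitNorm(nn.Module):
    def forward(self, v):
        return v / v.norm()


class Point(nn.Module):
    def __init__(self):
        super().__init__()
        self.p = nn.Parameter(torch.tensor([3.0, 4.0]))


m = Point()
parametrize.register_parametrization(m, "p", UnitNorm())
original = m.parametrizations.p.original

before = m.p.detach().clone()
with torch.no_grad():
    # Pull the unconstrained parameter back onto the unit circle.
    original.div_(original.norm())
after = m.p.detach().clone()

print(before, after, original.norm().item())
```

Because `UnitNorm` only depends on the direction of its input, this rescaling leaves the constrained tensor `m.p` unchanged while keeping the norm of the “original” parameter from drifting.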

Best.

K. Frank