Shouldn't we project the gradients when the weight matrix is constrained?

Tomarchelone · January 12, 2022, 12:24pm

Using torch.nn.utils.parametrizations, one can enforce a constraint on a weight matrix: for example, we can make the matrix orthogonal or give it a spectral norm of 1. This is achieved by recomputing the weights in the forward pass.
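For concreteness, here is a minimal sketch of the orthogonal case. The layer size and variable names are only illustrative. As far as I understand the mechanism, the trainable leaf tensor is the unconstrained `parametrizations.weight.original`, and `layer.weight` is recomputed from it.

```python
import torch
from torch import nn
from torch.nn.utils import parametrizations

torch.manual_seed(0)

# Wrap a square linear layer so that its weight is constrained to be orthogonal.
layer = parametrizations.orthogonal(nn.Linear(4, 4))

# Accessing .weight recomputes the constrained matrix from the unconstrained parameter.
W = layer.weight
ortho_error = (W.T @ W - torch.eye(4)).abs().max().item()

# The optimizer sees the unconstrained tensor, not W itself.
original = layer.parametrizations.weight.original

# After a backward pass, the gradient is stored on the unconstrained tensor.
x = torch.randn(2, 4)
layer(x).sum().backward()
print(ortho_error, original.grad.shape)
```

Here `ortho_error` should be at the level of float32 rounding error, and nothing in this path explicitly constrains the gradient.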

However, the gradients of the weights are not constrained in the backward pass. As a result, the gradient of the weight matrix can have a component pointing outside the desired set of parameters (for example, the orthogonal matrices). This component is redundant, but it can be very large. That seems potentially harmful: with gradient clipping, for example, this component is included when the gradient norm is computed.

Can we project the gradients so that they always point inside the desired set? Would it be worth reimplementing these parametrizations so that they support gradient projection?
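To make the question concrete, here is a rough sketch of what I mean by projection in the orthogonal case. It assumes a square orthogonal W and a Euclidean gradient G taken with respect to W itself, and it uses the standard fact that the tangent space of the orthogonal group at W is the set of matrices W A with A skew-symmetric. The helper name `project_to_tangent` is hypothetical, not an existing PyTorch function, and I am not sure where such a step would belong inside the parametrization machinery.

```python
import torch


def project_to_tangent(W: torch.Tensor, G: torch.Tensor) -> torch.Tensor:
    """Project G onto the tangent space of the orthogonal group at W.

    Assumes W is square with W^T W = I. The projection (orthogonal with
    respect to the Frobenius inner product) keeps only the skew-symmetric
    part of W^T G.
    """
    A = W.T @ G
    return W @ (A - A.T) / 2


torch.manual_seed(0)

# A random orthogonal matrix (via QR) and a random stand-in for a gradient.
W, _ = torch.linalg.qr(torch.randn(5, 5, dtype=torch.float64))
G = torch.randn(5, 5, dtype=torch.float64)

P = project_to_tangent(W, G)   # component along the set of orthogonal matrices
normal = G - P                 # component pointing out of the set

# The norm used by gradient clipping would include the normal component.
print(G.norm().item(), P.norm().item(), normal.norm().item())
```

In this sketch the two components are orthogonal, so the squared norm of G is the sum of the squared norms of `P` and `normal`. Clipping on the norm of G is therefore affected by the part that does not move W along the constraint set.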