Non-differentiability of the complete QR decomposition for the case where rows > columns

Hi,

I have an application where I am required to use the complete QR decomposition of a tall matrix (more specifically, what I am actually trying to do is find a basis of the null space of a matrix representing an underdetermined system of equations). However, I see that the complete QR decomposition in torch does not support gradients for tall matrices. I do see some information in the literature regarding this, but before I reinvent the wheel, I was wondering for what reason, if any, torch does not support gradients for the complete QR of a tall matrix, so that I can avoid wasting time on it.
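
For reference, here is a minimal sketch of the kind of call I am talking about (the exact error depends on the torch version, but backward through the extra columns of the complete Q fails for a tall input):

```python
# Minimal sketch: backward through the complete QR of a tall matrix.
import torch

A = torch.randn(5, 3, requires_grad=True)    # tall: rows > columns

Q, R = torch.linalg.qr(A, mode='complete')   # Q is 5 x 5
loss = Q[:, 3:].sum()                        # depends on the "extra" columns of Q
loss.backward()                              # raises a RuntimeError about differentiability
```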

Hi Lucas!

While the null space you extract from a tall matrix's complete QR decomposition
is well defined, the individual basis vectors spanning that null space are not.
This is because any (typically orthogonal) transformation of that set of basis
vectors produces an equally valid basis.

This is to be expected: because of the indeterminacy of the null-space
vectors, gradients of quantities that depend on those vectors are
ill defined.

(This is analogous to pytorch giving nans for gradients of diagonalizable
matrices that have degenerate eigenvalues, but are otherwise well
behaved.)

Conceptually, if you were differentiating something (some kind of loss
function, say) that depended solely on the null space in its entirety, but
not on any specific basis for the null space, that could be differentiable.
But I don’t see any way to do this with pytorch because it would seem
that any such gradient computation would have to flow back through
some (arbitrary) set of null-space basis vectors.

Without knowing more about your use case and what object you wish to
differentiate, I can’t say whether your problem is ill posed, or whether it
is potentially well posed (but maybe not something that autograd can
handle).

Best.

K. Frank

Hi Frank,

I really appreciate your response; it has been incredibly helpful. The problem I am trying to solve, generally speaking, is to make sure that my neural network will always output a valid solution. More specifically, suppose that what I am ultimately aiming to predict is a vector x, which is constrained by theory to satisfy Ax = 0. In my current implementation, I express x as a linear combination of v1, v2, …, vn, where v1, v2, …, vn form a basis for the null space of A, and use my neural network to predict the scale factor for each of those vectors. This ensures that the solution my neural network puts out satisfies the analytic constraints. The reason I am moving away from this, and encountering this issue, is that I am trying to build a more generalisable model, and in order to do this the analytic constraint becomes f(y)x = 0, where f: R^m → R^{m x n}, m < n. Conceptually, therefore, the constraint does not depend on any particular basis for the null space, but the basis that is produced should be consistent and continuous with respect to small perturbations in the input to whatever function spits out the basis set.

I have also been considering that perhaps forming a dummy matrix A = (f(y) | D), such that the matrix is square, could enforce a degree of ‘sameness’ on the basis for the null space. I believe that I can do this because (at least under a Householder-based implementation) the ith column of Q should not depend on the jth column of A for i < j, and I do not care about R. However, I am not sure whether this is also true for the implementation in the linalg package, although from preliminary playing around I believe it to be true. More important, and something that I am not able to intuit, is the question of whether keeping the dummy matrix D constant enforces continuity of the (m+1)th through nth columns of Q with respect to the elements of f(y).
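
As a rough sanity check of that column-independence assumption (my own quick experiment, not a proof), I have been comparing the leading columns of Q before and after perturbing only the trailing columns of the input:

```python
# Rough check: do the leading columns of Q from torch.linalg.qr depend only on
# the leading columns of the input matrix?
import torch

torch.manual_seed(0)
A1 = torch.randn(6, 6, dtype=torch.float64)
A2 = A1.clone()
A2[:, 4:] = torch.randn(6, 2, dtype=torch.float64)   # change only the last two columns

Q1, _ = torch.linalg.qr(A1, mode='reduced')
Q2, _ = torch.linalg.qr(A2, mode='reduced')

# If the Householder-style column independence holds for this implementation,
# the first four columns of Q should be unchanged.
print(torch.allclose(Q1[:, :4], Q2[:, :4]))
```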

Any further insight would be greatly appreciated,

Regards,

L5RK.

P.S - My apologies for the poor formatting - I am new to pytorch forums and have not figured out how to use LaTeX on here yet.

Hi Lucas!

I sort of think I understand what you are trying to do.

In this case, as I understand it, A is “external” to your network and
doesn’t depend on the input data (nor on any “ground-truth” target
data).

(I believe that your null space has dimensionality n - m so the vs should
run up to v_(n - m).)

Because A is an external constant, you can pre-compute some (arbitrary)
basis of A’s (well-defined) null space, v1, v2, .... You then use your
network to predict the coordinates of your output vector, x, with respect
to that basis. Because the specific choice of basis is a pre-computed
constant, those predicted coordinates always have the same meaning,
so it makes sense to train your network to predict them.
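
(A minimal sketch of this pre-computed-basis scheme, with my own names, assuming A has full row rank:)

```python
# A is a constant [m, n] constraint matrix with m < n and full row rank.
import torch

def null_space_basis(A):
    """Orthonormal basis of {x : A x = 0}; computed once, no autograd involved."""
    m, n = A.shape
    Q, _ = torch.linalg.qr(A.T, mode='complete')   # Q is [n, n]
    return Q[:, m:]                                # last n - m columns span the null space

# basis:  [n, n - m], pre-computed constant
# coeffs: [batch, n - m], predicted by the network
# x = coeffs @ basis.T then satisfies A @ x.T ~= 0 by construction
```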

Based on your comments, I assume that this version of your scheme is
working for you.

I assume that the function f is “external” and doesn’t depend on the model
or any of the data.

Where does y come from? Is it part of the input data? Is it part of the
ground-truth target data? Or is it also predicted by some part of your
trained network (or itself a trained parameter)?

Unless y (and hence f (y)) depends on some trained parameters
(and is used in the training of those parameters), I don’t see why the
differentiability of the QR decomposition would enter into your problem.

So, to confirm my understanding, you have a null space that is determined
by f (y) (formerly the matrix A), which can vary (but “continuously”). Your
network predicts the coordinates of your output vector, x, with respect to
some conceptually-arbitrary basis of that null space. In order for those
predicted coordinates to have some stable (and, hence, predictable)
meaning, you need your null-space basis to depend “continuously” on f (y).

Here’s a scheme that I think might suffice to break the arbitrariness of the
null-space basis “by hand”:

Your predicted vector, x, conceptually has dimension n. But it satisfies
the m constraints given by f (y) @ x = 0, where f (y) is a matrix of
dimension [m, n], so the overall constraint is that x lie in the null space
of f (y). Therefore x only has n - m actual degrees of freedom (and
n - m is the dimension of the null space).

Let k = n - m denote the dimension of the null space. Generate k random
vectors of dimension n. (Concretely I have in mind generating gaussian
vectors and normalizing them to have unit length. I would probably use
low-discrepancy “random” numbers to get more uniform coverage of the
n-dimensional space.)
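
(Something like this is what I have in mind for the seeds; a low-discrepancy generator could be substituted for torch.randn:)

```python
# Generate k constant "seed" vectors of dimension n, normalized to unit length.
import torch

torch.manual_seed(0)
n = 10                                  # ambient dimension
k = 4                                   # null-space dimension, k = n - m
C = torch.randn(n, k)                   # seed vectors as columns
C = C / C.norm(dim=0, keepdim=True)     # normalize each column
```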

It is highly unlikely that any of these vectors will be orthogonal to the null
space.

Call these vectors c1, c2, ..., ck. They are pre-computed and held
constant, as they do not depend on f (y). Think of them as the “seeds”
for a smoothly-varying basis of the smoothly-varying null space of f (y).

For a given f (y), compute its null space from its QR decomposition.
(I am ignoring differentiability here.) Project the c1, c2, ... (the seed
vectors) onto that null space to obtain b1, b2, ..., the desired
smoothly-varying null-space basis.
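
(A sketch of this projection step, with my own notation; F stands for f (y) and is assumed to have full row rank, and C holds the seed vectors as columns:)

```python
import torch

def project_seeds_onto_null_space(F, C):
    """Project the constant seed vectors (columns of C) onto the null space of F."""
    m, n = F.shape
    Q, _ = torch.linalg.qr(F.T, mode='complete')   # complete QR of the transpose
    N = Q[:, m:]                                   # [n, n - m] orthonormal null-space basis
    return N @ (N.T @ C)                           # columns are the b1, b2, ...
```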

(I imagine that it could be helpful to orthogonalize the bi. If you iteratively
orthogonalize the bi – perhaps purposefully with a small step size – the
resulting orthogonal bi should remain smoothly varying as a function of
f (y).)
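
(One possible reading of "iteratively orthogonalize with a small step size" is a damped Newton-Schulz-style iteration, sketched below; the step size and iteration count are arbitrary choices:)

```python
import torch

def soft_orthogonalize(B, step=0.1, iters=50):
    """Gradually push the columns of B toward orthonormality.

    Assumes the columns are already normalized and not too strongly correlated,
    so the iteration stays well behaved; it is built from matmuls only, so it
    remains smooth (and differentiable) in B.
    """
    k = B.shape[1]
    eye = torch.eye(k, dtype=B.dtype, device=B.device)
    for _ in range(iters):
        B = B @ (eye - step * (B.T @ B - eye))
    return B
```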

If you require autograd-differentiability of the bi with respect to f (y),
you could use the following modified scheme. Compute the reduced
QR decomposition, which is differentiable. This gives you a (well-defined)
basis for the “non-null space” of f (y). Obtain the bi by projecting the
ci out of the non-null space, and hence into the null space, by using
the non-null-space basis given by the reduced QR decomposition.
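
(A sketch of this autograd-friendly variant, again with my own notation; F = f (y) is [m, n] with full row rank, and C is the constant [n, k] seed matrix with k = n - m:)

```python
import torch

def differentiable_null_space_basis(F, C):
    """Null-space basis of F that varies smoothly with F and supports autograd."""
    Q, _ = torch.linalg.qr(F.T, mode='reduced')   # [n, m]; reduced QR is differentiable
    B = C - Q @ (Q.T @ C)                         # remove the non-null-space component
    return B                                      # columns of B lie in the null space of F
```

Gradients then flow from the bi back into f (y) through the reduced Q.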

My apologies if I’ve misunderstood somewhere along the way what you
are trying to do.

It would be helpful if you could describe the conceptual details of your
use case. What do the input data to your network actually mean? What
do the target data mean? How much data do you have? What does
the matrix A actually mean, where does it come from, and what is its
dimensionality? What kind of structure does your constraint function, f,
have, and where does it come from? What is the architecture of your
network and what optimizer are you using?

Good luck!

K. Frank

Hi K,

I believe you’ve understood the problem quite well. I can add some clarifications for completeness, but I think that you’ve given me a pretty good starting point for me to implement an early solution!

I assume that the function f is “external” and doesn’t depend on the model
or any of the data.

This is true - my problem, more specifically, is as follows. Given a set of data points (prices), each associated with an upper and a lower bound (ask and bid prices respectively), I am trying to find a set of C2-continuous cubic splines that match the data well. The cubic splines are joined at knots at (x_0, x_1, ..., x_l), and the original matrix A enforces C2 continuity between the splines at a predetermined set of xs. My current goal is to make the placement of the knots a learnable property based on the data, hence y is an output of the trainable part of the net. This should mean that I get f(predicted knot locations) = (a system that enforces continuity). Right now the structure of the trainable part of the net is very basic (a stack of fully connected linear layers with LeakyReLU activations), and I am using SGD for simplicity as well. Because I want the errors of my loss function to propagate through f(y) so that the placement of the splines can be learned, I do require autograd compatibility.
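
(For concreteness, the trainable part currently looks roughly like the following; all sizes here are placeholders:)

```python
import torch
import torch.nn as nn

# "A stack of fully connected linear layers with LeakyReLU activations."
net = nn.Sequential(
    nn.Linear(32, 64), nn.LeakyReLU(),
    nn.Linear(64, 64), nn.LeakyReLU(),
    nn.Linear(64, 8),                 # e.g. predicted knot locations y
)
optimizer = torch.optim.SGD(net.parameters(), lr=1e-2)
```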

If you require autograd-differentiability of the bi with respect to f (y) ,
you could use the following modified scheme. Compute the reduced
QR decomposition, which is differentiable. This gives you a (well-defined)
basis for the “non-null space” of f (y) . Obtain the bi by projecting the
ci out of the non-null space, and hence into the null space, by using
the non-null-space basis given by the reduced QR decomposition.

I think this solution addresses my problem perfectly. The only thing that I remain mildly concerned about is the numerical stability of these projections, although I suppose that is not something easily avoidable in a general sense.

Thank you so much for your time,

Regards,
Lucas.

Hi Lucas!

If the knot locations encoded in A are in roughly the same spots as the
learnable knots, you might get some benefit from using an orthonormal
basis of A's null space as your “seed” basis.

I don’t have a clear understanding of the C2 continuity conditions, but I
would imagine that they and the matrix f (y) would become degenerate
and / or singular when knots coincide. You could perhaps add a regulator
(as a loss term for y) that causes the knots to repel one another.
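
(One possible form of such a repulsion term, as a rough sketch; the exponent and epsilon are arbitrary choices, and the term would be added to your loss with some small weight:)

```python
import torch

def knot_repulsion(knots, eps=1e-6):
    """Penalty that grows as predicted knot locations approach one another.

    Assumes knots is a 1-d tensor of predicted knot positions.
    """
    diffs = knots.unsqueeze(0) - knots.unsqueeze(1)               # [k, k] pairwise differences
    k = knots.shape[0]
    off_diag = ~torch.eye(k, dtype=torch.bool, device=knots.device)
    return (1.0 / (diffs[off_diag].abs() + eps)).sum()
```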

Again, based on whether or not A is somewhat similar to f (y), you could
even add a regulator that causes the predicted knot locations (with some
specific labelling scheme) to be attracted to the corresponding knot locations
of A. This could also help stabilize the knots to remain “sensible” and not
become singular.

Best.

K. Frank