Using Torch to optimize a string of matrix products

Dear All,
I would like to employ PyTorch to optimize a string of matrix products:

min Frobenius_norm( M_1(p_1) * M_2(p_2) * M_3(p_3) * … * M_n(p_n) - T )

where:
M_i(p_i) is the i-th matrix of known structure parameterized by the vector of parameters p_i (unique for every matrix),
T is some target matrix we want to approximate by the product,
n is the “depth” of the product,
and the goal is to minimize the Frobenius norm of the difference between the target and the approximating string of matrices.

It is not difficult to write down a formal procedure that includes the back-propagation algorithm. The major problem is computational efficiency, considering that all the matrices are sparse.
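For concreteness, the straightforward dense version of such a procedure might look roughly like the sketch below. Here build_matrix and the superdiagonal parameterization are only placeholders for the actual known structure of the M_i(p_i), and the sizes and learning rate are arbitrary:

```python
import torch

# Hypothetical placeholder: embed the parameters into a fixed structure.
# Replace this with the actual known structure of M_i(p_i).
def build_matrix(p, size):
    M = torch.eye(size)
    M = M + torch.diag(p, diagonal=1)  # e.g. parameters on the superdiagonal
    return M

size, n = 8, 5
T = torch.randn(size, size)                                        # target matrix
params = [torch.randn(size - 1, requires_grad=True) for _ in range(n)]

opt = torch.optim.Adam(params, lr=1e-2)
for step in range(1000):
    opt.zero_grad()
    prod = build_matrix(params[0], size)
    for p in params[1:]:
        prod = prod @ build_matrix(p, size)
    # squared Frobenius norm of the mismatch
    loss = torch.linalg.norm(prod - T, ord='fro') ** 2
    loss.backward()
    opt.step()
```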

I would appreciate any suggestions. Could anyone point me to an implementation example?

Hi Albert!

The short story is that I recommend exploring a standard non-linear
least-squares optimization routine for your problem, rather than using
pytorch’s more general autograd / optimizer tools.

I don’t have much to say about using pytorch for this, but I do have some
comments, below:

The basic suggestion will be to perform the optimization iteratively,
optimizing one M_i (p_i) at a time, looping over the M_i (p_i)
repeatedly.

First, I assume that the matrix M (p) is a deterministic function of the
parameters, p, and that you are optimizing the p (rather than optimizing
M directly). My comments apply both to optimizing p and to optimizing
M, noting that optimizing M directly is easier, should that be your use
case.

If you fix M_2, M_3, ..., M_n (by fixing p_2, p_3, ..., p_n), then
the elements of your matrix product are linear functions of the elements
of M_1. So (the square of) your Frobenius-norm loss function is a sum
of squares. If you are optimizing M directly, this can be done in one
step using (numerical) linear algebra. But even if you are optimizing
the p, the optimization of any given p_i is still a sum of squares (that
depend, potentially non-linearly, on the parameters). This is a substantive
special structure that makes iterative optimization of this non-linear
sum-of-squares problem significantly cheaper and more robust (see,
for example, Gauss-Newton), and there are many pre-packaged tools
available for doing this.
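As an illustration of the block-wise iteration, here is a hedged sketch using scipy.optimize.least_squares (a trust-region least-squares solver in the Gauss-Newton / Levenberg-Marquardt family). The names build_matrix, residuals_for_block, and alternating_least_squares are placeholders, not an existing API, and for the direct-M case each block solve could instead be a single linear lstsq:

```python
from scipy.optimize import least_squares

# Residuals are the entries of (M_1 ... M_n - T), flattened, with only
# block i free; the Frobenius norm is the 2-norm of this residual vector.
def residuals_for_block(p_i, i, params, T, build_matrix):
    mats = [build_matrix(p_i if j == i else params[j]) for j in range(len(params))]
    prod = mats[0]
    for M in mats[1:]:
        prod = prod @ M
    return (prod - T).ravel()

# Cycle over the blocks p_i, solving a least-squares subproblem for one
# block at a time while the others are held fixed.
def alternating_least_squares(params, T, build_matrix, n_sweeps=10):
    for _ in range(n_sweeps):
        for i in range(len(params)):
            sol = least_squares(residuals_for_block, params[i],
                                args=(i, params, T, build_matrix))
            params[i] = sol.x
    return params
```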

(Note, that even optimizing all of the p_i at once is still a non-linear
sum-of-squares problem to which these tools should be applicable,
and performing the joint optimization may turn out to be more efficient.
But my intuition tells me that iterating over optimizing each p_i one at
a time could well work better.)
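(If you do try the joint version, one way to set it up is to stack all of the p_i into a single parameter vector and hand the whole thing to one least-squares solve. Again a sketch only, with joint_residuals and build_matrix as placeholder names:

```python
import numpy as np
from scipy.optimize import least_squares

# All parameters live in one flat vector x; splits records where each p_i ends.
def joint_residuals(x, splits, T, build_matrix):
    blocks = np.split(x, splits)          # recover the individual p_i
    prod = build_matrix(blocks[0])
    for p in blocks[1:]:
        prod = prod @ build_matrix(p)
    return (prod - T).ravel()

# x0 = np.concatenate(params)
# splits = np.cumsum([len(p) for p in params])[:-1]
# sol = least_squares(joint_residuals, x0, args=(splits, T, build_matrix))
```
)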

Please note that the product of two sparse matrices is not, in general,
as sparse as the two factors, so an algorithm that has you explicitly
calculating the product of many sparse matrices is likely to lose any
efficiency benefit of the sparsity.
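As a quick illustration of this fill-in effect (with random sparsity, which is only an assumption about your matrices), using scipy.sparse:

```python
import scipy.sparse as sp

# Five factors, each 1% dense; watch the nonzero count grow as they are multiplied.
factors = [sp.random(1000, 1000, density=0.01, format='csr') for _ in range(5)]
prod = factors[0]
for M in factors[1:]:
    prod = prod @ M
    print(f"nonzeros after multiplying: {prod.nnz}")
```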

You haven’t told us the sparsity structure of your matrices. Are they
sparse, but the locations of the non-zero elements are random? Or
do they have structure, such as being banded or block-diagonal?

Are the locations of the non-zero elements of M_i (p_i) constant, or
do these locations depend on the values of p_i?

Can you use the sparsity structure to dramatically cheapen multiplying
the matrices together? Is the structure such that the final product of the
matrices is, itself, sparse?

You can certainly explore pytorch’s sparse-tensor support for these
manipulations, but be forewarned that sparse-tensor calculations often
don’t map well to gpu vectorization, so any time savings may prove
negligible, or even non-negligibly negative.
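If you do want to experiment with that route, a sketch of what sparse, autograd-friendly factors might look like is below. The fixed random index patterns are just an assumption for illustration; the nonzero values are the learnable parameters, and gradients should flow back to them through the sparse-tensor construction and torch.sparse.mm, provided your pytorch version supports backward for those sparse ops:

```python
import torch

size, n, nnz = 64, 4, 200
T = torch.randn(size, size)

# Fixed sparsity patterns (assumed); the values at those locations are learned.
indices = [torch.randint(0, size, (2, nnz)) for _ in range(n)]
values = [torch.randn(nnz, requires_grad=True) for _ in range(n)]

opt = torch.optim.Adam(values, lr=1e-2)
for step in range(500):
    opt.zero_grad()
    prod = torch.eye(size)
    # Multiply right-to-left so each step is sparse @ dense.
    for idx, val in zip(reversed(indices), reversed(values)):
        M = torch.sparse_coo_tensor(idx, val, (size, size))
        prod = torch.sparse.mm(M, prod)
    loss = torch.linalg.norm(prod - T, ord='fro') ** 2
    loss.backward()
    opt.step()
```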

Good luck.

K. Frank

Thank you very much for the insightful comment. I totally agree that the sparsity structure becomes an issue as the length of the sequence gets longer.