Why is it so hard to enforce a weight matrix to be orthogonal?

KFrank · January 16, 2021, 5:51am

Hi Zeyuyun!

I don’t know if what I said can be made precise – for me it is analogy and
intuition.

Go back to the circle – let’s say of radius 1 in the x-y plane. Consider
a point off the circle, (0.01, 1.50). The closest point on the circle is
approximately (0.0, 1.0). So x = 0.01 is nearly right, while y = 1.50
is a ways off.

With the sum-of-absolute-values, each component of the gradient has
magnitude one. So a gradient-descent step will change both x and y
by the same amount, even though y should be changed much more
than x. With the sum-of-squares, the gradient will be directed outward
from the center of the circle, so a gradient-descent step will move the
point (x, y) directly towards the center of the circle, which, for a point
outside the circle, is the same as moving it directly towards the nearest
point on the circle (which is what we want).
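
To make the comparison concrete, here is a rough sketch of one gradient-descent step from (0.01, 1.50) under each penalty. The specific forms are my assumption for illustration: |x| + |y| for the sum-of-absolute-values and x^2 + y^2 for the sum-of-squares, with hand-written gradients. The helper names and the learning rate of 0.1 are likewise just illustrative.

```python
import math

def grad_abs(x, y):
    # Gradient of the assumed penalty |x| + |y|: each component is +/-1.
    return (math.copysign(1.0, x), math.copysign(1.0, y))

def grad_sq(x, y):
    # Gradient of the assumed penalty x**2 + y**2: points radially outward.
    return (2.0 * x, 2.0 * y)

def step(point, grad, lr):
    # One plain gradient-descent step.
    return (point[0] - lr * grad[0], point[1] - lr * grad[1])

p = (0.01, 1.50)
lr = 0.1  # illustrative learning rate

p_abs = step(p, grad_abs(*p), lr)
p_sq = step(p, grad_sq(*p), lr)

# Cross product of the old and new points; zero means the new point
# still lies on the same line through the center of the circle.
cross_sq = p_sq[0] * p[1] - p_sq[1] * p[0]

print("abs step:", p_abs)   # x and y each change by lr
print("sq step: ", p_sq)    # point is scaled towards the center
print("cross:   ", cross_sq)
```

With the absolute-value penalty, x gets pushed past zero and ends up further from its target than where it started, while the squares penalty just shrinks the point along its own radial line.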

Now, depending on the learning rate, we might not move all the way to
the circle, staying on the outside. Or we might overshoot, jumping to the
inside of the circle, but at least we’re moving in the right direction.

(Also, the sum-of-squares is “softer.” As you get closer to the circle, the
gradients become smaller, and you take smaller steps, which tends to be
good. With the sum-of-absolute-values the magnitude of both components
of the gradient will always be 1, regardless of how close you are to the
circle, so you can easily get into a situation where you keep jumping back
and forth between the inside and outside of the circle, without actually
getting closer to the circle (unless you have a scheme for reducing your
learning rate while you are doing this).)
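
The back-and-forth jumping can be seen in a one-dimensional toy version. As a simplifying assumption (mine, not part of the argument above), let d be the signed distance from the circle, positive outside and negative inside, and compare the penalties |d| and d^2 under a fixed learning rate. The function names, starting distance, and learning rate are all hypothetical choices for the sketch.

```python
def sign(v):
    # Returns -1, 0, or 1.
    return (v > 0) - (v < 0)

def run(grad, d0=0.5, lr=0.3, steps=20):
    # Fixed-learning-rate gradient descent on the signed distance d.
    d = d0
    history = [d]
    for _ in range(steps):
        d = d - lr * grad(d)
        history.append(d)
    return history

# Penalty |d|: the gradient is always +/-1, so every step has size lr.
abs_history = run(lambda d: float(sign(d)))

# Penalty d**2: the gradient 2*d shrinks as d approaches zero.
sq_history = run(lambda d: 2.0 * d)

print([round(d, 3) for d in abs_history[:6]])
print([round(d, 3) for d in sq_history[:6]])
```

In this toy setup the |d| run settles into hopping between the outside and the inside of the circle without getting any closer, while the d^2 run takes ever smaller steps and converges. Reducing the learning rate over time would let the |d| run converge too.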

The geometry of 500x500 orthogonal matrices has much more structure
than that of a circle, but this is the basic idea of what is going on. (At least
this is what I think is going on …)

Best.

K. Frank