Performance of pinverse

I've been looking at ways to do multi-linear regression with PyTorch, since my benchmarking shows it performs really well for basic matrix multiplication and inversion.

Now, I'm a bit confused by the performance of pinverse(), and I'm also wondering about precision after seeing this thread:

In my benchmarks, I solve a multi-linear equation with different methods, using 100,000 samples and 150 parameters:
Basic matrix inversion (see code below)
i7 2.8GHz (8 cores)

  • 96.4 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Dual Xeon E5-2650 v2 @ 2.60GHz (32 cores)

  • 176 ms ± 4.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

That's a 1.8x factor between the two machines, which I find odd to begin with, but it's not even the main concern.

The Moore-Penrose pseudo-inverse version takes:
i7 2.8GHz (8 cores)

  • 1.87 s ± 64 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Dual Xeon E5-2650 v2 @ 2.60GHz (32 cores)

  • 903 ms ± 38.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Here the dual Xeon is noticeably about 2x faster.

For comparison, statsmodels OLS (which uses a version of the Moore-Penrose pseudo-inverse) gives me this:

i7 2.8GHz (8 cores)

  • 1.89 s ± 129 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Dual Xeon E5-2650 v2 @ 2.60GHz (32 cores)

  • 2.8 s ± 206 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

OLS and pinverse perform about the same on the i7, but on the Xeon it’s night and day.
What is going on here? It seems due to the hardware difference, but both machines have several cores and the timing doesn't scale with the core count.
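One sanity check worth doing (a minimal sketch using PyTorch's standard thread controls) is to pin both machines to the same intra-op thread count before timing:

import torch

print(torch.get_num_threads())  # how many intra-op threads PyTorch will use on this machine

# Pin both machines to the same thread count for an apples-to-apples comparison.
torch.set_num_threads(8)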

PS: the timings include the NumPy-to-tensor casting, but I benchmarked that separately and it's negligible.
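For reference, this is roughly how the casting cost can be checked (a sketch using timeit; torch.from_numpy shares memory with the ndarray rather than copying, so it should be cheap):

import timeit

import numpy as np
import torch

x1 = np.ones((100_000, 151))  # same shape as the design matrix below
# from_numpy wraps the existing buffer (no copy), so this is essentially constant time
print(timeit.timeit(lambda: torch.from_numpy(x1), number=1000))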

My code:

import time

import numpy as np
import torch

# x (n-by-m NumPy array) and y (length-n NumPy array) are defined elsewhere

# simple inverse-matrix linear equation solving: beta = (X^T X)^-1 X^T y
t = time.time()
n = x.shape[0]
m = x.shape[1]
x1 = np.append(np.ones((n, 1)), x.reshape((n, m)), 1)  # prepend an intercept column
tx = torch.from_numpy(x1)
ty = torch.from_numpy(y.reshape((n, 1)))
beta = ((tx.transpose(0, 1).matmul(tx)).inverse()).matmul(tx.transpose(0, 1).matmul(ty))
dt = time.time() - t

# Moore-Penrose pseudo-inverse version: beta = pinv(X) y
t = time.time()
n = x.shape[0]
m = x.shape[1]
x1 = np.append(np.ones((n, 1)), x.reshape((n, m)), 1)
tx = torch.from_numpy(x1)
ty = torch.from_numpy(y)
mpinv = torch.pinverse(tx)
beta = torch.tensordot(mpinv, ty, dims=1)
dt = time.time() - t
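For anyone who wants to reproduce these timings without the original data, a minimal stand-in with synthetic inputs of the stated shape (my assumption: Gaussian float64 data) would be:

import numpy as np

np.random.seed(0)
n, m = 100_000, 150
x = np.random.randn(n, m)              # 100,000 samples, 150 params
true_beta = np.random.randn(m + 1, 1)  # intercept + 150 coefficients
y = (np.append(np.ones((n, 1)), x, 1) @ true_beta
     + 0.01 * np.random.randn(n, 1)).ravel()  # noisy 1-D targets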

Thanks for doing these benchmarks. In PyTorch, most CPU linear-algebra ops are done via BLAS/LAPACK (e.g., pinv uses gesdd, I believe). In the shipped binary we use Intel's MKL. I assume that Intel is able to do more optimizations on their higher-tier "workstation-level" CPUs (e.g., Xeon) than on the regular i7.
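If it helps to double-check which backend a given build actually links against, recent PyTorch versions expose the build configuration (a quick check, not a fix):

import torch

print(torch.__config__.show())  # lists the BLAS/LAPACK backend the binary was built with (e.g., MKL)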

Thanks for your answer.
I understand that in the end, LAPACK is used.
For the simple inversion, PyTorch is using getrf (an LU decomposition, see here), and for the Moore-Penrose pseudo-inverse it's using an SVD, via gesdd as you point out from the docs.
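To see whether the gesdd call itself dominates, one thing I can try is timing the SVD separately from the pseudo-inverse assembly (a sketch reusing tx from my code above; torch.svd returns the reduced SVD):

import time

import torch

t = time.time()
u, s, v = torch.svd(tx)  # the underlying gesdd call
print('svd:', time.time() - t)

t = time.time()
# pinv(X) = V diag(1/s) U^T, assembled from the SVD factors
mpinv_manual = v.matmul(torch.diag(1.0 / s)).matmul(u.transpose(0, 1))
print('assembly:', time.time() - t)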

Maybe Intel does more optimization on their higher-end CPUs, but then how come the simple matrix inversion takes so much longer on the Xeon?

If I believe this benchmark, the Xeon should be way faster, yet in my benchmark it is in fact slower, which is a bit confusing.

Also confusing is the fact that, comparatively, statsmodels OLS, whose pseudo-inverse is likewise based on SVD and implemented with LAPACK's gesdd, is also slower on the Xeon, so gesdd alone can't explain the difference here.
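Since statsmodels goes through NumPy's linear algebra, it may even be linked against a different LAPACK than PyTorch; NumPy can report which one (a quick check):

import numpy as np

np.show_config()  # prints the BLAS/LAPACK libraries NumPy was built against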

I'm going to benchmark the tensordot operation on its own and see whether it makes a difference, though that should be a fairly trivial operation.
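Something like this should isolate it (a sketch reusing mpinv and ty from the code above):

import time

import torch

t = time.time()
beta = torch.tensordot(mpinv, ty, dims=1)  # (151, n) · (n,) -> (151,)
print('tensordot:', time.time() - t)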

I ran into a similar issue, as posted here: https://github.com/pytorch/pytorch/issues/18558. Any ideas?