Hi, I’ve been looking into torch.cholesky and torch.cholesky_inverse for large matrices and have experienced some problems with them.
When I run the following script (in an IPython session, hence the %time magics):
import torch
A = torch.randn(30000, 40000, device='cuda')
A = A @ A.t()
%time L = torch.cholesky(A)
del A
%time A_inv = torch.cholesky_inverse(L)
it prints:
CPU times: user 24.5 s, sys: 352 ms, total: 24.9 s
Wall time: 22 s
CPU times: user 3min 36s, sys: 969 ms, total: 3min 37s
Wall time: 42.7 s
I’m really confused by the long run time of cholesky_inverse. As I understand it, it computes L_inv.t() @ L_inv, and L_inv should be cheap to obtain since L is a lower triangular matrix. However, it runs much longer than the Cholesky factorization itself, which needs O(n^3) time. At these sizes I would expect cholesky_inverse to finish almost instantly.
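As a sanity check on what cholesky_inverse actually computes, here is a small CPU example (a sketch; I use torch.linalg.cholesky, the newer name for the factorization, and a tiny size so it runs anywhere):

```python
import torch

# Sanity check: cholesky_inverse(L) should equal inverse(A) for A = L @ L.t()
torch.manual_seed(0)
n = 4
A = torch.randn(n, n)
A = A @ A.t() + n * torch.eye(n)   # make A symmetric positive definite

L = torch.linalg.cholesky(A)       # lower-triangular Cholesky factor
A_inv = torch.cholesky_inverse(L)  # computes (L @ L.t()).inverse()

print(torch.allclose(A_inv, torch.inverse(A), atol=1e-5))
```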
I also have a doubt about the memory consumption of cholesky. I thought that if I call torch.cholesky(A, out=A) on a 50000 x 50000 float matrix A (9.31 GB), I should not run out of memory on a 12 GB GPU. However, I can see that PyTorch tries to allocate another 9.31 GB, which throws an OOM error. My guess is that even though a Cholesky decomposition can be computed in place, the corresponding MAGMA routine doesn’t do exactly that and allocates another 9.31 GB of memory for performance reasons?
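The back-of-the-envelope arithmetic behind that OOM (assuming float32 and one extra working copy of the matrix):

```python
# Memory for a 50000 x 50000 float32 matrix, and for that matrix plus one copy
n = 50000
bytes_per_elem = 4  # float32
matrix_gib = n * n * bytes_per_elem / 2**30
print(f"one matrix:  {matrix_gib:.2f} GiB")       # ~9.31 GiB
print(f"with a copy: {2 * matrix_gib:.2f} GiB")   # ~18.63 GiB > 12 GiB
```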
I realized that the very slow run times were due to another user occupying the GPU node and using up all the CPU resources. With the same setup, cholesky now takes ~2.5 s and cholesky_inverse ~0.6 s (I would still expect the latter to be faster, but it’s not a problem now).
However, I’m still wondering whether it’s possible to perform the Cholesky decomposition in place, without allocating extra memory?
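In case anyone wants to reproduce the allocation behaviour, here is a sketch for measuring the allocator’s peak during the factorization (requires a GPU; the size n is an assumption chosen to fit most cards, so scale it up to approach the limit):

```python
import torch

def peak_gpu_mib(fn):
    """Run fn() and return the peak CUDA memory (MiB) allocated during the call."""
    torch.cuda.reset_peak_memory_stats()
    fn()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**20

if torch.cuda.is_available():
    n = 8000  # ~256 MiB per float32 matrix; increase to reproduce the OOM
    A = torch.randn(n, n, device='cuda')
    A = A @ A.t() + n * torch.eye(n, device='cuda')  # symmetric positive definite
    resident = torch.cuda.memory_allocated() / 2**20
    peak = peak_gpu_mib(lambda: torch.linalg.cholesky(A))
    print(f"resident: {resident:.0f} MiB, peak during cholesky: {peak:.0f} MiB")
else:
    print("CUDA not available; run on a GPU to measure")
```

If the peak is roughly the resident size plus another full matrix, the routine is indeed allocating a separate output/workspace rather than factorizing strictly in place.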
Another update. It turns out that what I wrote in the previous post was not really the case: I forgot that CUDA calls are asynchronous, so the timings were wrong. I added torch.cuda.synchronize() calls and ran the following code:
import torch
from time import time
A = torch.randn(30000, 35000, device='cuda')
A = A @ A.t()  # symmetric positive definite, 30000 x 30000
torch.cuda.synchronize()
t = time()
A_inv = torch.inverse(A)
torch.cuda.synchronize()
print('torch.inverse', time() - t)
t = time()
torch.cholesky(A, out=A)
torch.cuda.synchronize()
print('torch.cholesky', time() - t)
t = time()
torch.cholesky_inverse(A, out=A)
torch.cuda.synchronize()
print('torch.cholesky_inverse', time() - t)
So again, cholesky_inverse takes an extremely long time. I would expect it to be many times faster than cholesky, but that doesn’t seem to be the case. What is going on here?
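For completeness, the same comparison can be made with CUDA events instead of time() plus synchronize, which sidesteps the async-timing pitfall entirely (a sketch; n is an assumption scaled down so it runs quickly):

```python
import torch

def cuda_time_ms(fn, warmup=1, iters=3):
    """Average GPU time of fn() in milliseconds, measured with CUDA events."""
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

if torch.cuda.is_available():
    n = 4000
    A = torch.randn(n, n, device='cuda')
    A = A @ A.t() + n * torch.eye(n, device='cuda')  # symmetric positive definite
    L = torch.linalg.cholesky(A)
    print(f"cholesky:         {cuda_time_ms(lambda: torch.linalg.cholesky(A)):.1f} ms")
    print(f"cholesky_inverse: {cuda_time_ms(lambda: torch.cholesky_inverse(L)):.1f} ms")
else:
    print("CUDA not available")
```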