In the following graph, you can see the execution time in seconds for torch.svd on both CPU and GPU. As you can see, the GPU time is larger than the CPU time and keeps increasing as the number of records grows. Note that the GPU has 12 GB of memory, and the maximum number of records that can be run with svd is 3.25 million.
import torch
from datetime import datetime
import sys

# sys.argv values are strings, so convert to int before using as a size
n_records = int(sys.argv[1])
x = torch.zeros(n_records, 300).cuda()
s = datetime.now()
u, _, _ = torch.svd(x, some=True)
e = datetime.now()
del x
el = (e - s).total_seconds()
print(el)
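Since CUDA kernel launches are asynchronous, wall-clock timings like the one above can be misleading unless the device is synchronized before reading the clock. Below is a minimal sketch (the helper name `time_svd` is my own, not from the thread) that synchronizes around the call and times both devices, falling back to CPU-only when no GPU is present:

```python
import torch
from datetime import datetime

def time_svd(n_records, device):
    """Time torch.svd on the given device for an (n_records, 300) zero matrix."""
    x = torch.zeros(n_records, 300, device=device)
    if device == "cuda":
        torch.cuda.synchronize()  # ensure the allocation has finished before starting the clock
    start = datetime.now()
    u, s, v = torch.svd(x, some=True)
    if device == "cuda":
        torch.cuda.synchronize()  # wait for all kernels to complete before stopping the clock
    return (datetime.now() - start).total_seconds()

# Compare both devices; the GPU entry only runs if CUDA is available.
devices = ["cpu"] + (["cuda"] if torch.cuda.is_available() else [])
for device in devices:
    print(device, time_svd(10_000, device))
```

Even with correct synchronization, the CPU-vs-GPU gap described in this thread persists for SVD sizes like these, which points at the implementation rather than the measurement.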
Following is the output from the NVIDIA Visual Profiler for 1000 and 1 million records, respectively. As you can see, after the "kernelPointWiseApply2" CUDA kernel invocation a certain amount of idle time is introduced. This idle time between the first and second kernel functions in the "compute" section has also increased.
This is because MAGMA's implementation of ?gesvd is used under the hood. Unfortunately, that implementation is not purely on the GPU and still calls LAPACK/BLAS functions (https://github.com/maxhutch/magma/blob/master/src/dgesvd.cpp). That is probably why you are seeing this idle time on the GPU.
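Given that the GPU path round-trips through LAPACK anyway, one pragmatic workaround is to run the decomposition on a CPU tensor and move the factors back afterwards. This is a sketch of that idea, not an official recipe; the helper name `svd_via_cpu` is hypothetical:

```python
import torch

def svd_via_cpu(x):
    """Run torch.svd on CPU and move the factors back to x's original device.

    Workaround sketch: since MAGMA's gesvd falls back to LAPACK internally,
    doing the decomposition on a CPU tensor avoids the GPU<->CPU ping-pong
    inside the GPU kernel, at the cost of one explicit transfer each way.
    """
    u, s, v = torch.svd(x.cpu(), some=True)
    return u.to(x.device), s.to(x.device), v.to(x.device)

x = torch.randn(1000, 300)
u, s, v = svd_via_cpu(x)
# u @ diag(s) @ v.T reconstructs x up to floating-point error
```

Whether this wins depends on matrix size and transfer bandwidth, so it is worth benchmarking on the actual workload.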
And yes, from what I see in the file you pointed to, this is not a pure GPU implementation; there are several calls to LAPACK. Thanks very much for the clarification.
Hi, did you solve the problem of the GPU being slower than the CPU? I am running into the same issue with torch.gesv. Waiting for your reply, thank you!
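For anyone hitting this with linear solves: torch.gesv was later renamed, and in current PyTorch the equivalent call is torch.linalg.solve. A minimal sketch (assuming a recent PyTorch version) of the same operation:

```python
import torch

# torch.gesv(b, A) solved A @ x = b; in modern PyTorch this is torch.linalg.solve.
A = torch.randn(100, 100)
b = torch.randn(100, 1)
x = torch.linalg.solve(A, b)  # LU-based solve of A @ x = b
print(torch.allclose(A @ x, b, atol=1e-3))
```

The same caveat as for SVD applies: for small systems, transfer and launch overhead can make the GPU slower than the CPU, so benchmark at your actual problem size.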