torch.svd is slow on GPU compared to CPU

In the following graph, you can see the execution time in seconds taken by torch.svd on both CPU and GPU. The time taken on the GPU is larger than the time taken on the CPU, and it keeps increasing as the number of records grows. Please note that the GPU has 12 GB of memory and the maximum number of records that can be run with svd is 3.25 million.

Following is the code I used.

import torch
from datetime import datetime
import sys

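n_records = int(sys.argv[1])  # command-line arguments arrive as strings
x = torch.zeros(n_records, 300).cuda()
torch.cuda.synchronize()  # make sure the tensor is on the GPU before the clock starts
s = datetime.now()
u, _, _ = torch.svd(x, some=True)
torch.cuda.synchronize()  # wait for any queued GPU work before the clock stops
e = datetime.now()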
del x
el = (e-s).total_seconds()
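
Since CUDA work can be queued asynchronously, a single start/stop measurement may also pick up one-time setup cost. Here is a minimal sketch of a timing helper that does a warm-up call and synchronizes around each measurement. The name time_call and its parameters are hypothetical, not part of PyTorch, and the workload in the demo is a stand-in so the snippet runs without a GPU.

```python
import time


def time_call(fn, sync=None, warmup=1, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls of fn().

    sync is an optional zero-argument callable (for example
    torch.cuda.synchronize) that is run right before the clock starts and
    right after fn() returns, so queued asynchronous work is counted.
    """
    for _ in range(warmup):
        fn()  # warm-up call: absorbs one-time setup such as library initialisation
    best = float("inf")
    for _ in range(repeats):
        if sync is not None:
            sync()  # drain pending work so it is not billed to this call
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for the work launched by fn() to finish
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    # Stand-in CPU workload (assumption: replace with the real SVD call).
    elapsed = time_call(lambda: sum(i * i for i in range(100_000)))
    print(f"best of 3: {elapsed:.6f} s")
```

With the script above, the call would presumably look like time_call(lambda: torch.svd(x, some=True), sync=torch.cuda.synchronize) on the GPU, and the same call without sync for a CPU tensor.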

Following is the output from the NVIDIA Visual Profiler for 1000 and 1 million records respectively. As you can see, after the ‘kernelPointWiseApply2’ CUDA kernel invocation a certain amount of idle time is introduced. This idle time between the first and second kernel functions in the “compute” section also grows with the larger input.

Could anyone help me understand why this idle time occurs and why it keeps increasing as the size of the dataset increases?

This is because magma’s implementation of ?gesvd is used under the hood. Unfortunately, the implementation is not purely on GPU and still calls LAPACK/BLAS functions, which run on the CPU. That is probably why you are seeing these idle times on GPU.


So my understanding here is that PyTorch uses a modified version of magma (which includes more BLAS) and not the original magma. Where in the code is the magma function actually invoked? Is it in the file I linked? I would like to get an idea of the path of invocation for SVD.

No. magma itself calls LAPACK/BLAS. PyTorch doesn’t modify it. There are several LAPACK calls in magma’s ?gesdd.

Sorry I was linking the wrong file.

And yes, the link you provided is part of the execution path.


Sorry, I think you might have misunderstood the first line of my reply. What I meant was that, as I understand it, there is the original magma library and there is a fork of magma that includes more BLAS. The fork is what PyTorch uses. Is that correct?

And yes, from what I see in the file you pointed out, this is not a pure GPU implementation; there are several calls to LAPACK. Thanks very much for the clarification.

Oh I see. Sorry about that. I just took the first Google result and didn’t look at the owner name :smiley:

I believe we use the original magma.

If so, I believe this should be the file. I wasn’t able to find an sgesdd.cpp in the magma repo.

Hi, did you solve the problem of the GPU being slower than the CPU? I encountered the same issue when using the function torch.gesv. Waiting for your reply, thank you!


I got the same issue here…

Out[14]: torch.Size([1140, 1140])
which has a low-rank structure

u,s,v = torch.svd(temp)
it took 3.5 seconds!
u,s,v = torch.svd(temp.cpu())
it took 0.23 seconds!

ridiculously slow here…
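
Given the timings above, one workaround is to run the decomposition on the CPU and move the factors back to the original device. This is only a sketch: svd_on_cpu is a hypothetical helper name, the low-rank test matrix is made up to mimic the 1140 x 1140 case, and whether the extra transfers pay off will depend on the matrix size and hardware.

```python
import torch


def svd_on_cpu(x, some=True):
    """Hypothetical helper: run torch.svd on the CPU, return factors on x's device."""
    device = x.device
    # Copy to host memory and decompose there (x = U diag(S) V^T).
    u, s, v = torch.svd(x.cpu(), some=some)
    # Move the factors back so downstream code can stay on the original device.
    return u.to(device), s.to(device), v.to(device)


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Made-up low-rank 1140 x 1140 matrix (rank at most 8), an assumption for the demo.
    temp = torch.randn(1140, 8, device=device) @ torch.randn(8, 1140, device=device)
    u, s, v = svd_on_cpu(temp)
    print(u.shape, s.shape, v.shape, u.device)
```

The factors come back on the same device as the input, so the rest of the pipeline does not need to change.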

Hello, have you solved this problem? I also ran into it and do not know how to solve it.