PyTorch's CUDA SVD function takes 3x the time of CuPy's

When running the following code, I found that PyTorch took ~9 seconds whereas CuPy took ~3 seconds. Setting the preferred linear algebra library to cuSOLVER didn’t seem to make a difference. Does anyone have any ideas why PyTorch is so much slower here, and what I might do to fix it?

import cupy
import torch
import time

def benchmark(name, f):
    f() # Warmup
    start = time.perf_counter()
    count = 0
    while True:
        f()
        curr = time.perf_counter()
        count += 1
        if curr >= start + 1:
            result = count / (curr - start)
            print(f'{name}: {result:.3g}/sec')
            return

torch.set_default_dtype(torch.float64)
torch.set_default_device('cuda')
# torch.backends.cuda.preferred_linalg_library('cusolver')

size = 500
mat = torch.rand(size, size)
c_mat = cupy.random.rand(size, size, dtype=cupy.float64) # dtype is float64

def svd_torch():
    torch.linalg.svd(mat)
    # Wait for the GPU to finish using PyTorch's own synchronization call
    torch.cuda.synchronize()

def svd_cupy():
    cupy.linalg.svd(c_mat)
    cupy.cuda.Device().synchronize()

benchmark('svd-torch', svd_torch)
benchmark('svd-cupy', svd_cupy)

Slower performance is expected when using float64 on the GPU. I’m not deeply familiar with CuPy, but it looks like the CPU is being used (based on the dtype). Using float32 gives the expected speedup:

svd-torch: 42.5/sec
svd-cupy: 9.54/sec
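To check how much of the gap comes from the dtype alone, something like the following could time PyTorch's SVD for both float32 and float64 on the same device. This is only a rough sketch: the helper names `iterations_per_second` and `compare_dtypes` are made up for illustration, and it assumes a CUDA-capable GPU with PyTorch installed.

```python
import time

def iterations_per_second(f, sync=lambda: None, min_seconds=1.0):
    """Call f repeatedly for at least min_seconds and return calls per second."""
    f()     # warmup call, not timed
    sync()
    start = time.perf_counter()
    count = 0
    while True:
        f()
        sync()  # wait for queued GPU work before reading the clock
        count += 1
        elapsed = time.perf_counter() - start
        if elapsed >= min_seconds:
            return count / elapsed

def compare_dtypes(size=500):
    """Time torch.linalg.svd for float32 and float64 (hypothetical helper, needs a GPU)."""
    import torch  # imported here so the timing helper works without PyTorch
    results = {}
    for dtype in (torch.float32, torch.float64):
        mat = torch.rand(size, size, dtype=dtype, device='cuda')
        results[dtype] = iterations_per_second(
            lambda: torch.linalg.svd(mat), torch.cuda.synchronize)
    return results
```

Calling `compare_dtypes()` returns a dict mapping each dtype to its measured rate, which can be printed in the same `name: rate/sec` style as above.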

CuPy generally runs on the GPU, and the documentation for this function states that it calls cuSOLVER (see cupy.linalg.svd in the CuPy 13.0.0 documentation). When I run a 64-bit double matrix through cuSOLVER directly, using cusolverDnDgesvd, I get about 5 iterations per second. The gap between that and CuPy may be due to CuPy using a different algorithm, e.g. gesvdj.
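One way to test the "different algorithm" idea from the PyTorch side is the `driver` keyword of `torch.linalg.svd`, which selects the cuSOLVER method for CUDA inputs (`gesvd`, `gesvdj`, or `gesvda`). The sketch below is an untested assumption about how such a comparison might look, not a confirmed explanation of the gap; `mean_seconds` and `compare_drivers` are hypothetical names, and it needs a CUDA GPU with PyTorch installed.

```python
import time

# None lets PyTorch pick its default; the rest are cuSOLVER method names
DRIVERS = (None, 'gesvd', 'gesvdj', 'gesvda')

def mean_seconds(f, sync=lambda: None, repeats=10):
    """Average wall-clock seconds per call of f, synchronizing after each call."""
    f()     # warmup call, not timed
    sync()
    start = time.perf_counter()
    for _ in range(repeats):
        f()
        sync()
    return (time.perf_counter() - start) / repeats

def compare_drivers(size=500, repeats=10):
    """Time float64 SVD once per cuSOLVER driver (hypothetical helper, needs a GPU)."""
    import torch  # imported here so the timing helper works without PyTorch
    # The driver keyword applies to the cuSOLVER backend, so request it explicitly
    torch.backends.cuda.preferred_linalg_library('cusolver')
    mat = torch.rand(size, size, dtype=torch.float64, device='cuda')
    timings = {}
    for driver in DRIVERS:
        timings[driver] = mean_seconds(
            lambda: torch.linalg.svd(mat, driver=driver),
            torch.cuda.synchronize, repeats)
    return timings
```

If one driver lands close to the CuPy number, that would hint at which method each library defaults to. The drivers can differ in accuracy as well as speed, so the singular values are worth comparing before switching.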