Batched torch.svd_lowrank much slower than loop implementation (both CPU and GPU)

I've found torch.svd_lowrank to be up to 2x slower on both CPU and GPU when given a batched input, compared to looping over the batch and calling it once per matrix.

If that is the case, perhaps the batched implementation could internally fall back to a loop so it is faster; a sketch of what that could look like is below.
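
For illustration only, here is a minimal sketch of such a fallback (svd_lowrank_looped is a hypothetical helper name, not part of the PyTorch API; q and niter mirror the defaults of torch.svd_lowrank):

import torch

def svd_lowrank_looped(A, q=6, niter=2):
    # Hypothetical fallback: apply torch.svd_lowrank to each batch
    # element and stack the results to mimic the batched output shapes.
    if A.dim() == 2:
        return torch.svd_lowrank(A, q=q, niter=niter)
    results = [torch.svd_lowrank(a, q=q, niter=niter) for a in A]
    u, s, v = (torch.stack(parts, dim=0) for parts in zip(*results))
    return u, s, v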

Simple test:

import torch

p = torch.randn(7000, 3)
d = torch.cdist(p, p, p=2)  # 7000 x 7000 pairwise-distance matrix
# optionally move to GPU:
# d = d.to(torch.device("cuda:0"))

# loop implementation - faster
u, s, v = [], [], []
for i in range(5):
    u_, s_, v_ = torch.svd_lowrank(d)
    u.append(u_)
    s.append(s_)
    v.append(v_)
u = torch.stack(u, dim=0)
s = torch.stack(s, dim=0)
v = torch.stack(v, dim=0)

# batched implementation - up to 2x slower
u, s, v = torch.svd_lowrank(torch.stack([d] * 5, dim=0))
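
To make the comparison concrete, here is one way the two paths can be timed (just a sketch, reusing d from above; the torch.cuda.synchronize calls only matter when d is on GPU):

import time

def bench(fn, repeats=3):
    # Best-of-N wall-clock timing; synchronize around the call so
    # queued CUDA kernels don't skew the measurement on GPU.
    best = float("inf")
    for _ in range(repeats):
        if d.is_cuda:
            torch.cuda.synchronize()
        t0 = time.perf_counter()
        fn()
        if d.is_cuda:
            torch.cuda.synchronize()
        best = min(best, time.perf_counter() - t0)
    return best

d_batched = torch.stack([d] * 5, dim=0)
print("loop:   ", bench(lambda: [torch.svd_lowrank(d) for _ in range(5)]))
print("batched:", bench(lambda: torch.svd_lowrank(d_batched)))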

Note: I have found the batched and loop implementations to be on par for small matrices (n < 2000 for n x n matrices), but to differ substantially for large sizes (on both CPU and GPU).