I’ve found `torch.svd_lowrank` to be up to 2x slower, on both CPU and GPU, when called on a batched input compared to looping over the batch and calling it once per matrix.

Perhaps the batched implementation could fall back internally to a loop over the batch dimension, since that is faster.

Simple test:

```
import torch
p = torch.randn(7000, 3)
d = torch.cdist(p, p, p=2)
# optional
# d = d.to(torch.device("cuda:0"))
# loop implementation - faster
u,s,v = [], [], []
for i in range(5):
    u_, s_, v_ = torch.svd_lowrank(d)
    u.append(u_)
    s.append(s_)
    v.append(v_)
u = torch.stack(u, dim=0)
s = torch.stack(s, dim=0)
v = torch.stack(v, dim=0)
# batched implementation - 2x slower
u,s,v = torch.svd_lowrank(torch.stack([d]*5, dim=0))
```

**Note**: I have found the batched and loop implementations to be on par for small matrices (n < 2000 for n x n matrices), but very different at larger sizes (on both CPU and GPU).
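To make the comparison easier to reproduce, here is a minimal timing sketch of the two paths using `time.perf_counter`. The matrix size `n`, batch size, and rank `q` here are illustrative (a small `n` keeps the run fast; the reported gap shows up at larger sizes such as n = 7000):

```
import time
import torch

def loop_svd(d, batch=5, q=6):
    # call svd_lowrank once per batch element, then stack the results
    u, s, v = [], [], []
    for _ in range(batch):
        u_, s_, v_ = torch.svd_lowrank(d, q=q)
        u.append(u_)
        s.append(s_)
        v.append(v_)
    return torch.stack(u), torch.stack(s), torch.stack(v)

def batched_svd(d, batch=5, q=6):
    # single call on a stacked (batch, n, n) tensor
    return torch.svd_lowrank(torch.stack([d] * batch), q=q)

n = 1000  # illustrative; increase (e.g. to 7000) to reproduce the reported gap
p = torch.randn(n, 3)
d = torch.cdist(p, p, p=2)

t0 = time.perf_counter()
u_l, s_l, v_l = loop_svd(d)
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
u_b, s_b, v_b = batched_svd(d)
t_batch = time.perf_counter() - t0

print(f"loop: {t_loop:.3f}s  batched: {t_batch:.3f}s")
```

Both paths produce tensors of the same shapes (`(5, n, 6)` for U and V, `(5, 6)` for S with the default `q=6`), so the outputs are directly comparable.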