`torch.linalg.svd` issues a `cudaMemcpyAsync` that forces a host-device sync

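From profiling, my understanding is that `torch.linalg.svd` performs a numerical error check, similar to the one in `torch.linalg.cholesky`, and that this check is what triggers the sync. Unlike `cholesky`, though, `svd` has no `_ex` variant that skips the check and avoids the sync.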
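To make the comparison concrete, here is a minimal sketch of the `_ex` pattern I mean, plus one way to surface the sync. It assumes a recent PyTorch build. `make_spd` is a hypothetical helper, not a PyTorch function. `torch.cuda.set_sync_debug_mode` is an experimental API and may not report every synchronizing call, so treat its output as a hint rather than proof.

```python
import torch


def make_spd(n: int, device: str) -> torch.Tensor:
    """Hypothetical helper: build a random symmetric positive-definite matrix."""
    a = torch.randn(n, n, device=device, dtype=torch.float64)
    return a @ a.T + n * torch.eye(n, device=device, dtype=torch.float64)


device = "cuda" if torch.cuda.is_available() else "cpu"
A = make_spd(4, device)

# cholesky_ex returns an `info` tensor instead of raising. With
# check_errors=False the status is left in a tensor and is not
# inspected on the host at this point.
L, info = torch.linalg.cholesky_ex(A, check_errors=False)

if device == "cuda":
    # Ask PyTorch to warn about synchronizing CUDA calls during the SVD.
    torch.cuda.set_sync_debug_mode("warn")
    U, S, Vh = torch.linalg.svd(A, full_matrices=False)
    torch.cuda.set_sync_debug_mode("default")
else:
    # No CUDA device: run the SVD anyway so the script stays runnable.
    U, S, Vh = torch.linalg.svd(A, full_matrices=False)
```

With `cholesky_ex`, the caller decides if and when to read `info`. There is no equivalent entry point for `svd`, which is the gap this issue is about.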

Is there a way to avoid this sync?