Torch.linalg.lstsq takes way too long

Hi, I am currently working on some least-squares problems involving a feedforward neural network. Long story short, I am trying to approximate some of the results obtained by the network with a linear solve, using lstsq. The problem is the following: whereas my colleague achieves pretty fast results using lstsq (locally, not on a computing cluster), on the order of 0.5 s to solve the problem, mine takes a very long time (approx. 30 s, while the neural network itself trains for 1000 epochs in half that time). My question is: can this somehow be related to the PyTorch installation? I have heard that some of the optimized linear-algebra libraries may not be correctly linked in my current PyTorch installation, but I really don't know how to check or fix that.
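For reference, this is roughly how I am timing the solve in isolation (the sizes below are made-up placeholders, not my actual matrices, which come from the network):

```python
import time
import torch

# Placeholder sizes for illustration only; my real A and B come from
# the trained network's activations/targets.
A = torch.randn(2000, 500)
B = torch.randn(2000, 10)

t0 = time.perf_counter()
sol = torch.linalg.lstsq(A, B)
t1 = time.perf_counter()

print(f"lstsq took {t1 - t0:.3f} s, solution shape {tuple(sol.solution.shape)}")
```

With comparable sizes, my colleague's run finishes well under a second while mine does not.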

Has anyone else had problems with these methods taking way too long?

If any source code/installation configuration is needed, please let me know.

Thanks!

Could you potentially share your code and your exact install env?

Hi Cristian!

This is a long shot, but this would be expected if you're using intel 'xpu' instead of nvidia 'cuda' for your work.

PyTorch with xpu accelerates the bulk of the core tensor operations, but many, if not all,
linalg operations aren't supported on the xpu and fall back to the cpu:

>>> import torch
>>> torch.__version__
'2.7.1+xpu'
>>> a = torch.randn (4, 4, device = 'xpu')
>>> b = torch.randn (4, 4, device = 'xpu')
>>> torch.linalg.lstsq (a, b)
<python-input-4>:1: UserWarning: Aten Op fallback from XPU to CPU happends. This may have performance implications. If need debug the fallback ops please set environment variable `PYTORCH_DEBUG_XPU_FALLBACK=1`  (Triggered internally at /pytorch/build/xpu/ATen/RegisterXPU_0.cpp:53693.)
torch.return_types.linalg_lstsq(
solution=tensor([[ 0.9394, -1.1806,  0.2140, -0.0070],
        [-0.4890, -0.1218, -0.0116, -0.4881],
        [-2.8667, -0.9415, -0.6335, -2.3435],
        [-0.3946, -0.4613,  0.5303, -0.5121]], device='xpu:0'),
residuals=tensor([], device='xpu:0'),
rank=tensor(4, device='xpu:0'),
singular_values=tensor([], device='xpu:0'))

Best.

K. Frank

Hi Frank,
my current torch version is 2.6.0+cu124, which is cuda-based.

I think the problem may be related to the BLAS library… Have you ever encountered something like that?
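In case it helps diagnose this, my understanding is that you can inspect which BLAS/LAPACK backend a given PyTorch binary was built against via `torch.__config__.show()` (the exact output varies by build, so this is just a sketch of the check):

```python
import torch

# The build-config string includes which BLAS/LAPACK backend
# (e.g. MKL, OpenBLAS) this PyTorch binary was compiled against.
cfg = torch.__config__.show()
print(cfg)

# True if this build is linked against Intel MKL.
print(torch.backends.mkl.is_available())
```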

Hi Cristian!

No, I haven't. When cuda works for me on, say, basic tensor operations, I haven't had any
trouble with torch.linalg (but I don't have experience with a variety of cuda hardware).
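One more long shot: if you suspect the cuda linalg path specifically, pytorch exposes a knob to query (and switch) which backend it prefers for cuda linalg ops, cusolver vs. magma. I haven't needed it myself, so treat this as a sketch:

```python
import torch

# Query the currently preferred backend for cuda linalg operations;
# calling with an argument (e.g. "cusolver" or "magma") switches it.
backend = torch.backends.cuda.preferred_linalg_library()
print(backend)
```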

Best.

K. Frank