First call to eigh(), but not subsequent calls, gives "cusolver error: CUSOLVER STATUS EXECUTION FAILED"

KFrank · June 26, 2022, 10:36pm

Hi Forum!

I see a cuda error (in a particular situation) the first time I call eigh(),
but not on subsequent calls. (This is on the current stable, 1.11.0, but
I also see it on the latest nightly, 1.13.0.dev20220626.)

By “first time” I mean that I start a new python session and then run the
script below – the first call fails and the second succeeds. If I rerun the
script in the same python session, both calls succeed.

Here is the code that reliably reproduces the issue for me:

import torch

print (torch.__version__)
print (torch.version.cuda)
print (torch.cuda.get_device_name())

_ = torch.manual_seed (2022)

matrices = torch.randn (10000, 200, 200, device = 'cuda')

print ('eigh first attempt ...')
try:
    torch.linalg.eigh (matrices)
except Exception as e:
    print ('exception caught:', str (e))

print ('eigh second attempt ...')
try:
    torch.linalg.eigh (matrices)
except Exception as e:
    print ('exception caught:', str (e))

And here is its output:

1.11.0
11.3
GeForce GTX 1050 Ti
eigh first attempt ...
exception caught: cusolver error: CUSOLVER_STATUS_EXECUTION_FAILED, when calling `cusolverDnSsyevj( handle, jobz, uplo, n, A, lda, W, work, lwork, info, params)`. This error may appear if the input matrix contains NaN.
eigh second attempt ...

Best.

K. Frank

ptrblck · June 27, 2022, 12:30am

Hi @KFrank,

thanks for creating the issue and code snippet!
So far I was unable to reproduce the issue using 1.13.0.dev20220626+cu113 on a P100, which is of course not the same device you are using (same architecture though).
While I’m trying to grab a 1050Ti (or a more comparable GPU) could you try to install the latest CUDA11.6 nightlies and see if you would still hit the issue?
This should work:

pip install --pre torch --extra-index-url https://download.pytorch.org/whl/nightly/cu116

KFrank · June 27, 2022, 12:54am

Hi @ptrblck!

Yes, I will give the CUDA11.6 nightly a try.

I have been using conda exclusively for pytorch (except for that old 0.3.0
version I was playing with a couple of years back), so I may need some
coaching on pip. Should I run that pip command within a conda environment
to avoid conflicts with other versions of pytorch?

Or to ask it another way, how should I manage using conda on the one hand
and pip on the other?

Thanks.

K. Frank

ptrblck · June 27, 2022, 1:22am

Yes, creating a new conda env and using the posted pip install command should still work (this is how I’m testing the binaries usually). In any case, let me post the conda install command later as an update in case you want to use these binaries.

KFrank · June 27, 2022, 2:15am

Hi @ptrblck!

Thanks. Running your pip install command in a new conda environment
worked.

Using the cu116 nightly caused the first-call “cusolver error” to go away.
Here is the output of the script I posted above:

1.13.0.dev20220626+cu116
11.6
GeForce GTX 1050 Ti
eigh first attempt ...
eigh second attempt ...

As an aside, the gpu slowness (relative to the cpu) persists with the cu116
nightly.

(Also, the error still happened with the cu113 nightly after I tried rebooting
my machine just in case the gpu was in some confused state, although, for
me, the symptom that gets fixed by rebooting is the gpu not working at all
with pytorch.)

Best.

K. Frank