Undebuggable CUDA/MAGMA error

Hi guys,
I’m suddenly getting a strange error (see below). I don’t know what I changed in my system to cause it (I changed almost nothing). Worse, I can’t debug it at all, because it doesn’t occur when I start the training in PyCharm’s debug mode. Does anybody have an idea how I could debug this? (I already tried launch-blocking.)
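For readers unfamiliar with it, launch-blocking is normally enabled through the `CUDA_LAUNCH_BLOCKING` environment variable. The snippet below is a minimal sketch that assumes the variable is set at the very top of the training script, before anything initializes CUDA. Since the messages in this thread are printed while training keeps running, synchronous launches may well not change what you see.

```python
import os

# Assumption: this runs before torch (or anything else) initializes CUDA,
# e.g. as the first lines of the training script.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# From here on, import torch and start training as usual.
print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
```

Setting the variable in the shell that launches the script has the same effect.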

By the way, the full error output looks like this:

Training: 0%| | 1/10000 [09:51<1642:14:45, 591.27s/it]CUBLAS error: out of memory (3) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: not initialized (1) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: out of memory (3) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: not initialized (1) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: out of memory (3) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: not initialized (1) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: out of memory (3) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: not initialized (1) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: out of memory (3) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: not initialized (1) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: out of memory (3) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: not initialized (1) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: out of memory (3) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: not initialized (1) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: out of memory (3) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: not initialized (1) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: out of memory (3) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: not initialized (1) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: out of memory (3) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: not initialized (1) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: out of memory (3) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: not initialized (1) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: out of memory (3) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: not initialized (1) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
python: /opt/conda/conda-bld/magma-cuda102_1583546904148/work/interface_cuda/interface.cpp:810: void magma_queue_create_internal(magma_device_t, magma_queue**, const char*, const char*, int): Assertion `queue->dCarray__ != __null’ failed.

OK, by manually inserting tons of logging messages all over my code, I finally found the place where the error occurs. It is the following line:

    self.mvn = MultivariateNormal(loc=torch.zeros(d, device=data.device), covariance_matrix=cov)

When this line is called with d=1, cov=tensor([[0.1121]]) and data.device = "cuda:0", I get the error message:
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138

The training loop then continues until it reaches this point again, where it produces the same error. After a few iterations the script crashes with the error:
python: /opt/conda/conda-bld/magma-cuda102_1583546904148/work/interface_cuda/interface.cpp:810: void magma_queue_create_internal(magma_device_t, magma_queue**, const char*, const char*, int): Assertion `queue->dCarray__ != __null’ failed.
Aborted (core dumped)

I actually can’t imagine what could be so problematic about a 1-dim gaussian variable. Does anybody have a clue?
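One thing that might be worth trying for the 1-dim case: `MultivariateNormal` also accepts a `scale_tril` argument (the lower Cholesky factor of the covariance), and for a 1x1 covariance that factor is simply the square root. The sketch below builds the distribution that way, so no Cholesky factorization is requested for d=1. The helper name `make_mvn` is made up for illustration, and whether this actually avoids the MAGMA error reported here is untested.

```python
import torch
from torch.distributions import MultivariateNormal


def make_mvn(cov: torch.Tensor, device: torch.device) -> MultivariateNormal:
    """Hypothetical helper: build a zero-mean MVN, special-casing d == 1."""
    d = cov.shape[-1]
    loc = torch.zeros(d, device=device)
    cov = cov.to(device)
    if d == 1:
        # For a 1x1 covariance the Cholesky factor is just sqrt(cov),
        # so it can be passed directly instead of the covariance matrix.
        return MultivariateNormal(loc=loc, scale_tril=cov.sqrt())
    # General case: unchanged, the distribution factorizes cov itself.
    return MultivariateNormal(loc=loc, covariance_matrix=cov)


# Values from the post above; the device is "cpu" here only so the sketch
# runs anywhere. In the training code it would be data.device.
cov = torch.tensor([[0.1121]])
mvn = make_mvn(cov, torch.device("cpu"))
print(mvn.covariance_matrix)
```

For d=1, `torch.distributions.Normal` with `scale=cov.sqrt()` would be another option, at the cost of a slightly different event shape.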

Hi, I’m hitting the same error. Have you solved it?

Hi, I also have the same issue, also with a 1-dim Gaussian variable. If anyone figures it out, please post the solution here.

I have a similar error:

CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
python3.8: /opt/conda/conda-bld/magma-cuda102_1583546904148/work/interface_cuda/interface.cpp:806: void magma_queue_create_internal(magma_device_t, magma_queue**, const char*, const char*, int): Assertion `queue->dAarray__ != __null' failed.

The ugly thing is that it comes up out of nowhere. I’ve run the same code on multiple machines hundreds of times, and now it crashes about 1 time in 10. The only thing that definitely changed recently is the PyTorch version: I was using PyTorch 1.7.1 before and am now on the latest version.

Edit: one possible ugly workaround could be to find where in our code this happens and move that computation to the CPU, avoiding the GPU for those lines. It’s ugly but could be a band-aid for the problem.
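A rough sketch of that CPU band-aid, under the assumption that the failing step is the Cholesky factorization `MultivariateNormal` performs on `covariance_matrix` (which is what the `magma_spotrf` lines suggest): factorize on the CPU, move the factor back to the original device, and hand it over as `scale_tril`. The helper name `mvn_with_cpu_cholesky` is hypothetical, and `torch.linalg.cholesky` requires a reasonably recent PyTorch.

```python
import torch
from torch.distributions import MultivariateNormal


def mvn_with_cpu_cholesky(loc: torch.Tensor, cov: torch.Tensor) -> MultivariateNormal:
    """Hypothetical helper: factorize cov on the CPU, keep the rest on loc.device."""
    # .cpu() and .to() are differentiable, so gradients w.r.t. cov still flow.
    scale_tril = torch.linalg.cholesky(cov.cpu()).to(loc.device)
    # Passing scale_tril means the distribution does not factorize cov itself.
    return MultivariateNormal(loc=loc, scale_tril=scale_tril)


# Example values chosen for illustration only.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
cov = torch.tensor([[2.0, 0.5], [0.5, 1.0]], device=device)
mvn = mvn_with_cpu_cholesky(torch.zeros(2, device=device), cov)
print(mvn.sample())
```

The host/device copies add latency, so this only makes sense for small matrices like the ones discussed here.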

Edit2: After failing like this, the code can’t load anything more into GPU memory because it thinks the memory is full (CUDA out of memory), but it isn’t.
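To check whether the memory really is full at that point, it may help to compare what PyTorch's allocator has claimed with what the driver reports as free. This is only a diagnostic sketch; the function name `cuda_memory_report` is made up, and `torch.cuda.mem_get_info` is only available in newer PyTorch releases.

```python
import torch


def cuda_memory_report(device: int = 0):
    """Hypothetical helper: return memory statistics in bytes, or None without CUDA."""
    if not torch.cuda.is_available():
        return None
    # Free and total device memory as seen by the CUDA driver.
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    return {
        # Memory currently occupied by live tensors.
        "allocated": torch.cuda.memory_allocated(device),
        # Memory held by PyTorch's caching allocator (includes cached blocks).
        "reserved": torch.cuda.memory_reserved(device),
        "free": free_bytes,
        "total": total_bytes,
    }


report = cuda_memory_report()
print(report)
```

If "free" is large while allocations still fail, that would point towards a broken CUDA context rather than genuinely exhausted memory, though this is a guess and not something confirmed in this thread.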