Undebuggable cuda/magma error:

Hi guys,
I’ve suddenly started getting a strange error (see below). I don’t know what change in my system caused it (I changed almost nothing). Worse, I can’t debug it at all, because it doesn’t occur when I start the training in PyCharm debug mode. Does anybody have an idea how I could debug this? (I already tried launch-blocking.)

The full error output looks like this:

Training: 0%| | 1/10000 [09:51<1642:14:45, 591.27s/it]CUBLAS error: out of memory (3) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: not initialized (1) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: out of memory (3) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: not initialized (1) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: out of memory (3) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: not initialized (1) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: out of memory (3) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: not initialized (1) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: out of memory (3) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: not initialized (1) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: out of memory (3) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: not initialized (1) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: out of memory (3) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: not initialized (1) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: out of memory (3) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: not initialized (1) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: out of memory (3) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: not initialized (1) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: out of memory (3) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: not initialized (1) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: out of memory (3) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: not initialized (1) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: out of memory (3) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: not initialized (1) in magma_ssyevd_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/ssyevd_gpu.cpp:226
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:137
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
python: /opt/conda/conda-bld/magma-cuda102_1583546904148/work/interface_cuda/interface.cpp:810: void magma_queue_create_internal(magma_device_t, magma_queue**, const char*, const char*, int): Assertion `queue->dCarray__ != __null' failed.

OK, by manually inserting tons of logging messages all over my code, I finally found the place where the error occurs. It is the following line:

    self.mvn = MultivariateNormal(loc=torch.zeros(d, device=data.device), covariance_matrix=cov)

When this line is called with d=1, cov=tensor([[0.1121]]) and data.device = "cuda:0", I receive the error message:
CUBLAS error: out of memory (3) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
CUBLAS error: not initialized (1) in magma_spotrf_LL_expert_gpu at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_gpu.cpp:138
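
To check whether the call fails on its own or only inside the training run, it can be reduced to a stand-alone script. The following is a minimal sketch, not the original training code: the helper name `build_mvn` and the CPU fallback are assumptions added for illustration. `spotrf` is the single-precision Cholesky factorization, which is what `MultivariateNormal` computes when it is given a `covariance_matrix`, so the sketch prints the resulting factor along with the memory PyTorch has allocated on the device just before the call.

```python
import os

# Must be set before CUDA is initialized so kernel launches run synchronously.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")

import torch
from torch.distributions import MultivariateNormal


def build_mvn(d, cov, device):
    """Hypothetical helper mirroring the failing call: zero mean, given covariance."""
    return MultivariateNormal(
        loc=torch.zeros(d, device=device),
        covariance_matrix=cov.to(device),
    )


# Assumption: fall back to the CPU so the script also runs without a GPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
cov = torch.tensor([[0.1121]])

if device.type == "cuda":
    # Bytes currently occupied by PyTorch tensors on the device.
    print("allocated bytes:", torch.cuda.memory_allocated(device))

mvn = build_mvn(1, cov, device)

# scale_tril is the Cholesky factor of cov; for d=1 it is sqrt(0.1121).
print(mvn.scale_tril)
```

If this script runs cleanly on the GPU while the training run still fails at the same line, the covariance matrix itself is probably not the issue, and the state of the GPU at that point in training (for example, how much memory is still free) would be the next thing to inspect.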

The training loop then continues until it reaches this point again, where it produces the same error. After a few iterations the script crashes with the error:
python: /opt/conda/conda-bld/magma-cuda102_1583546904148/work/interface_cuda/interface.cpp:810: void magma_queue_create_internal(magma_device_t, magma_queue**, const char*, const char*, int): Assertion `queue->dCarray__ != __null' failed.
Aborted (core dumped)

I really can’t imagine what could be so problematic about a 1-dimensional Gaussian variable. Does anybody have a clue?