CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)

I’m trying to use torch.mm(), but sometimes I get this error.

I run the code under PyTorch 1.4 with CUDA 9.0.
I also ran the code with PyTorch 1.5 and CUDA 9.0, and sometimes I still get this bug.
More details in #180

This error might be raised if you are running out of memory and cuBLAS fails to create the handle, so try to reduce the memory usage, e.g. via a smaller batch size.
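As a rough illustration (the dataset here is a dummy placeholder, not from your code), you could check how much memory PyTorch is already using on the device and then lower the batch size of your DataLoader:

import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device('cuda:0')
# memory currently allocated / reserved by PyTorch on this device
print(torch.cuda.memory_allocated(device) / 1024**2, 'MiB allocated')
print(torch.cuda.memory_reserved(device) / 1024**2, 'MiB reserved')

# dummy dataset standing in for the real one; a smaller batch_size lowers the
# activation memory of each forward/backward pass
dataset = TensorDataset(torch.randn(1000, 128), torch.randint(0, 2, (1000,)))
loader = DataLoader(dataset, batch_size=16, shuffle=True)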


Hi, I get the same error.
Everything works fine when I use only one GPU, but if I use two, I get this error. I reduced the size of my problem so it runs and finishes on one CPU swiftly.

File “/home/server/Escritorio/AlejandroF/agfa/simple-transformers/train_language_model.py”, line 138, in
model.train_model(train_filename)
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/simpletransformers/language_modeling/language_modeling_model.py”, line 431, in train_model
global_step, training_details = self.train(
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/simpletransformers/language_modeling/language_modeling_model.py”, line 774, in train
model(inputs, labels=labels)
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/torch/nn/modules/module.py”, line 727, in _call_impl
result = self.forward(*input, **kwargs)
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py”, line 161, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py”, line 171, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py”, line 86, in parallel_apply
output.reraise()
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/torch/_utils.py”, line 428, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py”, line 61, in _worker
output = module(*input, **kwargs)
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/torch/nn/modules/module.py”, line 727, in _call_impl
result = self.forward(*input, **kwargs)
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/transformers/models/bert/modeling_bert.py”, line 1329, in forward
outputs = self.bert(
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/torch/nn/modules/module.py”, line 727, in _call_impl
result = self.forward(*input, **kwargs)
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/transformers/models/bert/modeling_bert.py”, line 991, in forward
encoder_outputs = self.encoder(
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/torch/nn/modules/module.py”, line 727, in _call_impl
result = self.forward(*input, **kwargs)
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/transformers/models/bert/modeling_bert.py”, line 582, in forward
layer_outputs = layer_module(
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/torch/nn/modules/module.py”, line 727, in _call_impl
result = self.forward(*input, **kwargs)
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/transformers/models/bert/modeling_bert.py”, line 470, in forward
self_attention_outputs = self.attention(
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/torch/nn/modules/module.py”, line 727, in _call_impl
result = self.forward(*input, **kwargs)
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/transformers/models/bert/modeling_bert.py”, line 401, in forward
self_outputs = self.self(
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/torch/nn/modules/module.py”, line 727, in _call_impl
result = self.forward(*input, **kwargs)
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/transformers/models/bert/modeling_bert.py”, line 267, in forward
mixed_query_layer = self.query(hidden_states)
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/torch/nn/modules/module.py”, line 727, in _call_impl
result = self.forward(*input, **kwargs)
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/torch/nn/modules/linear.py”, line 93, in forward
return F.linear(input, self.weight, self.bias)
File “/home/server/anaconda3/envs/agfa/lib/python3.9/site-packages/torch/nn/functional.py”, line 1692, in linear
output = input.matmul(weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle)

I meant “on one GPU swiftly.”

nn.DataParallel (which seems to be used in your case) can create an imbalanced memory usage and thus cause an OOM on the default device, which is why we recommend using DistributedDataParallel with a single process per GPU instead.
I assume you are also running out of memory, so I would recommend trying DDP.
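A minimal DDP sketch (not the simpletransformers integration; the model and dataset are dummy placeholders, and how local_rank is obtained depends on your launcher):

# launch with one process per GPU, e.g.
#   python -m torch.distributed.launch --use_env --nproc_per_node=2 train.py
# (or torchrun --nproc_per_node=2 train.py on newer PyTorch versions,
#  which set the LOCAL_RANK environment variable)
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dist.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

# dummy model and dataset standing in for the real ones
model = nn.Linear(128, 2).cuda(local_rank)
model = DDP(model, device_ids=[local_rank])
dataset = TensorDataset(torch.randn(1000, 128), torch.randint(0, 2, (1000,)))

# each process only loads its own shard of the data
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=8, sampler=sampler)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(2):
    sampler.set_epoch(epoch)
    for data, target in loader:
        data, target = data.cuda(local_rank), target.cuda(local_rank)
        optimizer.zero_grad()
        loss = criterion(model(data), target)
        loss.backward()
        optimizer.step()

Since each process holds its own replica and batch, the memory usage stays balanced across the GPUs instead of piling up on the default device.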

Thanks for your answer.

Hi, thanks for your reply. Is there any way I can reduce the memory usage? I have tried torch.einsum, but it does not seem to reduce the memory. Any idea?

Here is a simple example:

# old version
inter_matrix = torch.mm(flatten_masks, flatten_masks.transpose(1, 0))
# new version
inter_matrix = torch.einsum('ik, kj -> ij', flatten_masks, flatten_masks.transpose(1, 0))

I’m not sure what your use case is. Would you like to lower the memory usage on your GPU?

Yep, here is a script that I use to check the GPU memory and running time:

import torch
import numpy as np
import time

# 800 x 60800 float64 tensor moved to the first GPU
flatten_masks = np.random.random((800, 60800))
flatten_masks = torch.from_numpy(flatten_masks).cuda(device=0)

t1 = time.time()
i = 0
while i < 2500:
    if i == 500:
        # treat the first 500 iterations as warm-up and restart the timer
        torch.cuda.synchronize()
        t1 = time.time()
    # old version
    inter_matrix = torch.mm(flatten_masks, flatten_masks.transpose(1, 0))
    # new version
    # inter_matrix = torch.einsum('ik, kj -> ij', flatten_masks, flatten_masks.transpose(1, 0))
    i += 1
# CUDA kernels run asynchronously, so synchronize before stopping the timer
torch.cuda.synchronize()
t2 = time.time()
print(t2 - t1)

result:

                  torch.mm    torch.einsum
process time (s)  4.69856     4.5713
GPU memory (MiB)  807         807
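(Side note, separate from the run above: the peak allocation could also be read from inside the script, e.g.:)

torch.cuda.reset_peak_memory_stats(0)
inter_matrix = torch.mm(flatten_masks, flatten_masks.transpose(1, 0))
torch.cuda.synchronize()
# peak memory allocated by PyTorch tensors on GPU 0, in MiB
print(torch.cuda.max_memory_allocated(0) / 1024**2, 'MiB')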

And my question is: what code can I use to reduce the memory usage?