RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)` while running fine on the CPU

I have a classification model that works on the CPU, but when I try to run it on the GPU using DataParallel I get the following error:

Traceback (most recent call last):
  File "Main.py", line 58, in <module>
    trainer.train(dl_train=train_loader, dl_validation=validation_loader)
  File "/home/dsi/davidsr/AttentionProj/Trainers.py", line 56, in train
    loss.backward()
  File "/home/dsi/davidsr/.local/lib/python3.6/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/dsi/davidsr/.local/lib/python3.6/site-packages/torch/autograd/__init__.py", line 132, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

I checked that the dimensions of the prediction and the ground truth are the same, but I can't seem to find why it raises this error.

I ran my code with CUDA_LAUNCH_BLOCKING=1 and got:

davidsr@dgx02:~/AttentionProj$ CUDA_LAUNCH_BLOCKING=1, CUDA_VISIBLE_DEVICES=3 python3 Main.py --n_epochs 2 --lr 0.0001 --new_split 0 --mode train --par 0
Traceback (most recent call last):
  File "Main.py", line 58, in <module>
    trainer.train(dl_train=train_loader, dl_validation=validation_loader)
  File "/home/dsi/davidsr/AttentionProj/Trainers.py", line 55, in train
    loss = self.w_bce(y_hat, t)
  File "/home/dsi/davidsr/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/dsi/davidsr/AttentionProj/Losses.py", line 27, in forward
    pos = torch.matmul(torch.matmul(t, self.pos_w), torch.log(y_hat + self.eps))
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemv(handle, op, m, n, &alpha, a, lda, x, incx, &beta, y, incy)`

It appears the error is in the loss function, so here is its implementation:

class WeightedBCE(nn.Module):
    def __init__(self, pos_w, neg_w):
        super(WeightedBCE, self).__init__()
        self.pos_w = torch.tensor(pos_w, dtype=torch.float, requires_grad=False)
        self.neg_w = torch.tensor(neg_w, dtype=torch.float, requires_grad=False)
        self.eps = 1e-10
        return

    def forward(self, y_hat, t):
        pos = torch.matmul(torch.matmul(t, self.pos_w), torch.log(y_hat + self.eps))
        neg = torch.matmul(torch.matmul(1 - t, self.pos_w), torch.log(1 - y_hat + self.eps))
        return torch.mean(pos + neg)

Hi,
Thanks for the report.
Could you give some details about:

  • Which GPU, CUDA version and PyTorch version are you using?
  • What are the y_hat and t Tensors you give as input? In particular, could you share t.size(), t.stride() and t.storage_offset() for both? (See the sketch below.)
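
If it helps, a temporary helper like this inside your forward would print everything we need (a rough sketch; describe is just an illustrative name):

# Temporary debug helper -- call it on both tensors at the top of
# WeightedBCE.forward, e.g. describe("y_hat", y_hat); describe("t", t)
def describe(name, x):
    print(name, "size:", x.size(), "stride:", x.stride(),
          "storage_offset:", x.storage_offset(),
          "dtype:", x.dtype, "device:", x.device)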

cc @ptrblck do you know what could be causing this?

It could be an OOM issue triggering the cublas error, but it might also be an internal cublas issue, so the setup information would be helpful to reproduce this issue on our side.
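
If you want to rule out memory pressure on your side, a rough check right before the failing matmul could look like this (just a sketch; the exact placement is up to you):

import torch

# An allocation failure inside cuBLAS can surface as CUBLAS_STATUS_ALLOC_FAILED
# instead of a regular OOM error, so it is worth checking how much memory is in use.
print("allocated:", torch.cuda.memory_allocated() / 1024**2, "MB")
print("reserved: ", torch.cuda.memory_reserved() / 1024**2, "MB")
print(torch.cuda.memory_summary(abbreviated=True))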

PyTorch version: 1.7.0
GPU: Tesla V100
NVIDIA-SMI 440.100, Driver Version: 440.100, CUDA Version: 10.2
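
In case it is useful, the same details printed from Python (a minimal sketch; python -m torch.utils.collect_env gives a fuller report):

import torch

# Quick environment report
print("torch:", torch.__version__)
print("CUDA runtime:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("GPU:", torch.cuda.get_device_name(0))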

Regarding the values you wanted for y_hat and t, the prints below were made from within the forward of the loss function:


y_hat
  size: torch.Size([12, 14])
  stride: (14, 1)
  storage_offset: 0

t
  size: torch.Size([12, 14])
  stride: (14, 1)
  storage_offset: 0

Anything else you need?

Thanks! Could you also share the size of pos_w and neg_w? I thought they were scalars, but it does not run when they are set as scalars.

Yes, they have the same size as the expected output (t or y_hat): each starts as a NumPy array of shape (1, 14) and is then converted to a torch.Tensor.

torch.Size([1,14])

And they are constants.

Hi,

I think there is still something I'm missing about the sizes. Here is my attempt to repro, but the shapes don't match. Could you give me an updated version with all the right shapes, please? Thanks!

import torch
from torch import nn

class WeightedBCE(nn.Module):
    def __init__(self, pos_w, neg_w):
        super(WeightedBCE, self).__init__()
        pos_w = torch.tensor(pos_w, dtype=torch.float, requires_grad=False)
        neg_w = torch.tensor(neg_w, dtype=torch.float, requires_grad=False)
        self.register_buffer("pos_w", pos_w)
        self.register_buffer("neg_w", neg_w)
        self.eps = 1e-10
        return

    def forward(self, y_hat, t):
        pos = torch.matmul(torch.matmul(t, self.pos_w), torch.log(y_hat + self.eps))
        neg = torch.matmul(torch.matmul(1 - t, self.pos_w), torch.log(1 - y_hat + self.eps))
        return torch.mean(pos + neg)


y_hat = torch.rand(12, 14, device="cuda", requires_grad=True)
t = torch.rand(12, 14, device="cuda", requires_grad=True)

pos_w = torch.rand(1, 14).numpy()
neg_w = torch.rand(1, 14).numpy()
mod = WeightedBCE(pos_w, neg_w).cuda()

mod(y_hat, t)

You got the sizes right. That's what is so weird.

I don't really get how it can run on my CPU while on the GPU it throws the error.

The sizes are unfortunately not right, as the code still raises:

RuntimeError: mat1 dim 1 must match mat2 dim 0

so could you please double check the shapes and/or post an executable code snippet so that we can debug it?
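
For reference, this is the shape behavior behind that error, assuming the weights really have the posted shape of (1, 14) (standalone CPU sketch):

import torch

t = torch.rand(12, 14)

# 2D x 2D matmul needs matching inner dimensions, so (12, 14) x (1, 14) fails
try:
    torch.matmul(t, torch.rand(1, 14))
except RuntimeError as e:
    print(e)  # mat1 dim 1 must match mat2 dim 0

# 2D x 1D matmul is a matrix-vector product: (12, 14) x (14,) -> (12,)
print(torch.matmul(t, torch.rand(14)).shape)  # torch.Size([12])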

I ran the code on the CPU, and these are the variables with their sizes:

y_hat size is: torch.Size([12, 14])
y_hat values:

tensor([[0.6129, 0.5309, 0.5493, 0.5381, 0.3087, 0.5583, 0.6137, 0.5149, 0.4085,
         0.5862, 0.2384, 0.5680, 0.5260, 0.4991],
        [0.6569, 0.5384, 0.5118, 0.5537, 0.2853, 0.5693, 0.5910, 0.5591, 0.4307,
         0.6214, 0.2627, 0.5559, 0.4875, 0.5189],
        [0.6661, 0.5641, 0.5279, 0.5512, 0.2661, 0.5696, 0.6380, 0.5634, 0.4121,
         0.6507, 0.2291, 0.5484, 0.5269, 0.5167],
        [0.6080, 0.5319, 0.5410, 0.5360, 0.3337, 0.5400, 0.6016, 0.5272, 0.4206,
         0.5621, 0.2836, 0.5702, 0.5462, 0.5181],
        [0.6449, 0.5319, 0.5020, 0.5396, 0.3185, 0.5530, 0.6014, 0.5320, 0.4364,
         0.5910, 0.2769, 0.5569, 0.5150, 0.5478],
        [0.6354, 0.5295, 0.5082, 0.5238, 0.3286, 0.5463, 0.6044, 0.5155, 0.4336,
         0.5611, 0.2828, 0.5773, 0.5416, 0.5526],
        [0.6253, 0.5377, 0.5210, 0.5412, 0.3121, 0.5338, 0.5989, 0.5410, 0.4344,
         0.5828, 0.2790, 0.5386, 0.4860, 0.5109],
        [0.5994, 0.5170, 0.5295, 0.5254, 0.3505, 0.5416, 0.5829, 0.5294, 0.4342,
         0.5438, 0.2987, 0.5563, 0.5430, 0.5301],
        [0.7066, 0.6006, 0.5521, 0.5716, 0.2426, 0.5623, 0.7090, 0.5899, 0.3890,
         0.7019, 0.1959, 0.5618, 0.5408, 0.5389],
        [0.6898, 0.5965, 0.5376, 0.5712, 0.2483, 0.5569, 0.6715, 0.5582, 0.4050,
         0.6706, 0.2231, 0.5759, 0.5098, 0.5386],
        [0.7319, 0.6521, 0.5477, 0.6104, 0.1955, 0.5622, 0.7457, 0.5822, 0.3798,
         0.7330, 0.1674, 0.6016, 0.5055, 0.5282],
        [0.6339, 0.5636, 0.5512, 0.5806, 0.3190, 0.5315, 0.5829, 0.5866, 0.4368,
         0.5918, 0.2504, 0.5495, 0.5461, 0.4653]], grad_fn=<SigmoidBackward>)

t size is: torch.Size([12, 14])
t values:

tensor([[0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1.],
        [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
        [0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0.],
        [0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
        [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
        [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.]])

pos_w size is: torch.Size([14])
pos_w values:

tensor([0.7754, 0.9461, 0.9093, 0.9552, 0.7428, 0.9514, 0.9681, 0.9956, 0.6159,
        0.8887, 0.8778, 0.9344, 0.9717, 0.8984])

neg_w size is: torch.Size([14])
neg_w values:

tensor([0.2246, 0.0539, 0.0907, 0.0448, 0.2572, 0.0486, 0.0319, 0.0044, 0.3841,
        0.1113, 0.1222, 0.0656, 0.0283, 0.1016])

I was indeed mistaken and I apologize for that.

Ah, right!
So the updated code below runs!
But at least on the P100 of Colab, it does not raise any error :confused:

@ptrblck would you have a V100 handy to test this?

import torch
from torch import nn

class WeightedBCE(nn.Module):
    def __init__(self, pos_w, neg_w):
        super(WeightedBCE, self).__init__()
        pos_w = torch.tensor(pos_w, dtype=torch.float, requires_grad=False)
        neg_w = torch.tensor(neg_w, dtype=torch.float, requires_grad=False)
        self.register_buffer("pos_w", pos_w)
        self.register_buffer("neg_w", neg_w)
        self.eps = 1e-10
        return

    def forward(self, y_hat, t):
        pos = torch.matmul(torch.matmul(t, self.pos_w), torch.log(y_hat + self.eps))
        neg = torch.matmul(torch.matmul(1 - t, self.pos_w), torch.log(1 - y_hat + self.eps))
        return torch.mean(pos + neg)


y_hat = torch.rand(12, 14, device="cuda", requires_grad=True)
t = torch.rand(12, 14, device="cuda", requires_grad=True)

pos_w = torch.rand(14).numpy()
neg_w = torch.rand(14).numpy()
mod = WeightedBCE(pos_w, neg_w).cuda()

mod(y_hat, t)
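
As a follow-up to the snippet above, this is how the shapes flow through forward with these inputs (inner and outer are just local names for illustration):

# matmul(t, pos_w): (12, 14) x (14,) is a matrix-vector product -> (12,)
inner = torch.matmul(t, mod.pos_w)
# matmul(inner, log(y_hat)): (12,) x (12, 14) is a vector-matrix product -> (14,)
# these 1D matmuls are presumably the cublasSgemv calls from the original traceback
outer = torch.matmul(inner, torch.log(y_hat + mod.eps))
print(inner.shape, outer.shape, mod(y_hat, t).shape)  # (12,), (14,), scalar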

I also tested it on a machine with a TITAN RTX and it works there as well.

@albanD Sure, I can test it.

The code runs fine on machines using a V100-DGXS-16GB (driver 440.33.01) and a V100-SXM3-32GB (driver 450.51.06) with the conda PyTorch binaries for 1.7.0 and 1.7.1 and the CUDA 10.2 runtime.