I have a classification model that works fine on CPU, but when I try to run it on GPU using DataParallel I get the following error:
Traceback (most recent call last):
File "Main.py", line 58, in <module>
trainer.train(dl_train=train_loader, dl_validation=validation_loader)
File "/home/dsi/davidsr/AttentionProj/Trainers.py", line 56, in train
loss.backward()
File "/home/dsi/davidsr/.local/lib/python3.6/site-packages/torch/tensor.py", line 221, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/dsi/davidsr/.local/lib/python3.6/site-packages/torch/autograd/__init__.py", line 132, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
I checked that the dimensions of the prediction and the ground truth match, but I can't see why this error is raised. I reran my code with CUDA_LAUNCH_BLOCKING=1 and got:
davidsr@dgx02:~/AttentionProj$ CUDA_LAUNCH_BLOCKING=1, CUDA_VISIBLE_DEVICES=3 python3 Main.py --n_epochs 2 --lr 0.0001 --new_split 0 --mode train --par 0
Traceback (most recent call last):
File "Main.py", line 58, in <module>
trainer.train(dl_train=train_loader, dl_validation=validation_loader)
File "/home/dsi/davidsr/AttentionProj/Trainers.py", line 55, in train
loss = self.w_bce(y_hat, t)
File "/home/dsi/davidsr/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/dsi/davidsr/AttentionProj/Losses.py", line 27, in forward
pos = torch.matmul(torch.matmul(t, self.pos_w), torch.log(y_hat + self.eps))
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemv(handle, op, m, n, &alpha, a, lda, x, incx, &beta, y, incy)`
The error appears to come from my loss function, so here is its implementation:
class WeightedBCE(nn.Module):
    def __init__(self, pos_w, neg_w):
        super(WeightedBCE, self).__init__()
        self.pos_w = torch.tensor(pos_w, dtype=torch.float, requires_grad=False)
        self.neg_w = torch.tensor(neg_w, dtype=torch.float, requires_grad=False)
        self.eps = 1e-10

    def forward(self, y_hat, t):
        pos = torch.matmul(torch.matmul(t, self.pos_w), torch.log(y_hat + self.eps))
        neg = torch.matmul(torch.matmul(1 - t, self.neg_w), torch.log(1 - y_hat + self.eps))
        return torch.mean(pos + neg)
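One thing I suspect (not confirmed): `self.pos_w` and `self.neg_w` are plain tensor attributes, so `model.to(device)` and `nn.DataParallel` will not move them to the GPU with the rest of the module, which can surface as a cuBLAS failure inside `matmul`. A minimal sketch of the same loss with the weights registered as buffers, so they follow the module between devices, would look like this:

```python
import torch
import torch.nn as nn


class WeightedBCE(nn.Module):
    """Weighted BCE whose class weights travel with the module.

    register_buffer makes pos_w / neg_w part of the module state, so
    .to(device), .cuda(), and DataParallel replication move them onto
    the same device as the inputs automatically.
    """

    def __init__(self, pos_w, neg_w):
        super().__init__()
        # Buffers are not trained, but they are moved/replicated with the module.
        self.register_buffer("pos_w", torch.tensor(pos_w, dtype=torch.float))
        self.register_buffer("neg_w", torch.tensor(neg_w, dtype=torch.float))
        self.eps = 1e-10  # numerical floor to keep log() finite

    def forward(self, y_hat, t):
        pos = torch.matmul(torch.matmul(t, self.pos_w), torch.log(y_hat + self.eps))
        neg = torch.matmul(torch.matmul(1 - t, self.neg_w), torch.log(1 - y_hat + self.eps))
        return torch.mean(pos + neg)
```

This keeps the forward pass identical; only the storage of the weights changes, and on CPU it behaves exactly like the original class.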