cuBLAS runtime error with RTX 2080 Ti

Recently, our lab bought a new server with 9 GPUs, and I want to run my existing program on this machine. Although I have not changed my code, it no longer runs successfully on the new machine (RTX 2080 Ti), and I get the following error.

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/THCGeneral.cpp line=663 error=11 : invalid argument
Traceback (most recent call last):
  File "main.py", line 166, in <module>
    p_img.copy_(netG(p_z).detach())
  File "/usr/local/anaconda3/envs/tensorflow/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/szhangcj/python/GBGAN/celebA_attention/sagan_models.py", line 100, in forward
    out,p1 = self.attn1(out)
  File "/usr/local/anaconda3/envs/tensorflow/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/szhangcj/python/GBGAN/celebA_attention/sagan_models.py", line 32, in forward
    energy = torch.bmm(proj_query,proj_key) # transpose check
RuntimeError: cublas runtime error : the GPU program failed to execute at /opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/THCBlas.cu:411

It seems that this error is caused by the detach() function. The following is my code.

z_b = torch.FloatTensor(opt.batch_size, opt.z_dim).to(device)
img_b = torch.FloatTensor(opt.batch_size, 3, 64, 64).to(device)
img_a = torch.FloatTensor(opt.batch_size, 3, 64, 64).to(device)
p_z = torch.FloatTensor(pool_size, opt.z_dim).to(device)
p_img = torch.FloatTensor(pool_size, 3, 64, 64).to(device)

show_z_b = torch.FloatTensor(100, opt.z_dim).to(device)
eval_z_b = torch.FloatTensor(250, opt.z_dim).to(device) # 250/batch * 120 --> 300000

optim_D = optim.Adam(netD.parameters(), lr=opt.lr_d) # other param?
optim_G = optim.Adam(netG.parameters(), lr=opt.lr_g) #?suitable
criterion_G = nn.MSELoss()

eta = 1
loss_GD = []
pre_loss = 0
cur_loss = 0
G_epoch = 1

for epoch in range(start_epoch, start_epoch + opt.num_epoch):

    print('Start epoch: %d' % epoch)

    ## input_pool: [pool_size, opt.z_dim] -> [pool_size, 32, 32]
    netD.train()
    netG.eval()

    p_z.normal_()
    print(netG(p_z).detach().size())
    p_img.copy_(netG(p_z).detach())

    for t in range(opt.period):

        for _ in range(opt.dsteps):

            t = time.time()
            ### Update D
            netD.zero_grad()
            ## real
            real_img, _ = next(iter(dataloader)) # [batch_size, 1, 32, 32]
            img_b.copy_(real_img.squeeze().to(device))
            real_D_err = torch.log(1 + torch.exp(-netD(img_b))).mean()
            print("D real loss", netD(img_b).mean())
            # real_D_err.backward()

            ## fake
            z_b_idx = random.sample(range(pool_size), opt.batch_size)
            img_a.copy_(p_img[z_b_idx])
            fake_D_err = torch.log(1 + torch.exp(netD(img_a))).mean() # torch scalar[]
            loss_gp = calc_gradient_penalty(netD, img_b, img_a)
            total_loss = real_D_err + fake_D_err + loss_gp
            print("D fake loss", netD(img_a).mean())
            total_loss.backward()

            optim_D.step()

        ## update input pool
        p_img_t = p_img.clone().to(device)
        p_img_t.requires_grad_(True)
        if p_img_t.grad is not None:
            p_img_t.grad.zero_()
        fake_D_score = netD(p_img_t)

        fake_D_score.backward(torch.ones(len(p_img_t)).to(device))

        p_img = img_truncate(p_img + eta * p_img_t.grad)
        print("The mean of gradient", torch.mean(p_img_t.grad))

Hi,

Can you run your code with CUDA_LAUNCH_BLOCKING=1 to get a proper stack trace and report it here please?

I tried running it as you suggested, but I still get the same error.

Hi,

If you run your script as CUDA_LAUNCH_BLOCKING=1 python your_script.py, you will get the same error but potentially a different stack trace. This is because it disables the asynchronous part of the CUDA API, so the error is raised as soon as it happens rather than on the next, unrelated CUDA op.
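If it is more convenient to launch from Python than from the shell, the same idea can be sketched with the standard library alone. The helper names blocking_env and run_blocking are hypothetical, and main.py is simply the script name taken from the traceback. The variable is set in the child process environment so that it is already in place before that process initializes CUDA.

```python
import os
import subprocess
import sys


def blocking_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_blocking(argv):
    """Run a command with CUDA_LAUNCH_BLOCKING=1 and capture its output.

    Hypothetical helper: argv would be something like [sys.executable, "main.py"].
    """
    return subprocess.run(
        argv,
        env=blocking_env(),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
```

Calling run_blocking([sys.executable, "main.py"]) and printing the stderr attribute of the result should then show the traceback at the op that actually failed.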

The following is the error. However, I do not get this error when I run my program on four Tesla V100s. How can I solve the problem? Thank you!
Start epoch: 0
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/THCGeneral.cpp line=663 error=11 : invalid argument
Traceback (most recent call last):
  File "main.py", line 166, in <module>
    p_img.copy_(netG(p_z).detach())
  File "/usr/local/anaconda3/envs/tensorflow/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/szhangcj/python/GBGAN/celebA_attention/sagan_models.py", line 100, in forward
    out,p1 = self.attn1(out)
  File "/usr/local/anaconda3/envs/tensorflow/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/szhangcj/python/GBGAN/celebA_attention/sagan_models.py", line 32, in forward
    energy = torch.bmm(proj_query,proj_key) # transpose check
RuntimeError: cublas runtime error : the GPU program failed to execute at /opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/THCBlas.cu:411

Thanks for the stack trace.
Maybe @ngimel will be able to help you investigate this further?
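To help with that investigation, one option is to take the model out of the picture and run only the call that fails in the traceback. The following is a minimal sketch, not the code from sagan_models.py: the function name check_bmm and the tensor shapes are hypothetical. It prints the PyTorch version, the CUDA version PyTorch was built with, and the GPU's compute capability, then runs one small torch.bmm. Comparing the output on the V100 machine and the RTX 2080 Ti machine may show where the two setups differ.

```python
try:
    import torch
except ImportError:  # lets the sketch load on machines without PyTorch
    torch = None


def check_bmm(batch=2, n=16, d=8):
    """Report the CUDA setup and run one small torch.bmm on the first GPU.

    The shapes are hypothetical; they are not the ones used in sagan_models.py.
    Returns the result shape, or None if PyTorch or a GPU is unavailable.
    """
    if torch is None or not torch.cuda.is_available():
        return None
    device = torch.device("cuda:0")
    print("PyTorch:", torch.__version__)
    print("CUDA used to build PyTorch:", torch.version.cuda)
    print("GPU:", torch.cuda.get_device_name(0))
    print("Compute capability:", torch.cuda.get_device_capability(0))
    query = torch.randn(batch, n, d, device=device)
    key = torch.randn(batch, d, n, device=device)
    energy = torch.bmm(query, key)  # the same call that fails in the traceback
    torch.cuda.synchronize()  # surface any asynchronous CUDA error here
    return tuple(energy.shape)


if __name__ == "__main__":
    print("bmm result shape:", check_bmm())
```

If this small script fails on the RTX 2080 Ti in the same way, the problem is likely in the environment rather than in the training code, and the printed versions are worth including in the report.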