Runtime Error: CUDA Error

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasLtMatmul( ltHandle, computeDesc.descriptor(), &alpha_val, mat1_ptr, Adesc.descriptor(), mat2_ptr, Bdesc.descriptor(), &beta_val, result_ptr, Cdesc.descriptor(), result_ptr, Cdesc.descriptor(), &heuristicResult.algo, workspace.data_ptr(), workspaceSize, at::cuda::getCurrentCUDAStream())`

or
RuntimeError: philox_cuda_state for an unexpected CUDA generator used during capture. In regions captured by CUDA graphs, you may only use the default CUDA RNG generator on the device that’s current when capture begins. If you need a non-default (user-supplied) generator, or a generator on another device, please file an issue.

This error pops up while trying to train a transformer model from scratch in Colab. It runs fine on CPU, though!

How can I resolve this?

I am not familiar with Colab, but maybe you could post some potentially relevant code as well so it's easier for us to diagnose the cause of your issue?

Hi @Yuyao_Huang
Cell 102 of nlp-basics-to-advanced/notebook.ipynb at main · Suraj520/nlp-basics-to-advanced · GitHub

When I move the tensor tgt_mask to a device other than CPU, i.e. CUDA, the error pops up!

The snippet is:

step=0
num_epochs=10
def fit(num_epochs, train_loader, val_loader, model, optimizer, criterion):
    global step  # `step` is defined at module level and incremented below
    for epoch in range(num_epochs):
        model.train()  # re-enable training mode after the previous epoch's eval pass
        checkpoint = {'state_dict':model.state_dict(), 'optimizer':optimizer.state_dict()}
        if epoch%10==0:
            torch.save(checkpoint,"checkpoint.pth.tar")
        #loading batch
        for i,batch in enumerate(train_loader):
            input_seq = batch['input'].to(device)
            #input_mask = batch['input_mask'].to(device)
            output_seq = batch['output'].to(device)
            #output_mask = batch['output_mask'].to(device)
            input_padding_mask = input_seq == PAD_IDX
            output_padding_mask = output_seq == PAD_IDX
            #input_mask, output_mask, input_padding_mask, output_padding_mask = create_mask(input_seq, output_seq)
            memory_key_padding_mask = input_padding_mask.clone()
            input_padding_mask = rearrange(input_padding_mask, 'n s -> s n')
            output_padding_mask = rearrange(output_padding_mask, 'n s -> s n')
            memory_key_padding_mask = rearrange(memory_key_padding_mask, 'n s -> s n')
            
            tgt_sentence_len = output_seq.shape[0] - torch.sum(output_padding_mask,axis=1)
            tgt_inp, tgt_out = output_seq[:,:], output_seq[:,:]
            tgt_mask = gen_nopeek_mask(output_seq.shape[0])#.to('cuda') #('cuda)
            tgt_mask = tgt_mask.to(device)
            output = model(input_seq,tgt_inp,0,input_padding_mask, output_padding_mask,memory_key_padding_mask,tgt_mask)

            #generating one hot
            from_one_hot = torch.argmax(output,dim=2)

            output = output.view(-1, output.shape[-1])
            tgt_out = tgt_out.view(-1)
            loss = criterion(output, tgt_out)
            optimizer.zero_grad()  # clear gradients accumulated from the previous step
            loss.backward()
            optimizer.step()

        #validation
        model.eval()
        with torch.no_grad():
            for i, batch in enumerate(val_loader):
                input_seq = batch['input'].to(device)
                output_seq = batch['output'].to(device)
                input_padding_mask = input_seq == PAD_IDX
                output_padding_mask = output_seq == PAD_IDX
                #input_mask, output_mask, input_padding_mask, output_padding_mask = create_mask(input_seq, output_seq)
                memory_key_padding_mask = input_padding_mask.clone()
                input_padding_mask = rearrange(input_padding_mask, 'n s -> s n')
                output_padding_mask = rearrange(output_padding_mask, 'n s -> s n')
                memory_key_padding_mask = rearrange(memory_key_padding_mask, 'n s -> s n')
                
                tgt_sentence_len = output_seq.shape[0] - torch.sum(output_padding_mask,axis=1)
                tgt_inp, tgt_out = output_seq[:,:], output_seq[:,:]
                tgt_mask = gen_nopeek_mask(output_seq.shape[0])#.to('cuda') #('cuda)
                tgt_mask = tgt_mask.to(device)
                output = model(input_seq,tgt_inp,0,input_padding_mask, output_padding_mask,memory_key_padding_mask,tgt_mask)

                #generating one hot
                from_one_hot = torch.argmax(output,dim=2)

                output = output.view(-1, output.shape[-1])
                src_words = indices_to_string("ENGLISH",input_seq)
                predicted_words = indices_to_string("GERMAN",from_one_hot)
                tgt_words = indices_to_string("GERMAN",output_seq)
                print("Input English word - {}".format(src_words))
                print('Output German word - {}'.format(predicted_words))
                print("Actual German Word - {}".format(tgt_words))
                output_seq = output_seq.view(-1)
                val_loss = criterion(output,output_seq)

        print("Epoch - {}, Train Loss - {}, Val Loss - {}".format(epoch,loss.item(),val_loss.item()))
        writer.add_scalar("Train loss",loss, global_step=step)
        step+=1    
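For reference, gen_nopeek_mask is not defined in the snippet above; in this kind of setup it is typically implemented along the following lines (a minimal sketch assuming the standard upper-triangular causal mask; the notebook's exact version may differ):

import torch

def gen_nopeek_mask(length):
    # Upper-triangular causal mask: position i may not attend to positions > i.
    mask = torch.triu(torch.ones(length, length), diagonal=1)
    # nn.Transformer expects a float mask with -inf at disallowed positions.
    return mask.masked_fill(mask == 1, float('-inf'))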

Your code seems to be running into a sticky error, which wasn't raised properly. Could you update to the latest nightly binary and rerun your code to check if the error message improves, please?


@ptrblck: Thanks a lot! It got resolved. :smiley:

That’s great! Would you mind posting the issue and the solution as I would be interested in learning more about it?

Actually, I was solving a machine translation problem both with a Seq2Seq LSTM encoder-decoder model and with a Seq2Seq Transformer architecture.

I swapped the Transformer model for the Seq2Seq LSTM encoder-decoder model on the same dataloaders, to check whether the error was caused by dimensionality mismatches in the I/O of the layers inside the model. It turns out the Seq2Seq LSTM encoder-decoder fails too. Note that the Seq2Seq model had trained fine on Colab previously.

I haven't been able to get it running on Colab yet!

Colab has lately been throwing the following error too, which restricts pip installs within the runtime:

NotImplementedError: A UTF-8 locale is required. Got ANSI_X3.4-1968

Resolution:

import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

After doing this, I also upgraded to the latest nightly release via the following command:

!pip install --pre torch torchvision -f https://download.pytorch.org/whl/nightly/cu117/torch_nightly.html -U --force

The error is still not resolved!

Maybe it's an error related to Google Colab!

Reference Notebooks

  1. Seq2Seq LSTM encoder-decoder - https://github.com/Suraj520/nlp-basics-to-advanced/blob/main/21_machine-translation/notebook.ipynb

  2. Seq2Seq Transformer - https://github.com/Suraj520/nlp-basics-to-advanced/blob/main/23_transformer-machine-translation/notebook.ipynb

I’m unsure if you are still running into the same issue or a new one using the latest nightly binaries.
In any case, could you rerun the code with CUDA_LAUNCH_BLOCKING=1 and check the stacktrace to see what exactly is failing?

I tried!

No resolution yet! I tried setting CUDA_LAUNCH_BLOCKING=1 separately, without forcing the nightly install of PyTorch, since that gives the error

RuntimeError: cuDNN error: CUDNN_STATUS_MAPPING_ERROR

which I believe was related to a version mismatch between CUDA and cuDNN on the host instance (Colab).

The persistent error log remains:

RuntimeError: philox_cuda_state for an unexpected CUDA generator used during capture. In regions captured by CUDA graphs, you may only use the default CUDA RNG generator on the device that's current when capture begins. If you need a non-default (user-supplied) generator, or a generator on another device, please file an issue.

P.S.: To ensure the DataLoader's multiprocessing was not the cause, I also set num_workers=1.
I also tried with torch.backends.cudnn.benchmark = True.

Since you are still getting random error messages, the env variable was most likely not set correctly.
Could you post a minimal and executable code snippet to reproduce the issue, please?

Thanks for your patience. I already set the environment variable using the following snippet:

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

One observation: with or without the aforementioned snippet for setting the environment variable, the following training script for a simple sequential model in PyTorch runs fine.

import torch
import math
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

x = torch.linspace(-math.pi, math.pi, 2000)
y = torch.sin(x)
p = torch.tensor([1, 2, 3])
xx = x.unsqueeze(-1).pow(p)

model = torch.nn.Sequential(
    torch.nn.Linear(3, 1),
    torch.nn.Flatten(0, 1)
)
model = model.to(device)

loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-6
for t in range(2000):
    xx = xx.to(device)
    y = y.to(device)
    y_pred = model(xx)
    loss = loss_fn(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Zero the gradients before running the backward pass.
    # (Note: no optimizer step / manual parameter update follows, which is why
    # the printed loss stays constant in the STDOUT below.)
    model.zero_grad()
    loss.backward()

STDOUT
cuda
99 17250.890625
199 17250.890625
299 17250.890625
399 17250.890625
499 17250.890625
599 17250.890625
699 17250.890625
799 17250.890625
899 17250.890625
999 17250.890625
1099 17250.890625
1199 17250.890625
1299 17250.890625
1399 17250.890625
1499 17250.890625
1599 17250.890625
1699 17250.890625
1799 17250.890625
1899 17250.890625
1999 17250.890625

But when I first run the model with my data loader for the task quoted above, neither its training loop starts, nor does the simple sequential model's training loop quoted above; both yield the following debug log. It gets resolved when I restart the runtime and run only the simple sequential model.

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

I am still looking into what in my model (the Seq2Seq model or the dataloader) is breaking!

Let me know if you can share a minimal and executable code snippet to reproduce the issue, so that I could try to debug it.


Here is a list of the things I tried:

  1. Changed the transformer model to a simpler version, a snippet of which is quoted below.
import torch 
import torch.nn as nn

INPUT_DIM = 358   # len(eng_vocab.stoi)
OUTPUT_DIM = 617  # len(ger_vocab.stoi)
dim_model = 256
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        
        self.english_embedding = nn.Embedding(INPUT_DIM, dim_model)
        self.german_embedding = nn.Embedding(OUTPUT_DIM, dim_model)
        self.transformer = nn.Transformer(d_model=dim_model, 
            num_encoder_layers=2, num_decoder_layers=2, 
            dropout=0.5, dim_feedforward=2048)
        self.fc1 = nn.Linear(dim_model, OUTPUT_DIM)
    
    def forward(self, inputs, targets):
        x = self.english_embedding(inputs)
        y = self.german_embedding(targets)
        tgt_mask = torch.triu(torch.ones(targets.size(0), targets.size(0)), diagonal=1).bool().to(device)
        out = self.transformer(x, y, tgt_mask=tgt_mask)
        out = self.fc1(out.permute(1, 0, 2)) # (batch, sequence, feature)
        return out.permute(1, 0, 2).reshape(-1, OUTPUT_DIM) # (sequence, batch, feature)

model = Net().to(device)
  2. Passed in random tensors of the sizes yielded by the data loader generator function, where input_seq = english_language_tokens and output_seq = german_language_tokens. A snippet is quoted below.
num_epochs = 1000
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss(ignore_index=0)
step=0

#Training
for epoch in range(num_epochs):
    model.train()
    input_seq = torch.randint(5, (4,2))
    output_seq = torch.randint(6,(6,2))
    input_seq = input_seq.to(device)
    output_seq = output_seq.to(device)
    pred = model(input_seq.to(device), output_seq[:-1,].to(device))
    loss = criterion(pred, output_seq[1:,].view(-1))
    optimizer.zero_grad()  # clear gradients from the previous iteration
    loss.backward()
    optimizer.step()
    print("Epoch- {}, Loss- {}".format(epoch, loss))
    

Refer to the attached image to see the dataloader's I/O shape.

If you run snippets 1 and 2, you'll see they run perfectly!

  3. Reduced the batch size to 1.
    However, when I pass in the data loaders with the English and German language tokens (the original ones from the dataset), training runs for an hour or so before the error whose log is quoted below is generated:
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasLtMatmul( ltHandle, computeDesc.descriptor(), &alpha_val, mat1_ptr, Adesc.descriptor(), mat2_ptr, Bdesc.descriptor(), &beta_val, result_ptr, Cdesc.descriptor(), result_ptr, Cdesc.descriptor(), &heuristicResult.algo, workspace.data_ptr(), workspaceSize, at::cuda::getCurrentCUDAStream())`

Refer to the attached image.

  4. Did a sanity check of the dataloader to see whether it breaks at some point.

Refer to the attached image.

The entire source code can be found at:
https://github.com/Suraj520/nlp-basics-to-advanced/blob/main/23_transformer-machine-translation/notebook.ipynb

Remarks

I found a similar issue - [Bug] RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)` when training tacotron2 · Issue #1517 · coqui-ai/TTS · GitHub
It seems like an issue related to the CUDA Toolkit and PyTorch on Google Colab.

Unfortunately, your current code does not reproduce the issue, so I won't be able to debug it.

I doubt it and would instead claim that your input tensors might contain vocabulary indices which are out of bounds for, e.g., your embeddings.
The error reporting might be broken if you are using torch==1.13.1, and the blocking-launches env variable might be set too late, so you are running into CUDA errors which look random.
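One way to check this (a minimal sketch, not from the thread; INPUT_DIM/OUTPUT_DIM refer to the simplified Net above) is to assert that every token index stays within the corresponding embedding's vocabulary size before the forward pass:

def check_indices(batch, vocab_size, name):
    # Raise early on the CPU instead of triggering a device-side assert later.
    max_idx = int(batch.max())
    min_idx = int(batch.min())
    assert 0 <= min_idx and max_idx < vocab_size, \
        f"{name}: index out of range (min={min_idx}, max={max_idx}, vocab_size={vocab_size})"

# Usage inside the training loop, before calling the model:
# check_indices(input_seq, INPUT_DIM, "input_seq")
# check_indices(output_seq, OUTPUT_DIM, "output_seq")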


Thanks! I think I missed the <OOV> token while building the vocabulary, assuming that just <START> and <END> would suffice. I did that on the assumption that the vocabulary was built on the entire data frame's content, so no <OOV> token would be encountered during training, though it may be encountered during inference.

This may be one of the causes of the highlighted error! I will recheck the code.
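For illustration, a minimal sketch (assuming a simple dict-based stoi vocabulary, not the notebook's exact implementation) of mapping tokens to indices with an <OOV> fallback, so unknown words never produce out-of-range indices for the embedding:

OOV_IDX = 3  # assumed position of the <OOV> token in the vocabulary

def tokens_to_indices(tokens, stoi):
    # Fall back to OOV_IDX for any token missing from the vocabulary.
    return [stoi.get(token, OOV_IDX) for token in tokens]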

The explanation looks intuitive enough, especially given how the code breaks at around 19k steps of epoch 0!

Thanks for your time and patience in this thread! :smiley:

Sure, let me know once you’ve isolated the issue.
For the sake of clarity: the error reporting on torch==1.13.1 was not properly capturing all CUDA asserts and was thus causing error messages, which are not really helpful.
Since you are seeing seemingly “random” CUDA errors, it seems you are also running into such a case.
The current nightly binaries would fix it, so you could also consider updating, which should then yield a proper error message again.

Sure, noted!

P.S.: The previous reply somehow discarded the <OOV>, <START> and <END> tag mentions, maybe because the angle-bracket pairs were treated as tags.

What does this code do and why is it necessary?

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

CUDA_LAUNCH_BLOCKING=1 is used for debugging and disables asynchronous kernel launches, so the stacktrace points to the operation that actually failed.
I would not recommend setting it inside your Python script; instead, properly export this environment variable in your terminal.
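In a notebook environment such as Colab, where exporting in a terminal before launch isn't straightforward, a common workaround (a sketch, not a recommendation from this thread) is to set the variable in the very first cell, before anything initializes CUDA:

# Must run before CUDA is initialized (i.e. before the first CUDA call),
# otherwise the setting is silently ignored for the current process.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after the environment variable is set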