Need help: I got this error "CUDA error: device-side assert triggered..."

Please, I need help running my model; I am stuck!
I am trying to train a Siamese BERT model on a particular dataset (which I transformed into a DataLoader…).

But I get this error (the GPU is a Tesla P100-PCIE-16GB):

RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_34/2427691835.py in <module>
     11         optim.zero_grad()
     12         # prepare batches and move all to the active device
---> 13         inputs_ids_a = batch['code_file1_input_ids'].to(device)
     14         inputs_ids_b = batch['code_file2_input_ids'].to(device)
     15         attention_a = batch['code_file1_attention_mask'].to(device)

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Here is my training loop:

from tqdm.auto import tqdm

for epoch in range(4):
    model.train()  # make sure model is in training mode
    # initialize the dataloader loop with tqdm (tqdm == progress bar)
    loop = tqdm(loader, leave=True)
    for batch in loop:
        # zero all gradients on each new step
        optim.zero_grad()
        # prepare batches and move all to the active device
        inputs_ids_a = batch['code_file1_input_ids'].to(device)
        inputs_ids_b = batch['code_file2_input_ids'].to(device)
        attention_a = batch['code_file1_attention_mask'].to(device)
        attention_b = batch['code_file2_attention_mask'].to(device)
        label = batch['similar_or_different'].to(device)
        # extract token embeddings from BERT
        u = model(inputs_ids_a, attention_mask=attention_a)[0]  # all token embeddings A
        v = model(inputs_ids_b, attention_mask=attention_b)[0]  # all token embeddings B
        # ... (mean pooling of u and v and concatenation into x omitted) ...
        # process concatenated tensor through FFNN
        x = ffnn(x)
        # calculate the 'softmax-loss' between predicted and true label
        loss = loss_func(x, label)
        # using loss, calculate gradients and then optimize
        loss.backward()
        optim.step()

Note that I already reduced the batch_size to 1.
And when I ran the following snippet just before this code, it passed!

from tqdm.auto import tqdm
loop = tqdm(loader, leave=True)
for batch in loop:
    inputs_ids_a = batch['code_file1_input_ids'].to(device)

Here is the output (I couldn't add the other screenshot because I was told new users can't embed two images…):

Rerun the code via CUDA_LAUNCH_BLOCKING=1 python script.py args, as described in the error message, or run it on the CPU to get a stacktrace pointing to the line of code causing the issue.
Often these asserts are triggered by e.g. an invalid indexing operation.
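For reference, a minimal sketch of setting that variable from inside a notebook (it has to be set before any CUDA work happens, so put it at the very top; the CPU fallback shown below is the same idea, and model is the name from your loop):

import os

# Must be set before the first CUDA call, i.e. before any tensor or model touches the GPU
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

# Alternative: run everything on the CPU to get a synchronous, accurate stacktrace
device = torch.device("cpu")
# model.to(device)  # then move the model and the batches to the CPU as usual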


Yes, I ran it on the CPU, and yes, the error is "IndexError: index out of range in self", raised at these lines of code:

u = model(inputs_ids_a, attention_mask=attention_a)[0]  
v = model(inputs_ids_b, attention_mask=attention_b)[0] 
.....

I got this error:

IndexError                                Traceback (most recent call last)

<ipython-input> in <module>()
     18 # extract token embeddings from BERT
     19 u = model(inputs_ids_a, attention_mask=attention_a)[0]
---> 20 v = model(inputs_ids_b, attention_mask=attention_b)[0]
     21 # get the mean pooled vectors
     22 u = mean_pool(u, attention_a)

/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   2042     # remove once script supports set_grad_enabled
   2043     _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2044     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   2045
   2046

IndexError: index out of range in self

I tried to break the code down to figure it out.
The problem is at the last line of this snippet, where I get the same error as above:

from tqdm.auto import tqdm

loop = tqdm(loader, leave=True)
for batch in loop:
  inputs_ids_a = batch['code_file1_input_ids'].to(device)
  inputs_ids_b = batch['code_file2_input_ids'].to(device)
  attention_a = batch['code_file1_attention_mask'].to(device)
  attention_b = batch['code_file2_attention_mask'].to(device)
  label = batch['similar_or_different'].to(device)
  u = model(inputs_ids_a, attention_mask=attention_a)[0]

But I don't get it. Could you please point me in the right direction? What should I do? Thanks.
Note that my DataLoader is built from the following preparation:

all_cols = ['similar_or_different']
for part in ['code_file1','code_file2']:
    train_code_net = train_code_net.map(lambda x: tokenizer(x[part], max_length=128, padding='max_length',truncation=True),
                                        batched=True)
    for col in ['input_ids', 'attention_mask']:
        train_code_net = train_code_net.rename_column(col, part+'_'+col)
        all_cols.append(part+'_'+col)
print(all_cols)
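(The remaining step, turning this tokenized dataset into the loader used in the training loop, is the usual set_format + DataLoader combination; roughly, as a sketch, with an illustrative batch size:)

from torch.utils.data import DataLoader

train_code_net.set_format(type='torch', columns=all_cols)
loader = DataLoader(train_code_net, batch_size=16, shuffle=True)  # batch size illustrative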

Here is a screenshot with the features:

and the original dataset (which I transformed into a DataLoader) is a simple dataset:

train_code_net

Dataset({
    features: ['code_file1', 'code_file2', 'similar_or_different'],
    num_rows: 10000
})

The indexing error is raised in:

return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)

so check the min and max values of input and make sure they match the weight dimension, i.e. the input should contain values in [0, num_embeddings-1].
Here is an example of a working and failing embedding operation:

import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=10, embedding_dim=100)

x = torch.tensor([0, 5, 9]) # valid indices as they are in [0, num_embeddings-1]
out = emb(x)

x = torch.tensor([0, 5, 10]) # invalid, since 10 is out of bounds
out = emb(x)
# > IndexError: index out of range in self
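In your case, a quick sanity check on one batch would be something like the following (assuming model is a Hugging Face BERT model so that model.config.vocab_size exists; the key names mirror your training loop):

batch = next(iter(loader))
for key in ['code_file1_input_ids', 'code_file2_input_ids']:
    ids = batch[key]
    print(key, 'min:', ids.min().item(), 'max:', ids.max().item(),
          '| model vocab_size:', model.config.vocab_size)
# Any max value >= vocab_size will trigger the embedding IndexError.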

Thanks a lot, I finally found the problem.
Yes, the problem is in my input: the vocab_size of the model (BERT base) is 30522, but my tokenizer is not BertTokenizer. It is another tokenizer built with the same architecture (the same tokenization algorithm as BERT, only the vocabulary is different, since it is not English). The problem is that when building this new tokenizer (from the original BERT tokenizer) I set its vocab_size to 50000, so the model crashes whenever it encounters token ids >= 30522 in my tokenized dataset.

I think I will build another tokenizer (from RoBERTa, which has approximately this vocab_size of 50000, but also a different tokenization algorithm: byte-level BPE rather than WordPiece as in BERT base), and I will replace BERT base with RoBERTa for the rest…
Thank you very much once more (in fact, it is obvious that the model would crash, but I built my tokenizer a while ago and lost sight of it…).
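For anyone hitting the same thing, the mismatch is easy to confirm (assuming a Hugging Face tokenizer and model, so len(tokenizer) and model.config.vocab_size are available):

print(len(tokenizer))           # 50000 in my case
print(model.config.vocab_size)  # 30522 for BERT base
# Any token id >= 30522 then fails in the embedding lookup.
# Besides switching the tokenizer/model pair, resizing the embedding matrix is
# another option: model.resize_token_embeddings(len(tokenizer))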

I solved it: it is OK now. Thanks.