"RuntimeError: CUDA error: device-side assert triggered" when training a question answering model

Sathsara_Rasantha · December 27, 2020, 4:14am

Hi All,

I am trying to implement the BiDAF model for a Sinhala question answering dataset.
This dataset was created by translating a portion of SQuAD into Sinhala using Google Translate.
I am using Google Colab for this.
I found code that implements BiDAF for the SQuAD (English) dataset in PyTorch,
and I modified it for my dataset.
I changed some data pre-processing steps and used fastText word embeddings instead of GloVe word embeddings.

I ran into a lot of issues along the way.
I solved many of them, but this one is very difficult for me to understand because I am very new to PyTorch.
This is my final year research project as well.
I kindly request someone to take a look at this issue and help me solve it. Any kind of help is highly appreciated. Thanks in advance.

This is the link to the Colab notebook : https://colab.research.google.com/drive/1zBn-jU_y-NbBOXR_eAi6j1lPxO-EhMSa?usp=sharing

Abhilash_Srivastava · December 27, 2020, 11:33am

You’ll need to be more specific than that. The more specific you are, the easier it will be for us to understand and suggest solutions.

From the outset, you can try disabling CUDA and first try to make it work end to end on a CPU (with a smaller dataset).
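A minimal sketch of that CPU-first debugging setup follows. The tiny `nn.Linear` model and the random batch are hypothetical stand-ins, not the notebook's code; the idea is only that everything (model, inputs, targets) is placed on a CPU device so that an invalid index raises a readable Python exception instead of a device-side assert.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Force CPU while debugging; switch back to "cuda" once the error is fixed.
device = torch.device("cpu")

# Hypothetical stand-ins: replace with the real model and a small slice of the dataset.
model = nn.Linear(8, 4).to(device)
inputs = torch.randn(2, 8, device=device)
targets = torch.tensor([1, 3], device=device)

# One forward/backward step, entirely on the CPU.
loss = F.cross_entropy(model(inputs), targets)
loss.backward()
print(loss.item())
```

If the run has to stay on the GPU, setting the environment variable `CUDA_LAUNCH_BLOCKING=1` before the process starts makes kernel launches synchronous, which usually lets the stack trace point at the operation that actually failed.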

Sathsara_Rasantha · December 27, 2020, 1:55pm

@Abhilash_Srivastava Thank you very much for replying.
I added the notebook link for my code because I thought it would help.
Anyway I will try what you suggested.

Sathsara_Rasantha · December 27, 2020, 6:23pm

@Abhilash_Srivastava I tried running the code on the CPU instead of the GPU. Now I am getting a much more detailed trace of the error.

Starting training …
Starting batch: 0

IndexError Traceback (most recent call last)
in ()
----> 1 train(model, train_dataset)

2 frames
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in nll_loss(input, target, weight, size_average, ignore_index, reduce, reduction)
2262 .format(input.size(0), target.size(0)))
2263 if dim == 2:
→ 2264 ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
2265 elif dim == 4:
2266 ret = torch._C._nn.nll_loss2d(input, target, weight, _Reduction.get_enum(reduction), ignore_index)

IndexError: Target 545 is out of bounds.

Can you explain what this error is supposed to mean?

ptrblck · December 27, 2020, 10:05pm

The error points to an invalid target index. If your maximum target value is 545, you are dealing with at least 546 classes, so the model output should have the shape [batch_size, 546] (or larger in dim1) for a multi-class classification use case.
This error is raised if any target index is greater than or equal to the size of dim1 of the model output (valid targets are 0 to size - 1), so you would have to check the model output.
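One way to check this before the loss call is sketched below. The helper `check_targets` and the shapes (400 output classes, a target of 545) are hypothetical and chosen only to mirror the error message; the sketch also assumes no `ignore_index` targets are in use, since it treats every negative target as invalid.

```python
import torch
import torch.nn.functional as F

def check_targets(log_probs, target):
    """Raise a readable error if any target index falls outside dim1 of the output."""
    num_classes = log_probs.size(1)
    lo, hi = int(target.min()), int(target.max())
    if lo < 0 or hi >= num_classes:
        raise ValueError(
            f"target range [{lo}, {hi}] does not fit {num_classes} classes "
            f"(valid indices: 0..{num_classes - 1})"
        )

# Hypothetical output with only 400 classes, but a target index of 545.
log_probs = F.log_softmax(torch.randn(2, 400), dim=1)
target = torch.tensor([17, 545])

try:
    check_targets(log_probs, target)
    loss = F.nll_loss(log_probs, target)
except ValueError as err:
    message = str(err)
    print(message)

# With at least 546 classes in dim1, the same targets are accepted.
wide_log_probs = F.log_softmax(torch.randn(2, 546), dim=1)
check_targets(wide_log_probs, target)
ok_loss = F.nll_loss(wide_log_probs, target)
```

Printing `output.shape` and `target.max()` right before the loss in the training loop gives the same information with less code; the helper just makes the mismatch explicit.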

Sathsara_Rasantha · December 30, 2020, 11:02am

@ptrblck Thank you very much. This explanation seems very helpful. I’ll look at the model output.