HELP : RuntimeError: CUDA error: device-side assert triggered

guanghui_ma · June 19, 2023, 3:54am

I’ve been trying for days and still can’t solve the problem.

It always fails with the above error. On line 248, I simply select the hidden states that correspond to each label. I have checked that this operation is correct.

    for one_class in label_set:
        source_index = source_label == one_class
        target_index = target_label == one_class
        select_source_hidden = source_hidden[source_index]
        select_target_hidden = target_hidden[target_index]

ptrblck · June 19, 2023, 6:11am

An indexing operation is failing. You could rerun the code with CUDA_LAUNCH_BLOCKING=1, which would then point to the failing operation in the stack trace. Often, for example, an embedding layer fails if the input tensor contains out-of-bounds values.
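As a minimal sketch of the out-of-bounds check mentioned above: the helper `check_index_range` and the sizes used here are hypothetical and only illustrate the idea. It validates the index values on the host before they reach the embedding layer, so a bad value produces a readable Python error instead of a device-side assert.

```python
import torch


def check_index_range(indices: torch.Tensor, num_embeddings: int) -> None:
    """Raise a readable error if any index falls outside [0, num_embeddings)."""
    if indices.numel() == 0:
        return  # nothing to validate
    lo = int(indices.min())
    hi = int(indices.max())
    if lo < 0 or hi >= num_embeddings:
        raise ValueError(
            f"index range [{lo}, {hi}] is outside the valid range "
            f"[0, {num_embeddings - 1}]"
        )


# Hypothetical sizes, chosen only for illustration.
embedding = torch.nn.Embedding(num_embeddings=10, embedding_dim=4)
token_ids = torch.tensor([0, 3, 9])

check_index_range(token_ids, embedding.num_embeddings)  # passes silently
output = embedding(token_ids)  # shape: (3, 4)
```

The environment variable has to be set before the process starts, e.g. `CUDA_LAUNCH_BLOCKING=1 python train.py`, where `train.py` stands in for your own script. Running the same code on the CPU is another way to get a clearer error message.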

guanghui_ma · June 20, 2023, 2:35am

In fact, I used CUDA_LAUNCH_BLOCKING=1 very early on, and it shows me a tensor passing error. But I am sure the tensor is passed correctly.
Also, the CUDA error: device-side assert triggered shows up in many cases in the current project, but not in others. In several other projects, the same virtual environment works fine without any errors. Therefore, I suspect there is a conflict in the package installation.

I spent a day reconfiguring the new virtual environment, but still could not resolve the issue.

guanghui_ma · June 20, 2023, 2:37am

What I can’t understand is that in the new virtual environment, setting DDPStrategy(find_unused_parameters=True) does not give an error, while DDPStrategy(find_unused_parameters=False) gives the above error.

guanghui_ma · June 20, 2023, 2:41am

guanghui_ma:

    for one_class in label_set:
        source_index = source_label == one_class
        target_index = target_label == one_class
        select_source_hidden = source_hidden[source_index]
        select_target_hidden = target_hidden[target_index]

The error is always reported at this code. I just do a very simple operation here, i.e. I select the corresponding hidden states by label. source_hidden and target_hidden have the dimensions batch_size * hidden. The dimension of source_label and target_label is batch_size. I don’t think there is anything wrong with this code.

ptrblck · June 20, 2023, 8:05am

The line of code and the assert both point towards a failing indexing operation, so I would recommend checking the tensor shapes and the values of the indexing tensors to make sure they contain only valid indices.
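A rough sketch of such a check for the loop from the question: `select_by_class` is a hypothetical wrapper, and the tensors below are toy data rather than values from the actual project. It assumes the hidden states have shape (batch_size, hidden) and the labels have shape (batch_size,), as described above, and it raises a plain Python error when that assumption does not hold.

```python
import torch


def select_by_class(
    hidden: torch.Tensor, label: torch.Tensor, one_class: int
) -> torch.Tensor:
    """Return the rows of `hidden` whose label equals `one_class`."""
    if hidden.dim() != 2:
        raise ValueError(
            f"expected hidden of shape (batch, hidden), got {tuple(hidden.shape)}"
        )
    if label.dim() != 1 or label.shape[0] != hidden.shape[0]:
        raise ValueError(
            f"label shape {tuple(label.shape)} does not match "
            f"batch size {hidden.shape[0]}"
        )
    mask = label == one_class  # boolean mask of shape (batch,)
    return hidden[mask]


# Toy data, for illustration only.
source_hidden = torch.arange(12, dtype=torch.float32).reshape(4, 3)
source_label = torch.tensor([0, 1, 0, 2])

selected = select_by_class(source_hidden, source_label, one_class=0)
print(selected.shape)  # rows 0 and 2 are selected
```

A label tensor of shape (batch_size, 1) instead of (batch_size,) is one mismatch this check would catch. Calling it on `.cpu()` copies of the tensors right before the failing line should also give a more specific error than the device-side assert.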