I’ve been trying for days and still can’t solve this problem.
It always fails with the above error. On line 248, I simply select the hidden states corresponding to each label, and I have made sure that this operation itself is error-free.
for one_class in label_set:
    source_index = source_label == one_class
    target_index = target_label == one_class
    select_source_hidden = source_hidden[source_index]
    select_target_hidden = target_hidden[target_index]
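The same selection can be exercised on the CPU with small dummy tensors to confirm the masking itself is sound; this is only a sketch, with made-up shapes and labels standing in for the real batch:

```python
import torch

# Dummy stand-ins for the real tensors: hidden states of shape
# (batch_size, hidden) and 1-D integer labels of shape (batch_size,).
batch_size, hidden = 4, 8
source_hidden = torch.randn(batch_size, hidden)
source_label = torch.tensor([0, 1, 0, 2])

label_set = source_label.unique()
for one_class in label_set:
    # Boolean mask of shape (batch_size,); it must match the first
    # dimension of the tensor it indexes and live on the same device.
    source_index = source_label == one_class
    select_source_hidden = source_hidden[source_index]
```

On the CPU an invalid mask fails loudly with a readable traceback, which makes it a quick way to rule the indexing itself in or out.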
An indexing operation fails. You could rerun the code with
CUDA_LAUNCH_BLOCKING=1, which would then point to the failing operation in the stack trace. Often an embedding layer is failing, e.g. when the input tensor contains values that are out of bounds.
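Concretely, you would relaunch the script as `CUDA_LAUNCH_BLOCKING=1 python train.py` (script name assumed here). The embedding failure mode can be reproduced on the CPU in a minimal sketch, where the same out-of-bounds input raises a synchronous Python error instead of the asynchronous device-side assert:

```python
import torch

# An nn.Embedding with 10 rows accepts indices 0..9 only.
emb = torch.nn.Embedding(num_embeddings=10, embedding_dim=4)
bad_idx = torch.tensor([3, 10])  # 10 is out of bounds

# On the CPU this raises IndexError immediately; on the GPU the same
# input surfaces later as "CUDA error: device-side assert triggered".
try:
    emb(bad_idx)
    failed = False
except IndexError:
    failed = True
```

Moving the failing module and inputs to the CPU is therefore a common way to turn the opaque CUDA assert into a readable error message.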
In fact, I used CUDA_LAUNCH_BLOCKING=1 very early on, and it points to an error in the tensor being passed. But I can confirm that the tensor is passed correctly.
Also, the CUDA error: device-side assert triggered is reproduced consistently in the current project, but not in others. In several other projects, my virtual environment works fine without any errors. Therefore, I suspect a conflict somewhere in the installed packages.
I spent a day setting up a new virtual environment, but still could not resolve the issue.
What I can’t understand is that in the new virtual environment, setting DDPStrategy(find_unused_parameters=True) does not give an error, while DDPStrategy(find_unused_parameters=False) gives the above error.
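For reference, the two configurations being compared look roughly like this (a sketch assuming PyTorch Lightning's DDPStrategy; the Trainer arguments are illustrative):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DDPStrategy

# Works: DDP tolerates parameters that receive no gradient in a step.
trainer_ok = Trainer(strategy=DDPStrategy(find_unused_parameters=True))

# Fails with the device-side assert in this project.
trainer_bad = Trainer(strategy=DDPStrategy(find_unused_parameters=False))
```

With find_unused_parameters=True, DDP walks the autograd graph each step to mark parameters that got no gradient, which changes how and when reductions are launched; that can mask or shift the point where a latent indexing error surfaces.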
The code where the error occurs is always shown here. I perform a very simple operation: selecting the corresponding hidden states by label. source_hidden and target_hidden have shape batch_size * hidden, and source_label and target_label have shape batch_size. I don’t think there is anything wrong with this code.
The line of code and the assert both point towards a failing indexing operation, so I would recommend checking the tensor shapes and the values of the indexing tensors to make sure they contain only valid indices.
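Such a check could be sketched as a small helper run just before the failing line (the function name and messages are my own; the tensor names mirror the snippet above):

```python
import torch

def check_selection(hidden: torch.Tensor, label: torch.Tensor) -> None:
    """Validate that a label tensor can safely mask-index a hidden tensor."""
    assert label.dim() == 1, f"label should be 1-D, got shape {tuple(label.shape)}"
    assert hidden.shape[0] == label.shape[0], (
        f"batch mismatch: hidden has {hidden.shape[0]} rows, "
        f"label has {label.shape[0]} entries"
    )
    assert hidden.device == label.device, (
        "hidden and label must be on the same device for mask indexing"
    )

# Dummy tensors standing in for the real batch:
source_hidden = torch.randn(4, 8)
source_label = torch.tensor([0, 1, 0, 2])
check_selection(source_hidden, source_label)
```

If any assertion fires on the CPU, the same mismatch is a likely cause of the device-side assert on the GPU.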