Semantic segmentation: Assertion `t >= 0 && t < n_classes` failed with DistributedDataParallel

Dear all,

I am running into a problem when training a segmentation net on Pascal VOC using multiple GPUs (DistributedDataParallel). PyTorch seems to expect targets to have label values only between 0 and n_classes - 1 (inclusive). However, this does not account for the ignore label (whose value is often 255).

More specifically, I get the error:
aten/src/THCUNN/SpatialClassNLLCriterion.cu:106: Assertion `t >= 0 && t < n_classes` failed.

In my case, n_classes = 21 (Pascal VOC 2012) and my targets have label values in [0, 20] plus 255 (the ignore label). I ignore the 255 label in the cross-entropy loss. How do I get targets with ignore labels to pass the assertion?

Interestingly, I don’t get this assertion failure when DistributedDataParallel is not used.

I cannot reproduce the issue using either a single device or DDP.
If I pass ignore_index=255 to the criterion, the code runs fine.
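
For reference, a minimal sketch of what passing ignore_index to the criterion looks like. The tensor shapes and the random data are made up for illustration and are not taken from the original poster's code; only n_classes = 21 and the ignore label 255 come from the question. It runs on CPU, so it does not exercise the CUDA assertion itself.

```python
import torch
import torch.nn as nn

n_classes = 21      # Pascal VOC 2012, as stated in the question
ignore_label = 255  # label value marking pixels to skip

# Hypothetical batch: 2 images of 8x8 pixels (illustrative shapes only).
logits = torch.randn(2, n_classes, 8, 8)         # (N, C, H, W) raw scores
target = torch.randint(0, n_classes, (2, 8, 8))  # (N, H, W) class indices
target[:, 0, :] = ignore_label                   # mark the top row as "ignore"

# Pixels whose target equals ignore_index are left out of the loss and
# contribute no gradient; the mean is taken over the remaining pixels.
criterion = nn.CrossEntropyLoss(ignore_index=ignore_label)
loss = criterion(logits, target)
```

With the default reduction ("mean"), the result should match the loss computed on the non-ignored pixels alone.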

Could you post an executable code snippet to reproduce this issue, please?

Thanks @ptrblck. My MWE would be a bit convoluted as I am using DDP through PyTorch Lightning (I don’t think that is the source of this error, however). Could you please post the code snippet that works for you? I can modify that to illustrate the assertion failure.

I used the ImageNet example and set one target value to 1001, once on a single GPU and once with DDP.
In both cases it crashed with the expected error message. After passing ignore_index=1001 to the criterion, both runs passed.

Thanks! I went back and checked the code very carefully. I had forgotten to ignore the same label for the validation dataset, and that is what failed the assertion.
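
For anyone who lands here with the same assertion, one way to find this kind of mismatch is to check each target batch against the ignore_index that the criterion in use is actually configured with. The helper below is a hypothetical sketch (the name unexpected_labels is not a PyTorch function); it relies only on the ignore_index attribute that nn.CrossEntropyLoss stores.

```python
import torch
import torch.nn as nn

def unexpected_labels(target, n_classes, ignore_index):
    """Return the sorted label values in `target` that the loss would reject.

    Hypothetical helper: a value is fine if it lies in [0, n_classes)
    or equals `ignore_index`; everything else is reported.
    """
    valid = ((target >= 0) & (target < n_classes)) | (target == ignore_index)
    return torch.unique(target[~valid]).tolist()

# Illustrative setup: the training loss ignores 255, the validation loss does not.
train_criterion = nn.CrossEntropyLoss(ignore_index=255)
val_criterion = nn.CrossEntropyLoss()  # ignore_index left at its default

target = torch.tensor([[0, 20, 255]])
train_bad = unexpected_labels(target, 21, train_criterion.ignore_index)
val_bad = unexpected_labels(target, 21, val_criterion.ignore_index)
```

Here train_bad is empty while val_bad contains 255, which points at the validation criterion as the one missing ignore_index. Running such a check on the CPU side before the loss call avoids having to decode the device-side assertion.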