Dear all,
I am running into a problem when training a segmentation net on Pascal VOC using multiple GPUs (DistributedDataParallel). PyTorch seems to expect targets to have label values only between 0 and n_classes - 1 (inclusive). However, this does not account for the ignore label (whose value is often 255).
More specifically, I get the error:
aten/src/THCUNN/SpatialClassNLLCriterion.cu:106: Assertion t >= 0 && t < n_classes failed.
In my case, n_classes = 21 (Pascal VOC 2012) and my targets take label values in [0, 20] plus 255 (the ignore label). I ignore the 255 label in the cross-entropy loss. How do I get targets with ignore labels to pass the assertion?
Interestingly, I don’t get this assertion failure when DistributedDataParallel is not used.
I cannot reproduce the issue, either on a single device or with DDP.
If I pass ignore_index=255 to the criterion, the code runs fine.
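For reference, here is a minimal sketch of what passing ignore_index=255 to the criterion looks like for segmentation-shaped tensors. The tensor sizes and the names logits, targets and ignore_label are made up for illustration, and it runs on the CPU; it is not the original poster's code.

```python
import torch
import torch.nn as nn

n_classes = 21      # Pascal VOC 2012, as in the question
ignore_label = 255  # value marking pixels that should not contribute to the loss

torch.manual_seed(0)
logits = torch.randn(2, n_classes, 4, 4)           # (N, C, H, W), illustrative sizes
targets = torch.randint(0, n_classes, (2, 4, 4))   # (N, H, W), values in [0, 20]
targets[:, 0, :] = ignore_label                    # mark one row of pixels as "ignore"

# Pixels whose target equals ignore_index are skipped by the loss
criterion = nn.CrossEntropyLoss(ignore_index=ignore_label)
loss = criterion(logits, targets)

# Cross-check: the same value is obtained by averaging over the valid pixels only
valid = targets != ignore_label
reference = nn.functional.cross_entropy(
    logits.permute(0, 2, 3, 1)[valid], targets[valid]
)
print(torch.allclose(loss, reference))
```

With the default mean reduction, ignored pixels are also left out of the averaging, which is why the two values agree.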
Could you post an executable code snippet to reproduce this issue, please?
Thanks @ptrblck. My MWE would be a bit convoluted, as I am using DDP through PyTorch Lightning (I don’t think that is the source of this error, however). Could you please post the code snippet that works for you? I can modify that to illustrate the assertion failure.
I used the ImageNet example and set one target value to 1001, once using a single GPU and once using DDP.
In both cases it crashed with the expected error message. After using ignore_index=1001, both runs passed.
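A rough CPU-only sketch of that experiment, with random tensors standing in for the ImageNet example (the batch size and variable names are assumptions). On the CPU an out-of-range target should surface as a Python exception rather than the device-side assertion seen on the GPU; the exact exception type may differ between PyTorch versions, so both likely types are caught here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_classes = 1000                          # ImageNet-style classification
logits = torch.randn(4, n_classes)        # (N, C), illustrative batch
targets = torch.randint(0, n_classes, (4,))
targets[0] = 1001                         # deliberately out of range

# Without ignore_index, the out-of-range target is rejected
try:
    nn.CrossEntropyLoss()(logits, targets)
    failed_without_ignore = False
except (IndexError, RuntimeError) as err:
    failed_without_ignore = True
    print("failed as expected:", err)

# With ignore_index set to the offending value, that sample is skipped
loss = nn.CrossEntropyLoss(ignore_index=1001)(logits, targets)
print("loss with ignore_index:", loss.item())
```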
Thanks! I went back and checked the code very carefully. I had forgotten to ignore the 255 label for the validation dataset as well, and that is what failed the assertion.
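One way to catch this kind of mismatch early is to scan the targets of every split against the settings of the criterion used for that split. The helper below, find_invalid_labels, is a hypothetical name and not part of PyTorch; the small tensors stand in for real validation masks.

```python
import torch

def find_invalid_labels(target, n_classes, ignore_index=None):
    """Return the label values in `target` that the loss would reject."""
    values = torch.unique(target)
    invalid = (values < 0) | (values >= n_classes)
    if ignore_index is not None:
        # The ignore label is allowed even though it lies outside [0, n_classes)
        invalid &= values != ignore_index
    return values[invalid].tolist()

# Toy validation mask containing the ignore label 255
val_target = torch.tensor([[0, 20, 255], [3, 255, 7]])

# Validation criterion created without ignore_index: 255 is reported as invalid
print(find_invalid_labels(val_target, n_classes=21))                    # [255]

# Validation criterion created with ignore_index=255: nothing is reported
print(find_invalid_labels(val_target, n_classes=21, ignore_index=255))  # []
```

Running such a check once over both the training and validation targets, with the same ignore_index each criterion actually receives, would flag a split whose loss was set up without the ignore label.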