Hello all. I’ve been trying to train a model with a dice loss function that has two inputs, “logits” and “true” with shapes of [4, 6, 256, 256] and [4, 256, 256] respectively. Unfortunately, the training process crashes with the following error:
IndexError: index 68788682753 is out of bounds for dimension 0 with size 6
and it crashes on the second of these two lines:
num_classes = logits.shape[1]
true_1_hot = torch.eye(num_classes)[true].cuda()
The annoying thing about this error is that it isn’t consistent: in two tests I’ve done, the first run trained for 5 epochs before crashing with this error, and the next one trained for 150 epochs before crashing. I can’t figure out how it gets through this exact code for many training iterations without any issues and then suddenly crashes. The index it crashes on always remains the same, however.
Any help would be greatly appreciated!
It looks like you’ve got some indexing problems. Maybe one sample in your dataset has a wrong label value, and the error appears at random depending on when that sample is drawn by the dataloader?
Also, CUDA errors are asynchronous, so if the error is raised within CUDA, you can set the environment variable CUDA_LAUNCH_BLOCKING=1 to make sure the error is raised at the right place.
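One way to catch a bad label at the point where it enters the loss is to validate the range before indexing. The following is a minimal sketch, not the original poster's code: the helper name one_hot_checked is hypothetical, and it assumes integer labels of shape [B, H, W] and a one-hot output of shape [B, C, H, W] to match the logits.

```python
import torch


def one_hot_checked(true: torch.Tensor, num_classes: int) -> torch.Tensor:
    """One-hot encode labels [B, H, W] -> [B, C, H, W], failing loudly on bad labels.

    Hypothetical helper for illustration only.
    """
    lo, hi = int(true.min()), int(true.max())
    if lo < 0 or hi >= num_classes:
        # Report the offending range instead of a bare IndexError
        raise ValueError(
            f"label values span [{lo}, {hi}], expected [0, {num_classes - 1}]"
        )
    # Same lookup as torch.eye(num_classes)[true], built on the labels' device
    eye = torch.eye(num_classes, device=true.device)
    # Indexing gives [B, H, W, C]; move the class axis next to the batch axis
    return eye[true.long()].permute(0, 3, 1, 2)


true = torch.randint(0, 6, (4, 256, 256))
one_hot = one_hot_checked(true, num_classes=6)
print(one_hot.shape)  # torch.Size([4, 6, 256, 256])
```

With a check like this, a value such as 68788682753 in a mask would raise a ValueError naming the label range on the batch that contains it.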
Hi, thanks for the reply! I turned off the “shuffle” parameter for my dataloader, which I think should now guarantee that each epoch should go through all of the training images sequentially. So far, after a couple epochs of training, no error. What I figured was that if it was a mask image somewhere in the training set causing the issue, having the dataloader go through all of them SHOULD force the error to pop up, right?
Yes it should.
If you re-enable shuffling, does the error happen again? Can you add some printing to check that all the Tensors have the right values?
If you can make a small code sample (~30 lines) that reproduces the problem, that would be very helpful!
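The check suggested here could look roughly like the sketch below, which walks the dataset once without the dataloader and reports every sample whose mask contains a label outside the valid range. It assumes the dataset returns (image, mask) pairs; find_bad_masks and ToyDataset are made-up names for illustration, with ToyDataset standing in for the real dataset.

```python
import torch
from torch.utils.data import Dataset


def find_bad_masks(dataset, num_classes):
    """Return (index, min, max) for each sample with labels outside [0, num_classes)."""
    bad = []
    for i in range(len(dataset)):
        _, mask = dataset[i]  # assumes (image, mask) pairs
        lo, hi = int(mask.min()), int(mask.max())
        if lo < 0 or hi >= num_classes:
            bad.append((i, lo, hi))
    return bad


class ToyDataset(Dataset):
    """Stand-in dataset: sample 1 carries one corrupt label value."""

    def __len__(self):
        return 3

    def __getitem__(self, i):
        image = torch.zeros(3, 8, 8)
        mask = torch.zeros(8, 8, dtype=torch.long)
        if i == 1:
            mask[0, 0] = 255  # deliberately out of range for 6 classes
        return image, mask


report = find_bad_masks(ToyDataset(), num_classes=6)
print(report)  # [(1, 0, 255)]
```

If masks are modified by random augmentation at load time, a single pass like this may not see every value that training sees, so an empty report is not a guarantee.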
I’m honestly not sure, but I may have fixed this while fixing another issue from a post earlier this week, which involved file paths for images and masks and possible mismatches between raw images and their corresponding masks. Perhaps that was causing this indexing issue? At the very least, I haven’t had any crashes since I made those changes to ensure all images and masks are matched correctly on Linux (and Windows), and I haven’t been able to replicate it.
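For anyone who hits the same thing, a pairing check along these lines might help confirm that every image has a mask and vice versa. This is only a sketch under assumptions: it supposes images and masks live in two separate folders and are matched by filename stem, which may not be how the poster's data is laid out, and unmatched_stems is a hypothetical helper name.

```python
import tempfile
from pathlib import Path


def unmatched_stems(image_dir, mask_dir):
    """Return (stems with no mask, stems with no image), each sorted."""
    image_stems = {p.stem for p in Path(image_dir).iterdir() if p.is_file()}
    mask_stems = {p.stem for p in Path(mask_dir).iterdir() if p.is_file()}
    return sorted(image_stems - mask_stems), sorted(mask_stems - image_stems)


# Demo on a throwaway directory tree (folder and file names are made up)
with tempfile.TemporaryDirectory() as root:
    images = Path(root, "images")
    masks = Path(root, "masks")
    images.mkdir()
    masks.mkdir()
    for name in ("a.png", "b.png", "c.png"):
        (images / name).touch()
    for name in ("a.png", "b.png", "d.png"):
        (masks / name).touch()
    missing_masks, missing_images = unmatched_stems(images, masks)

print(missing_masks, missing_images)  # ['c'] ['d']
```

Matching by name rather than by position in a directory listing also avoids depending on listing order, which is not guaranteed to be the same across platforms.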