Fault-tolerant Dataset/DataLoader

Hi,

I have a PyTorch segmentation training job running on a SageMaker-managed EC2 instance, and training fails after 1000 minibatches when the training DataLoader breaks with: “RuntimeError: The size of tensor a (4) must match the size of tensor b (3) at non-singleton dimension 0”

What could cause this? One bad record in the dataset? Is there a way to build fault-tolerant datasets/data loaders that don’t break if one record is bad?

The error is most likely caused by an image with 4 channels (an additional alpha channel), while the rest seem to use 3 channels (RGB only).
You could iterate your DataLoader once and make sure it’s able to create all batches.
How would a “fault-tolerant” dataset work? You could use a try/except block in Dataset.__getitem__, but it’s unclear what you would want to return when a sample is broken.
E.g. returning a random tensor or a static tensor instead would interfere with your training.
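A minimal sketch of the sanity-check pass suggested above, assuming a hypothetical `train_dataset` that returns (image, mask) pairs; using batch_size=1 makes it easy to pinpoint the offending index:

```python
import torch
from torch.utils.data import DataLoader

# batch_size=1 isolates bad samples so a single broken record doesn't hide in a batch
loader = DataLoader(train_dataset, batch_size=1, num_workers=0)

for idx, (image, mask) in enumerate(loader):
    # image has shape [N, C, H, W] with N == 1 here
    if image.shape[1] != 3:
        print(f"Sample {idx} has {image.shape[1]} channels: {tuple(image.shape)}")
```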


Indeed it was! A 4-channel PNG hidden among my 40k 3-channel PNGs… forcing torchvision’s read_image to read as RGB fixed the issue.
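For reference, a minimal sketch of that fix, assuming images are loaded with torchvision.io.read_image inside the Dataset (the file path is hypothetical); forcing RGB drops a stray alpha channel so every sample ends up with 3 channels:

```python
from torchvision.io import read_image, ImageReadMode

# Always returns a [3, H, W] tensor, even for RGBA PNGs
image = read_image("sample.png", mode=ImageReadMode.RGB)
```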

Yes, I guess the easiest option here is to add a try/except in the DataLoader to discard full batches that cause errors. Would you know how to do that? I cannot find resources on creating a custom DataLoader class (the page named “Developing Custom PyTorch Dataloaders” only customizes the dataset and transforms, not the DataLoader class itself).
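One possible approach, sketched below rather than an official PyTorch feature: keep the DataLoader class as-is, catch errors in Dataset.__getitem__ and return None for broken samples, then drop those Nones in a custom collate_fn. The `SegDataset` class and the `image_paths`/`mask_paths` lists are hypothetical placeholders for your own data:

```python
import torch
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.dataloader import default_collate
from torchvision.io import read_image, ImageReadMode


class SegDataset(Dataset):
    """Hypothetical segmentation dataset: image/mask file paths are supplied as lists."""

    def __init__(self, image_paths, mask_paths):
        self.image_paths = image_paths
        self.mask_paths = mask_paths

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        try:
            # Force RGB so an RGBA PNG cannot break batching
            image = read_image(self.image_paths[idx], mode=ImageReadMode.RGB).float() / 255.0
            mask = read_image(self.mask_paths[idx], mode=ImageReadMode.GRAY).long()
            return image, mask
        except Exception as e:
            print(f"Skipping sample {idx}: {e}")
            return None  # mark the sample as broken


def skip_broken_collate(batch):
    # Drop None entries produced by broken samples, then collate the rest
    batch = [sample for sample in batch if sample is not None]
    if len(batch) == 0:
        return None  # signal the training loop to skip this batch entirely
    return default_collate(batch)


loader = DataLoader(
    SegDataset(image_paths, mask_paths),  # image_paths / mask_paths: your own lists
    batch_size=8,
    collate_fn=skip_broken_collate,
)

for batch in loader:
    if batch is None:
        continue  # every sample in this batch was broken
    images, masks = batch
    # ... training step ...
```

With this setup a bad record only shrinks the affected batch (or skips it entirely if all samples fail), instead of crashing the whole run.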