Questions about Dataloader and Dataset


(Yun Chen) #1

I run my dataloader like this:

import torch as t
from torchvision import transforms
from torchvision.datasets import ImageFolder

dataset = ImageFolder('/home/x/data/pre/train/',
                      transform=transforms.Compose([
                          transforms.Scale(opt.image_size),
                          transforms.RandomCrop(opt.image_size),
                          transforms.ToTensor(),
                          transforms.Normalize([0.5] * 3, [0.5] * 3),
                      ]))

dataloader = t.utils.data.DataLoader(dataset, opt.batch_size, True,
                                     num_workers=opt.workers)

but some images are corrupted, so it raises an error (a PIL error in default_loader: Image.open(path).convert('RGB')).

My temporary fix is to modify torchvision/datasets/folder.py (line 65) to:

        try:
            img = self.loader(os.path.join(self.root, path))
        except Exception as e:
            # fall back to a neighbouring index when an image fails to load
            index = index - 1 if index > 0 else index + 1
            return self.__getitem__(index)

Is there a better way to solve this, e.g. modifying the code of the DataLoader to load another image on an exception, or writing a new loader? Or does it work if I simply return None when an exception is caught?


#2

You can look at the default_collate function to see whether it handles None or not.
Or you can pass a custom collate function to your DataLoader which will handle None:
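
For example, a minimal sketch of such a collate function (my own illustration, reusing the dataset and opt objects from the original post) could look like this:

from torch.utils.data.dataloader import default_collate

def my_collate(batch):
    # drop samples for which the dataset returned None, then collate as usual
    batch = [sample for sample in batch if sample is not None]
    return default_collate(batch)

dataloader = t.utils.data.DataLoader(dataset, opt.batch_size, True,
                                     collate_fn=my_collate,
                                     num_workers=opt.workers)

Note that batches built this way can come out smaller than opt.batch_size, since filtered-out samples are not replaced (and an all-None batch would still fail inside default_collate).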


(Adam Paszke) #3

The DataLoader doesn’t support recovery from errors in datasets, because it would be too complex to add while keeping the guarantee that batches are always returned in exactly the order the sampler generated. It’s not even clear what behaviour users expect (some people might want to be notified about the error, others might not).


(Yun Chen) #4

Thanks @smth @apaszke, that really gives me a deeper understanding of the DataLoader.

At first I tried:

from PIL import Image
from torch.utils.data.dataloader import default_collate

def my_loader(path):
    try:
        return Image.open(path).convert('RGB')
    except Exception as e:
        print(e)  # returns None for images that fail to load

def my_collate(batch):
    "Puts each data field into a tensor with outer dimension batch size"
    batch = [sample for sample in batch if sample is not None]
    return default_collate(batch)

dataset = ImageFolder('/home/x/train/',
                      transform=transforms.Compose([transforms.ToTensor()]),
                      loader=my_loader)
dataloader = t.utils.data.DataLoader(dataset, 4, True, collate_fn=my_collate)

It raises an exception, because the transforms in the dataset can’t handle None.

So then I tried this:

def my_collate(batch):
    batch = [sample for sample in batch if sample is not None]
    return default_collate(batch)

class MyImageFolder(ImageFolder):
    def __getitem__(self, index):
        try:
            return super(MyImageFolder, self).__getitem__(index)
        except Exception as e:
            print(e)  # returns None when loading or transforming fails

dataset = MyImageFolder('/home/x/train/',
                        transform=transforms.Compose([transforms.ToTensor()]))
dataloader = t.utils.data.DataLoader(dataset, 4, True, collate_fn=my_collate)

Not so Pythonic, but it works.
And I think the best way may be just cleaning the data.
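
For what it’s worth, a one-off scan along those lines could look like this (my own sketch; find_corrupted is a hypothetical helper, not part of torchvision):

import os
from PIL import Image

def find_corrupted(root):
    # walk an ImageFolder-style directory and collect unreadable images
    bad = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with Image.open(path) as img:
                    img.convert('RGB')  # force a full decode, not just a header read
            except Exception:
                bad.append(path)
    return bad

for path in find_corrupted('/home/x/train/'):
    print(path)  # review (or delete) these files before training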


(Andrey Ponikar) #5

To solve this particular problem with corrupted images, you can just add two lines before your code:

from PIL import ImageFile
ImageFile.LOAD_TRUNCATED_IMAGES = True

as suggested here. (Note that this flag only lets PIL finish decoding files that are merely truncated; files that are not valid images at all will still raise an error.)


(Tim) #6

Hi, Chen. Does the batch size decrease when using this way to filter invalid data?


(Amogh Mannekote) #7

I wrote a PyTorch library that allows you to do exactly this: drop samples, handle bad items, use Transforms as Filters (and more). Check out nonechucks.

Using nonechucks, your code would look something like this:

dataset = ImageFolder('/home/x/data/pre/train/',
                      transform=transforms.Compose([
                          transforms.Scale(opt.image_size),
                          transforms.RandomCrop(opt.image_size),
                          transforms.ToTensor(),
                          transforms.Normalize([0.5] * 3, [0.5] * 3),
                      ]))

import nonechucks as nc
dataset = nc.SafeDataset(dataset)
dataloader = nc.SafeDataLoader(dataset, opt.batch_size, True,
                               num_workers=opt.workers)

# You can now use `dataloader` as though it was a regular DataLoader without
# having to worry about the bad samples!

Feel free to check out the documentation on the GitHub page!