Questions about DataLoader and Dataset

I run my dataloader like this:

import torch as t
from torchvision import transforms
from torchvision.datasets import ImageFolder

dataset = ImageFolder('/home/x/data/pre/train/',
                      transform=transforms.Compose([transforms.Scale(opt.image_size),
                                                    transforms.RandomCrop(opt.image_size),
                                                    transforms.ToTensor(),
                                                    transforms.Normalize([0.5]*3, [0.5]*3)
                                                   ]))

dataloader = t.utils.data.DataLoader(dataset, opt.batch_size, True, num_workers=opt.workers)

but some images are corrupted, and it raises a PIL error in default_loader: Image.open(path).convert('RGB')

My temporary fix is modifying torchvision/datasets/folder.py, line 65, to:

        try:
            img = self.loader(os.path.join(self.root, path))
        except Exception as e:
            # on a loading error, fall back to a neighbouring index
            index = index - 1 if index > 0 else index + 1
            return self.__getitem__(index)

Is there a better way to solve this, e.g. modifying the DataLoader to load another image when an exception is raised, or writing a new loader?
Or does it work if I simply return None when an exception is caught?

11 Likes

You can look at the default_collate function to see whether it handles None or not.
Or you can give a custom collate function to your DataLoader which will handle None:
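
For example, a minimal sketch of such a collate function (assuming the dataset returns None for samples it failed to load):

from torch.utils.data.dataloader import default_collate

def my_collate(batch):
    # drop the None entries produced by failed loads, then collate the rest
    batch = [sample for sample in batch if sample is not None]
    return default_collate(batch)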

7 Likes

The DataLoader doesn’t support recovery from errors in datasets, because it would be too complex to add while keeping the guarantee that it will always return the batches in the exact same order as the sampler generated them. It’s not even clear what behaviour the user expects (some people might want to be notified about the error, others don’t).

Thanks @smth @apaszke, that gives me a deeper understanding of the DataLoader.

At first I tried:

from PIL import Image
from torch.utils.data.dataloader import default_collate

def my_loader(path):
    try:
        return Image.open(path).convert('RGB')
    except Exception as e:
        print(e)
        # implicitly returns None for corrupted images

def my_collate(batch):
    "Puts each data field into a tensor with outer dimension batch size"
    batch = filter(lambda x: x is not None, batch)
    return default_collate(batch)

dataset = ImageFolder('/home/x/train/',
                      transform=transforms.Compose([transforms.ToTensor()]),
                      loader=my_loader)
dataloader = t.utils.data.DataLoader(dataset, 4, True, collate_fn=my_collate)

It raises an exception, because the transforms in the dataset can’t handle None.

So then I tried this:

def my_collate(batch):
    batch = filter(lambda x: x is not None, batch)
    return default_collate(batch)

class MyImageFolder(ImageFolder):
    __init__ = ImageFolder.__init__
    def __getitem__(self, index):
        try:
            return super(MyImageFolder, self).__getitem__(index)
        except Exception as e:
            print(e)
            # implicitly returns None for samples that fail to load

dataset = MyImageFolder('/home/x/train/', transform=transforms.Compose([transforms.ToTensor(), ...]))
dataloader = t.utils.data.DataLoader(dataset, 4, True, collate_fn=my_collate)

Not very Pythonic, but it works.
And I think the best approach may just be to clean the data.

21 Likes

To solve this particular problem with corrupted images, you can just add two lines before your code:

from PIL import ImageFile
ImageFile.LOAD_TRUNCATED_IMAGES = True

as suggested here

3 Likes

Hi, Chen. Does the batch size decrease when you use this way to filter invalid data?

1 Like

I wrote a PyTorch library that allows you to do exactly this: drop samples, handle bad items, use Transforms as Filters (and more). Check out nonechucks.

Using nonechucks, your code would look something like this:

dataset = ImageFolder('/home/x/data/pre/train/',
                      transform=transforms.Compose([transforms.Scale(opt.image_size),
                                                    transforms.RandomCrop(opt.image_size),
                                                    transforms.ToTensor(),
                                                    transforms.Normalize([0.5]*3, [0.5]*3)
                                                   ]))

import nonechucks as nc
dataset = nc.SafeDataset(dataset)
dataloader = nc.SafeDataLoader(dataset, opt.batch_size, True, num_workers=opt.workers)

# You can now use `dataloader` as though it was a regular DataLoader without
# having to worry about the bad samples!

Feel free to check out the documentation on the Github page!

3 Likes

How do you get the internal function ‘default_collate’? It is within ‘dataloader.py’.

1 Like

It’s imported under that module, just not directly under .data:

import torch
import torch.utils.data
print(torch.utils.data.dataloader.default_collate)

Best regards

Thomas

3 Likes

Note: for Python 3, replace

batch = filter(lambda x : x is not None, batch)

with

batch = list(filter(lambda x : x is not None, batch))

8 Likes

I am getting an error:

samples = collate_fn([dataset[i] for i in batch_indices])
TypeError: 'DataLoader' object does not support indexing

What am I doing wrong?

Hi, I have a similar problem with shared memory. Could you help me solve it? Thanks very much. The topic link is:

This does not work if the batch size is 1, since default_collate will be called with an empty list in that case. The same thing could happen if the batch size is larger and all the data samples in the batch are corrupt (highly unlikely, though).

Is there a way to get this to work when the batch size is only 1?

3 Likes

I faced the same issue. My hot fix was to return a datapoint drawn randomly from the dataset:
so basically, catch the error and return a valid batch with a single datapoint.
It can also happen in cases where batch_size > 1 but the last batch has only 1 datapoint.
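
A minimal sketch of that hot fix (the class name and structure here are illustrative, not from the original post):

import random
from torchvision.datasets import ImageFolder

class RandomReplacementImageFolder(ImageFolder):
    def __getitem__(self, index):
        try:
            return super(RandomReplacementImageFolder, self).__getitem__(index)
        except Exception:
            # on a loading error, fall back to a randomly drawn sample;
            # the recursive indexing retries if the replacement is also corrupt
            return self[random.randrange(len(self))]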

from torch.utils.data.dataloader import default_collate

1 Like

Did you find any fix?

Edit: I found a fix here. It has many other relevant answers as well.

I also found a fix, inspired by here:

import functools
from random import randint

from torch.utils.data.dataloader import default_collate

def collate_fn_replace_corrupted(batch, dataset):
    if batch[0] is None:
        # draw a random replacement and re-check it recursively
        batch = [dataset[randint(0, len(dataset) - 1)]]
        return collate_fn_replace_corrupted(batch, dataset)

    return default_collate(batch)

collate_fn = functools.partial(collate_fn_replace_corrupted, dataset=dataset)
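
Then pass it to your DataLoader, e.g. (a sketch, assuming a batch size of 1 as in the discussion above):

from torch.utils.data import DataLoader

dataloader = DataLoader(dataset, batch_size=1, collate_fn=collate_fn)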
1 Like