Questions about Dataloader and Dataset


(Yun Chen) #1

I run my dataloader like this:

import torch as t
from torchvision import transforms
from torchvision.datasets import ImageFolder

dataset = ImageFolder('/home/x/data/pre/train/',
                      transform=transforms.Compose([
                          transforms.Scale(opt.image_size),
                          transforms.RandomCrop(opt.image_size),
                          transforms.ToTensor(),
                          transforms.Normalize([0.5] * 3, [0.5] * 3),
                      ]))

dataloader = t.utils.data.DataLoader(dataset, opt.batch_size, True,
                                     num_workers=opt.workers)

but some images are corrupted, so it raises an error (a PIL error in default_loader: Image.open(path).convert('RGB')).

My temporary fix is to modify torchvision/datasets/folder.py (line 65) to:

        try:
            img = self.loader(os.path.join(self.root, path))
        except Exception as e:
            # fall back to a neighbouring index when an image fails to load
            index = index - 1 if index > 0 else index + 1
            return self.__getitem__(index)

Is there a better way to solve this, e.g. modifying the code of the DataLoader to load another image on an exception, or writing a new loader? Or does it work if I simply return None when an exception is caught?


#2

You can look at the default_collate function to see whether it handles None or not.
Or you can pass a custom collate function to your DataLoader which will handle None:
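
For example, a minimal sketch of such a collate function (my own illustration, reusing the dataset and opt objects from the original post) could look like this:

from torch.utils.data.dataloader import default_collate

def my_collate(batch):
    # drop samples for which the dataset returned None, then collate as usual
    batch = [sample for sample in batch if sample is not None]
    return default_collate(batch)

dataloader = t.utils.data.DataLoader(dataset, opt.batch_size, True,
                                     collate_fn=my_collate,
                                     num_workers=opt.workers)

Note that batches built this way can come out smaller than opt.batch_size, since filtered-out samples are not replaced (and an all-None batch would still fail inside default_collate).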


(Adam Paszke) #3

The DataLoader doesn’t support recovery from errors in datasets, because it would be too complex to add while keeping the guarantee that batches are always returned in exactly the order the sampler generated. It’s not even clear what behaviour users expect (some people might want to be notified about the error, others might not).


(Yun Chen) #4

Thanks @smth @apaszke, that really gives me a deeper understanding of the DataLoader.

At first I tried:

from PIL import Image
from torch.utils.data.dataloader import default_collate

def my_loader(path):
    try:
        return Image.open(path).convert('RGB')
    except Exception as e:
        print(e)  # returns None for images that fail to load

def my_collate(batch):
    "Puts each data field into a tensor with outer dimension batch size"
    batch = [sample for sample in batch if sample is not None]
    return default_collate(batch)

dataset = ImageFolder('/home/x/train/',
                      transform=transforms.Compose([transforms.ToTensor()]),
                      loader=my_loader)
dataloader = t.utils.data.DataLoader(dataset, 4, True, collate_fn=my_collate)

It raises an exception, because the transforms in the dataset can’t handle None.

So then I tried this:

def my_collate(batch):
    batch = [sample for sample in batch if sample is not None]
    return default_collate(batch)

class MyImageFolder(ImageFolder):
    def __getitem__(self, index):
        try:
            return super(MyImageFolder, self).__getitem__(index)
        except Exception as e:
            print(e)  # returns None when loading or transforming fails

dataset = MyImageFolder('/home/x/train/',
                        transform=transforms.Compose([transforms.ToTensor()]))
dataloader = t.utils.data.DataLoader(dataset, 4, True, collate_fn=my_collate)

Not so Pythonic, but it works.
And I think the best way may be just cleaning the data.
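
For what it’s worth, a one-off scan along those lines could look like this (my own sketch; find_corrupted is a hypothetical helper, not part of torchvision):

import os
from PIL import Image

def find_corrupted(root):
    # walk an ImageFolder-style directory and collect unreadable images
    bad = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with Image.open(path) as img:
                    img.convert('RGB')  # force a full decode, not just a header read
            except Exception:
                bad.append(path)
    return bad

for path in find_corrupted('/home/x/train/'):
    print(path)  # review (or delete) these files before training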


(Andrey Ponikar) #5

To solve this particular problem with corrupted images, you can just add two lines before your code:

from PIL import ImageFile
ImageFile.LOAD_TRUNCATED_IMAGES = True

as suggested here. (Note that this flag only lets PIL finish decoding files that are merely truncated; files that are not valid images at all will still raise an error.)


(Tim) #6

Hi, Chen. Does the batch size decrease when using this way to filter invalid data?


(Amogh Mannekote) #7

I wrote a PyTorch library that allows you to do exactly this: drop samples, handle bad items, use Transforms as Filters (and more). Check out nonechucks.

Using nonechucks, your code would look something like this:

dataset = ImageFolder('/home/x/data/pre/train/',
                      transform=transforms.Compose([
                          transforms.Scale(opt.image_size),
                          transforms.RandomCrop(opt.image_size),
                          transforms.ToTensor(),
                          transforms.Normalize([0.5] * 3, [0.5] * 3),
                      ]))

import nonechucks as nc
dataset = nc.SafeDataset(dataset)
dataloader = nc.SafeDataLoader(dataset, opt.batch_size, True,
                               num_workers=opt.workers)

# You can now use `dataloader` as though it was a regular DataLoader without
# having to worry about the bad samples!

Feel free to check out the documentation on the GitHub page!