Dataloader iteration hangs

dieuroi · January 26, 2018, 1:30pm

I want to train a model on my own dataset, so I implemented a custom dataset and a custom dataloader following the Data Loading and Processing Tutorial. When I run the training process, I cannot get a data batch. The process hangs for a long time and no errors are reported.

Dataset

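import os
import cPickle

import torch
from PIL import Image
import torchvision.transforms as transforms

# BaseDataset comes from the poster's project (its import is not shown in the post).
class CustomDataset(BaseDataset):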
    def initialize(self, opt):
        self.root = opt.dataroot
        print 'self.root: {}'.format(self.root)
        self.imgs = os.listdir(self.root)
        self.imgs = sorted(self.imgs)
        self.fdir = os.path.join(os.path.abspath(os.path.join(self.root,'..')), 'features')
        self.feas = [os.path.join(self.fdir, img) for img in self.imgs]
        self.imgs = [os.path.join(self.root, img) for img in self.imgs]
        self.transform = transforms.Compose([transforms.ToTensor(),])

    def __getitem__(self, index):
        i_path = self.imgs[index]
        f_path = self.feas[index]
        img = Image.open(i_path).convert('RGB')
        img = self.transform(img)
        # Close the file handle promptly instead of leaving it open.
        with open(f_path, 'rb') as f:
            feature = torch.Tensor(cPickle.load(f))
        #feature = torch.Tensor([1,])
        #print 'len(img):{}'.format(img.size())
        #print 'len(feature):{}'.format(feature.size())
        input_dict = {'img': img, 'feature': feature}
        return input_dict

    def __len__(self):
        return len(self.imgs)

    def name(self):
        return 'CustomDataset'

DataLoader

def CreateDataset(opt):
    dataset = None
    from custom_dataset import CustomDataset
    dataset = CustomDataset()
    dataset.initialize(opt)
    return dataset


class CustomDataLoader(BaseDataLoader):
    def name(self):
        return 'CustomDataLoader'


    def initialize(self, opt):
        BaseDataLoader.initialize(self, opt)
        self.dataset = CreateDataset(opt)
        self.dataloader = torch.utils.data.DataLoader(
                self.dataset, 
                batch_size=opt.batchSize, 
                shuffle=not opt.shuffle, 
                num_workers=int(opt.workers))


    def load_data(self):
        return self
    

    def __len__(self):
        return min(len(self.dataset), self.opt.max_dataset_size)
    

    def __iter__(self):
        for i, data in enumerate(self.dataloader):
            print 'custom_dataloader:{}'.format(i)
            if i >= self.opt.max_dataset_size:
                break
            yield data

SimonW · January 26, 2018, 7:21pm

Ah, yeah, there is an issue where dataloader workers can sometimes be killed by a signal or hit an error, but the main process just hangs. Try running with num_workers = 0 to get a proper debug trace. This is (sort of) fixed in https://github.com/pytorch/pytorch/pull/3474 if you are interested.
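
Loading in the main process is what num_workers = 0 amounts to, and the same idea can be applied by hand: index the dataset directly so that any exception shows up with a normal traceback. A minimal sketch, assuming only that the dataset supports len() and indexing; the helper name first_failing_index and the toy BrokenDataset are made up for illustration and are not part of PyTorch.

```python
import traceback


def first_failing_index(dataset, limit=None):
    """Index `dataset` in the main process and report the first sample that raises.

    Returns the failing index, or None if every checked sample loads.
    """
    n = len(dataset) if limit is None else min(len(dataset), limit)
    for i in range(n):
        try:
            dataset[i]
        except Exception:
            # In the main process the full traceback is visible right away.
            traceback.print_exc()
            return i
    return None


class BrokenDataset:
    """Toy stand-in (hypothetical): sample 2 fails the way a corrupt file might."""

    def __len__(self):
        return 4

    def __getitem__(self, index):
        if index == 2:
            raise IOError("cannot read sample %d" % index)
        return index


if __name__ == "__main__":
    print(first_failing_index(BrokenDataset()))
```

This only surfaces Python exceptions. A crash or deadlock that happens only inside worker processes would not show up this way.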

dieuroi · January 27, 2018, 7:45am

Thank you for your solution. After I ran the process with num_workers = 0, the dataloader could be iterated. However, I am still confused about this problem.

SimonW · January 27, 2018, 8:44am

Well, if you set num_workers = 0, you will take a performance hit. It seems that the issue is related to multiprocessing, then.

Could you switch to torch.load and torch.save for your tensors? Pickling tensors with the default pickle module is known to be very slow in some cases.
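
Such a switch could be done once, offline, before training. A minimal sketch, assuming each feature file holds a pickled list or array of numbers; the function name convert_feature and the file names passed to it are illustrative, not something from this thread.

```python
import pickle

import torch


def convert_feature(pickle_path, tensor_path):
    """Re-save one pickled feature file as a tensor written with torch.save."""
    with open(pickle_path, "rb") as f:
        data = pickle.load(f)
    # float32 matches what torch.Tensor(...) produces in the original dataset code.
    tensor = torch.as_tensor(data, dtype=torch.float32)
    torch.save(tensor, tensor_path)
    return tensor
```

After the conversion, __getitem__ would read the feature with torch.load(f_path) instead of unpickling it.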

dieuroi · January 30, 2018, 4:20pm

Thank you for your answer. I reinstalled my OS and the PyTorch requirements over the last few days. I still find that only num_workers = 0 works.
I now use torch.load and torch.save for my tensors. Thank you so much for your help.

SimonW · January 30, 2018, 4:54pm

I suspect that pickle.load (or now torch.load, if you changed to that) is not playing well with multiprocessing somehow. Could you comment that out, put in something like feature=None, and see if you can iterate through the dataset?
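
One way to organize this kind of bisection is to make each loading step switchable, so that a single field can be replaced by a placeholder without editing the rest of __getitem__. A rough sketch; make_sample and the loaders mapping are hypothetical names, and a constant 0 is used as the placeholder on the assumption that the default batch collation may not accept None.

```python
def make_sample(loaders, index, stub=()):
    """Build a sample dict, skipping the loading step for every field named in `stub`.

    `loaders` maps a field name to a function that loads that field for an index.
    """
    sample = {}
    for name, load in loaders.items():
        if name in stub:
            sample[name] = 0  # placeholder; the real loader is never called
        else:
            sample[name] = load(index)
    return sample
```

Calling it with stub={'feature'} and then with stub={'img'} shows which of the two loading steps the hang depends on.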

If you don’t mind, could you try installing PyTorch from the GitHub source? There have been a number of improvements to dataloaders since the last release. Some of them were made specifically to prevent the dataloader from hanging. The instructions in README.md at https://github.com/pytorch/pytorch are not very complicated.

dieuroi · January 30, 2018, 5:04pm

I think pickle.load plays well with multiprocessing while ‘img’ does not. I commented out the image reading as follows:

feature = pickle.load(xx)
img = feature

The dataload process then went well.

SimonW · January 30, 2018, 6:35pm

That’s interesting. Could you try reinstalling/upgrading Pillow?

pniaz20 · January 21, 2022, 8:57pm

This solution also worked for me. My problem was not a custom dataset but a custom image augmentation transform. When I used the standard torchvision.transforms transformations, the data loader was iterable even with num_workers=2. However, when I added my custom transform, iter(trainloader) just froze without any errors.

Thanks for the solution. I did not understand why this issue exists, though. I also did not understand the point about using torch.load and torch.save rather than the default pickle.