DataLoader iteration hangs

I want to train a model on my own dataset, so I implemented a custom dataset and a custom dataloader following the Data Loading and Processing Tutorial. When I run the training process, I cannot get a data batch. The process hangs for a long time and no errors are reported.

Dataset

import os
import pickle

import torch
import torchvision.transforms as transforms
from PIL import Image

# BaseDataset is the project's own base class (import not shown).


class CustomDataset(BaseDataset):
    def initialize(self, opt):
        self.root = opt.dataroot
        print('self.root: {}'.format(self.root))
        self.imgs = sorted(os.listdir(self.root))
        # Features live in a sibling "features" directory and reuse the image file names.
        self.fdir = os.path.join(os.path.abspath(os.path.join(self.root, '..')), 'features')
        self.feas = [os.path.join(self.fdir, img) for img in self.imgs]
        self.imgs = [os.path.join(self.root, img) for img in self.imgs]
        self.transform = transforms.Compose([transforms.ToTensor(),])

    def __getitem__(self, index):
        i_path = self.imgs[index]
        f_path = self.feas[index]
        img = Image.open(i_path).convert('RGB')
        img = self.transform(img)
        # Close the file handle once the pickled feature has been read.
        with open(f_path, 'rb') as f:
            feature = torch.Tensor(pickle.load(f))
        #feature = torch.Tensor([1,])
        #print('len(img):{}'.format(img.size()))
        #print('len(feature):{}'.format(feature.size()))
        input_dict = {'img': img, 'feature': feature}
        return input_dict

    def __len__(self):
        return len(self.imgs)

    def name(self):
        return 'CustomDataset'

DataLoader

def CreateDataset(opt):
    dataset = None
    from custom_dataset import CustomDataset
    dataset = CustomDataset()
    dataset.initialize(opt)
    return dataset


class CustomDataLoader(BaseDataLoader):
    def name(self):
        return 'CustomDataLoader'


    def initialize(self, opt):
        BaseDataLoader.initialize(self, opt)
        self.dataset = CreateDataset(opt)
        self.dataloader = torch.utils.data.DataLoader(
                self.dataset, 
                batch_size=opt.batchSize, 
                shuffle=not opt.shuffle, 
                num_workers=int(opt.workers))


    def load_data(self):
        return self
    

    def __len__(self):
        return min(len(self.dataset), self.opt.max_dataset_size)
    

    def __iter__(self):
        for i, data in enumerate(self.dataloader):
            print('custom_dataloader:{}'.format(i))
            if i >= self.opt.max_dataset_size:
                break
            yield data

Ah, yeah, there is an issue where dataloader workers can sometimes be killed by a signal or hit an error while the main process just hangs. Try running with num_workers = 0 to get a proper debug trace. This is (sort of) fixed in https://github.com/pytorch/pytorch/pull/3474 if you are interested.
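A related check that needs no DataLoader at all is to index the dataset directly in the main process, so any exception is raised where it can be seen. The following is only a sketch: find_failing_indices and ToyDataset are hypothetical names made up for illustration, and ToyDataset stands in for the real CustomDataset.

```python
def find_failing_indices(dataset, limit=None):
    """Index the dataset in the main process and collect (index, error) pairs."""
    failures = []
    n = len(dataset) if limit is None else min(len(dataset), limit)
    for index in range(n):
        try:
            dataset[index]
        except Exception as exc:  # report the failure instead of stopping
            failures.append((index, repr(exc)))
    return failures


class ToyDataset(object):
    """Hypothetical stand-in for a real dataset; sample 2 is deliberately broken."""

    def __len__(self):
        return 4

    def __getitem__(self, index):
        if index == 2:
            raise ValueError('cannot read sample {}'.format(index))
        return {'img': [0.0, 0.0, 0.0], 'feature': [float(index)]}


failures = find_failing_indices(ToyDataset())
print(failures)
```

If this loop finishes cleanly but the DataLoader still hangs with num_workers > 0, the problem is more likely in the worker processes than in __getitem__ itself.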

Thank you for your solution. After I ran the process with num_workers = 0, I could iterate over the dataloader. However, I am still confused about this problem.

Well, if you set num_workers = 0, you will take a performance hit. It seems that the issue is related to multiprocessing, then.

Could you switch to torch.load and torch.save for your tensors? Pickling tensors with the default pickle module is known to be very slow in some cases.
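For reference, a minimal sketch of the torch.save / torch.load round trip suggested above, assuming the feature is already a tensor. The file name feature.pt and the temporary directory are placeholders for illustration, not paths from the original post.

```python
import os
import tempfile

import torch

# Stand-in feature tensor; the real one would come from the feature files.
feature = torch.arange(6, dtype=torch.float32).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp_dir:
    f_path = os.path.join(tmp_dir, 'feature.pt')  # hypothetical file name
    torch.save(feature, f_path)       # write the tensor in PyTorch's own format
    restored = torch.load(f_path)     # read it back as a tensor

print(torch.equal(feature, restored))
```

Inside __getitem__, the pickle call would then become a single torch.load(f_path), provided the feature files were written with torch.save beforehand.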

Thank you for your answer. I have reinstalled my OS and the PyTorch requirements in the last few days, but it still only works with num_workers = 0.
I now use torch.load and torch.save for my tensors. Thank you so much for your help.

I suspect that pickle.load (or now torch.load, if you changed to use that) is somehow not playing well with multiprocessing. Could you comment that out, put in something like feature=None, and see if you can iterate through the dataset?

If you don’t mind, could you try installing PyTorch from the GitHub source? There have been a number of improvements to dataloaders since the last release, and some of them were made specifically to prevent the dataloader from hanging. The instructions in README.md at https://github.com/pytorch/pytorch are not very complicated.

I think pickle.load plays well with multiprocessing, while the ‘img’ loading does not. I commented out the image reading like this:

feature = pickle.load(xx)
img = feature

Data loading then worked fine.

That’s interesting. Could you try reinstalling/upgrading Pillow?

This solution also worked for me. My problem was not a custom dataset, but a custom-defined image augmentation transform. When I used the typical torchvision.transforms transformations, the data loader was iterable even with num_workers=2. However, when I added my custom-made transform, iter(trainloader) just froze without any errors.

Thanks for the solution. I did not understand why this issue exists, though. I also did not understand the point about using torch.load and torch.save rather than the default pickle.
