DataLoader does not work on custom Dataset

Hi, I have a problem with a project I’m developing with PyTorch (autoencoders for anomaly detection).

I have a very large training set of over 400,000 images, each of size (256, 256, 4). To handle it efficiently, I implemented a custom Dataset by extending the corresponding PyTorch class. The images are stored in a folder called DATASET, which contains a folder called “train”, which in turn contains a folder called “clean” with all the training images. The test set is organized the same way: a folder called “test” contains one sub-folder per label, each holding the corresponding images.

The Dataset class appears to work, but the DataLoader built on top of it does not.

This is the implementation of my Dataset:

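import os

import torchvision
from imutils import paths  # assumption: `paths` is imutils.paths, which provides list_images
from PIL import Image
from torch.utils.data import Dataset


class ImageDataset(Dataset):
    def __init__(self, directory, transform=None):
        # Recursively collect the paths of all images under `directory`.
        self.paths = list(paths.list_images(directory))
        self.transform = transform
        # Class names are the sub-folder names; class_to_idx maps each name to an integer.
        self.classes, self.class_to_idx = torchvision.datasets.folder.find_classes(directory)

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        img = Image.open(self.paths[index])
        if self.transform:
            img = self.transform(img)

        # The label is the name of the folder that contains the image.
        label = os.path.basename(os.path.dirname(self.paths[index]))
        class_idx = self.class_to_idx[label]
        return img, class_idx

    def get_classes(self):
        return self.classes

    def get_name(self, index):
        return self.paths[index]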

Here, “directory” is the directory that contains the images, and “self.paths” is a list of paths to all the images inside it. As required, I implemented the __len__ and __getitem__ methods; in __getitem__ I open the image at the path corresponding to “index” using Image.open. I also implemented two other methods that I need for another function.

This is how I get the train directory:

from pathlib import Path

train_dir = Path('DATASET').joinpath('train')

If I try to create the Datasets for the training and the test, apparently it works:

dataset_to_split = ImageDataset(train_dir, transform=ToTensor())
test_dataset = ImageDataset(test_dir, transform=ToTensor())

print("Number of images in the whole train dataset: ", len(dataset_to_split))
print("Classes of the train dataset: ", dataset_to_split.classes, "\n")
print("Number of images in the test dataset: ", len(test_dataset))
print("Classes of the test dataset: ", test_dataset.classes)

I get the following output:

Number of images in the whole train dataset: 435404
Classes of the train dataset: ['clean']

Number of images in the test dataset: 15055
Classes of the test dataset: ['clean', 'dos11', 'dos53', 'scan11', 'scan44']

If then I try to print one element of the dataset like this:

print(dataset_to_split[0])
print("\n\n", test_dataset[15000])

In both cases I correctly get a tuple containing the image tensor and the label (0 in the first case and 4 in the second, which is correct). That’s why I say it appears to work.

But when I create the DataLoader (for example for test_dataset; the issue is exactly the same with the other one), nothing is printed and the kernel hangs indefinitely:

batch_size = 10
test_dataloader = DataLoader(test_dataset, batch_size, shuffle=False, num_workers=os.cpu_count())

And then if I try to print the labels (or any other thing) in this way, it does not do anything:

for batch, label in test_dataloader:
    print(label)

Any idea? I really don’t understand what’s wrong in my code.
Thank you very much

EDIT: I forgot to say that if I try to print the length of test_dataloader:

print(len(test_dataloader))

I get 1506 as output, which is correct since the length of test_dataset is 15055 and the batch size is 10.
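For reference, the arithmetic behind that number can be sketched as follows. The helper num_batches is hypothetical (it is not a PyTorch function); it only mirrors how the number of batches follows from the dataset length, the batch size, and whether the last incomplete batch is dropped.

```python
import math


def num_batches(num_samples, batch_size, drop_last=False):
    """Number of batches needed to cover num_samples items (hypothetical helper)."""
    if drop_last:
        # The final, incomplete batch is discarded.
        return num_samples // batch_size
    # The final batch may be smaller than batch_size but is still counted.
    return math.ceil(num_samples / batch_size)


print(num_batches(15055, 10))
```

With 15055 samples and a batch size of 10, the last batch holds only 5 images, which is why the count is rounded up.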

Try setting num_workers=0, as your environment might have issues using multiprocessing.
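A minimal sketch of this debugging step is shown here. It assumes a small TensorDataset of random tensors as a stand-in for the real image dataset; the shapes and sizes are made up for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data (assumption): 25 random "images" of shape (4, 8, 8) with integer labels.
images = torch.rand(25, 4, 8, 8)
labels = torch.arange(25) % 5
dataset = TensorDataset(images, labels)

# num_workers=0 loads every batch in the main process, so no worker
# subprocesses are started and multiprocessing is taken out of the picture.
loader = DataLoader(dataset, batch_size=10, shuffle=False, num_workers=0)

batch_shapes = []
for batch, label in loader:
    batch_shapes.append(tuple(batch.shape))
    print(batch.shape, label)
```

If the loop runs with num_workers=0 but hangs with a larger value, the Dataset itself is probably fine and the problem lies in how worker processes are started.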


It works! Thank you very much, and sorry for the dumb question :sweat_smile: This is the first project I have developed with PyTorch.

It’s not a dumb question at all, and multiple workers should work. My suggestion was just a debugging step to see if multiprocessing might be at fault. Were you able to use num_workers>0 on this system before? If so, what changed? Also, could you check the latest nightly release to see if the code would still hang?

I’ve never used data loaders before, so I honestly don’t know.

Also, this seems to be related to the fact that I’m using Jupyter Notebook: there are some known problems with using num_workers > 0 in Jupyter. I will try a .py file and let you know if anything changes.
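A rough sketch of what such a .py file could look like is shown here, assuming the hang comes from how worker processes are started. The function name count_batches and the random stand-in data are made up for illustration; the real ImageDataset would take the place of the TensorDataset.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def count_batches(num_workers):
    # Stand-in data (assumption): replace with the real ImageDataset.
    dataset = TensorDataset(
        torch.rand(20, 4, 8, 8),
        torch.zeros(20, dtype=torch.long),
    )
    loader = DataLoader(dataset, batch_size=10, shuffle=False, num_workers=num_workers)
    # Iterate once over the loader and count the batches it yields.
    return sum(1 for _batch, _label in loader)


if __name__ == "__main__":
    # The guard keeps worker processes that re-import this file from
    # running the loading loop themselves.
    print(count_batches(num_workers=2))
```

On platforms where worker processes are started by re-importing the main module, the Dataset class has to be importable from a file and the entry point has to sit behind the guard, which a notebook cell cannot provide in the same way.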
