DataLoader does not work on custom Dataset

Conca_Giulio · August 24, 2023, 10:11pm

Hi, I have a problem with a project I’m developing with PyTorch (autoencoders for anomaly detection).

I have a very large training set of over 400000 images, each of size (256,256,4). To handle it efficiently, I implemented a custom Dataset by extending the corresponding PyTorch class. The images live in a folder called DATASET, which contains a folder called “train”, which in turn contains a folder called “clean” with all the training images. The test set is organized the same way: a folder called “test” contains one subfolder per label, each holding the images for that label.

The Dataset class appears to work, but the corresponding DataLoader does not.

This is the implementation of my Dataset:

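import os

import torchvision
from imutils import paths  # assumed source of paths.list_images
from PIL import Image
from torch.utils.data import Dataset


class ImageDataset(Dataset):
    def __init__(self, directory, transform=None):
        # Paths to every image found under `directory` (searched recursively)
        self.paths = list(paths.list_images(directory))
        self.transform = transform
        # Class names are the subfolder names of `directory`
        self.classes, self.class_to_idx = torchvision.datasets.folder.find_classes(directory)

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        img = Image.open(self.paths[index])
        if self.transform:
            img = self.transform(img)

        # The label is the name of the folder that contains the image
        label = os.path.basename(os.path.dirname(self.paths[index]))
        class_idx = self.class_to_idx[label]
        return img, class_idx

    def get_classes(self):
        return self.classes

    def get_name(self, index):
        return self.paths[index]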

where “directory” is the directory that contains the images and “self.paths” is a list of paths to all the images inside “directory”. As required, I implemented the __len__ method and also __getitem__, where I open the image with Image.open using the path at position “index”. I also implemented two other methods that I need for another function.
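A minimal sketch of how the label index in __getitem__ is derived, using only the standard library. The file name and the `label_for` helper are hypothetical; the folder names are the test classes reported in this thread, and the alphabetical ordering is an assumption about how find_classes assigns indices.

```python
from pathlib import PurePosixPath

# Class folders under DATASET/test, deliberately listed out of order.
class_folders = ["scan44", "clean", "dos53", "dos11", "scan11"]

# Assumption: class indices follow the alphabetical order of the folder names.
classes = sorted(class_folders)
class_to_idx = {name: i for i, name in enumerate(classes)}


def label_for(image_path: str) -> int:
    """Return the class index of an image, using its parent folder name as the label."""
    label = PurePosixPath(image_path).parent.name
    return class_to_idx[label]


# Hypothetical file name, for illustration only.
print(label_for("DATASET/test/scan44/example.png"))  # 4
```

Under that assumption, an image stored in “scan44” maps to index 4 and one stored in “clean” maps to index 0.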

This is how I get the train directory:

train_dir = Path('DATASET').joinpath('train')

If I try to create the Datasets for the training and the test, apparently it works:

dataset_to_split = ImageDataset(train_dir, transform=ToTensor())
test_dataset = ImageDataset(test_dir, transform=ToTensor())

print("Number of images in the whole train dataset: ", len(dataset_to_split))
print("Classes of the train dataset: ", dataset_to_split.classes, "\n")
print("Number of images in the test dataset: ", len(test_dataset))
print("Classes of the test dataset: ", test_dataset.classes)

I get the following output:

Number of images in the whole train dataset: 435404
Classes of the train dataset: ['clean']

Number of images in the test dataset: 15055
Classes of the test dataset: ['clean', 'dos11', 'dos53', 'scan11', 'scan44']

If I then try to print one element of each dataset like this:

print(dataset_to_split[0])
print("\n\n", test_dataset[15000])

I correctly get a tuple containing the tensor (so the image) and the label in both cases (for the labels I get 0 in the first case, 4 in the second, which is correct). That’s why I say that apparently it works.

But when I try to use a DataLoader (for example on test_dataset; the issue is exactly the same with the other one), nothing is printed and the kernel seems to hang forever:

batch_size = 10
test_dataloader = DataLoader(test_dataset, batch_size, shuffle=False, num_workers=os.cpu_count())

If I then try to print the labels (or anything else) like this, nothing happens:

for batch, label in test_dataloader:
    print(label)

Any idea? I really don’t understand what’s wrong with my code.
Thank you very much.

EDIT: I forgot to say that if I try to print the length of test_dataloader:

print(len(test_dataloader))

I get 1506 as output, which is correct since the length of test_dataset is 15055 and the batch size is 10.
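The arithmetic behind that number can be checked with a short sketch. It assumes the DataLoader default of drop_last=False, which keeps the final partial batch.

```python
import math

dataset_len = 15055  # length of test_dataset reported in this thread
batch_size = 10

# With drop_last=False the last, smaller batch is kept, so the number of
# batches is the ceiling of dataset_len / batch_size.
num_batches = math.ceil(dataset_len / batch_size)
print(num_batches)  # 1506

# With drop_last=True the partial batch would be discarded instead.
print(dataset_len // batch_size)  # 1505
```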

ptrblck · August 24, 2023, 10:24pm

Try to set num_workers=0 as your env might have issues using multiprocessing.
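When narrowing down a multiprocessing problem, it can help to record how the environment starts worker processes, since DataLoader workers are regular Python subprocesses. The following is a rough sketch using only the standard library; `describe_env` is a hypothetical helper name, and nothing in this thread establishes which setting, if any, causes the hang.

```python
import multiprocessing as mp
import os
import platform
import sys


def describe_env() -> dict:
    """Collect basic facts about how this interpreter starts worker processes."""
    return {
        "platform": platform.system(),
        "python": sys.version.split()[0],
        "cpu_count": os.cpu_count(),
        # "fork", "spawn" or "forkserver", depending on OS and Python version.
        "start_method": mp.get_start_method(),
    }


for key, value in describe_env().items():
    print(f"{key}: {value}")
```

Posting this output together with the PyTorch version makes it easier for others to reproduce the setup.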

Conca_Giulio · August 24, 2023, 10:35pm

It works! Thank you very much, and sorry for the dumb question. This is the first project I’ve developed with PyTorch.

ptrblck · August 25, 2023, 1:01am

It’s not a dumb question at all, and multiple workers should work. My suggestion was just a debugging step to see if multiprocessing might be at fault. Were you able to use num_workers>0 on this system before? If so, what changed? Also, could you check the latest nightly release to see if the code would still hang?

Conca_Giulio · August 25, 2023, 5:05am

I’ve never tried to use data loaders before, so I don’t really know, to be honest.

Conca_Giulio · August 25, 2023, 5:14am

Also, apparently this is related to the fact that I’m using Jupyter Notebook, and there are some known problems with using num_workers>0 in Jupyter. I will try to use a .py file and let you know if it changes anything.
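Until that is confirmed, one possible stopgap is to fall back to num_workers=0 only when the code runs inside a notebook. This is a sketch, not a fix for the underlying cause: `running_in_jupyter` and `pick_num_workers` are hypothetical helpers, and checking for the ipykernel module is a heuristic rather than an official detection method.

```python
import os
import sys


def running_in_jupyter() -> bool:
    """Heuristic: the ipykernel package is imported inside Jupyter kernels."""
    return "ipykernel" in sys.modules


def pick_num_workers(in_notebook: bool, requested: int) -> int:
    """Use no worker processes in a notebook, otherwise the requested number."""
    return 0 if in_notebook else max(0, requested)


num_workers = pick_num_workers(running_in_jupyter(), os.cpu_count() or 0)
print("num_workers =", num_workers)
```

The result can be passed as num_workers=num_workers to the DataLoader call from the first post. When moving to a .py file, the PyTorch data loading notes also recommend defining the Dataset class at module top level and putting the training code under an `if __name__ == "__main__":` guard on platforms that start workers with spawn.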