Hi, I have a problem with a project I’m developing with PyTorch (autoencoders for anomaly detection).
I have a very large training set composed of over 400000 images, each of size (256,256,4), and in order to handle it in an efficient way I decided to implement a custom Dataset by extending the pytorch corresponding class. The images are contained in a folder called DATASET, which contains another folder called “train”, which contains another folder called “clean” with all the images of the training set. Same for the test set, where there’s a folder called “test”, inside of which there are other folders with images corresponding to specific labels.
The Dataset class apparently works, but when I create the corresponding DataLoader it doesn’t work anymore.
This is the implementation of my Dataset:
```python
import os

from imutils import paths          # paths.list_images
from PIL import Image
import torchvision
from torch.utils.data import Dataset


class ImageDataset(Dataset):
    def __init__(self, directory, transform=None):
        self.paths = list(paths.list_images(directory))
        self.transform = transform
        self.classes, self.class_to_idx = torchvision.datasets.folder.find_classes(directory)

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        img = Image.open(self.paths[index])
        if self.transform:
            img = self.transform(img)
        # The label is the name of the folder that contains the image
        label = os.path.basename(os.path.dirname(self.paths[index]))
        class_idx = self.class_to_idx[label]
        return img, class_idx

    def get_classes(self):
        return self.classes

    def get_name(self, index):
        return self.paths[index]
```
where “directory” is the directory that contains the images, and “self.paths” is a list of paths to all the images inside that directory. As required, I implemented `__len__` and also `__getitem__`, where I open the image with Image.open on the path corresponding to “index”. I also implemented two other methods that I need for another function.
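To make the label lookup in `__getitem__` concrete: it relies only on the name of the image’s parent folder. A stdlib-only sketch of what that expression computes, using a made-up path that mirrors the DATASET/test/&lt;label&gt;/&lt;file&gt; layout:

```python
import os

# Hypothetical path, only for illustration of the folder layout
path = os.path.join("DATASET", "test", "dos11", "img_0001.png")

# Same expression as in __getitem__: the label is the parent folder's name
label = os.path.basename(os.path.dirname(path))
print(label)  # dos11
```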
This is how I get the train directory:
```python
train_dir = Path('DATASET').joinpath('train')
```
If I try to create the Datasets for the training and the test, apparently it works:
```python
dataset_to_split = ImageDataset(train_dir, transform=ToTensor())
test_dataset = ImageDataset(test_dir, transform=ToTensor())

print("Number of images in the whole train dataset: ", len(dataset_to_split))
print("Classes of the train dataset: ", dataset_to_split.classes, "\n")
print("Number of images in the test dataset: ", len(test_dataset))
print("Classes of the test dataset: ", test_dataset.classes)
```
I get the following output:
```
Number of images in the whole train dataset:  435404
Classes of the train dataset:  ['clean']

Number of images in the test dataset:  15055
Classes of the test dataset:  ['clean', 'dos11', 'dos53', 'scan11', 'scan44']
```
If then I try to print one element of the dataset like this:
```python
print(dataset_to_split[0])
print("\n\n", test_dataset[0])
```
I correctly get a tuple containing the tensor (i.e. the image) and the label in both cases (for the labels I get 0 in the first case and 4 in the second, which is correct). That’s why I say that it apparently works.
But when I try to create the DataLoader (for example for test_dataset, but with the other one the issue is exactly the same), nothing is printed and the kernel hangs as if stuck in an infinite loop:
```python
batch_size = 10
test_dataloader = DataLoader(test_dataset, batch_size, shuffle=False, num_workers=os.cpu_count())
```
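One way to narrow the problem down would be to take both the custom dataset and multiprocessing out of the picture: a stand-in TensorDataset (not my real test_dataset, just fake tensors of the same shape) with num_workers=0, so all loading happens in the main process. If a sketch like this iterates fine but the real loader with num_workers=os.cpu_count() hangs, the issue is likely in worker spawning rather than in the Dataset itself:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset: 25 fake "images" with the same shape as the real ones
fake_images = torch.zeros(25, 4, 256, 256)
fake_labels = torch.zeros(25, dtype=torch.long)
stand_in = TensorDataset(fake_images, fake_labels)

# num_workers=0 loads batches in the main process (no worker subprocesses)
loader = DataLoader(stand_in, batch_size=10, shuffle=False, num_workers=0)

for batch, label in loader:
    print(batch.shape, label.shape)
```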
And if I then try to print the labels (or anything else) like this, nothing happens either:
```python
for batch, label in test_dataloader:
    print(label)
```
Any idea? I really don’t understand what’s wrong in my code.
Thank you very much
EDIT: I forgot to say that if I print the length of test_dataloader, I get 1506 as output, which is correct since the length of test_dataset is 15055.
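That batch count checks out arithmetically: with a batch size of 10, the loader reports ceil(15055 / 10) batches, the last one holding only 5 images:

```python
import math

num_images = 15055
batch_size = 10

# len(DataLoader) with drop_last=False is ceil(len(dataset) / batch_size)
num_batches = math.ceil(num_images / batch_size)
print(num_batches)              # 1506
print(num_images % batch_size)  # 5 images in the final, partial batch
```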