Hi, I have a problem with a project I’m developing with PyTorch (autoencoders for anomaly detection).
I have a very large training set of over 400,000 images, each of size (256, 256, 4). To handle it efficiently, I decided to implement a custom Dataset by extending the corresponding PyTorch class. The images live in a folder called DATASET, which contains a folder called "train", which in turn contains a folder called "clean" with all the training images. The test set is organized the same way: a folder called "test" contains one subfolder per label.
The Dataset class apparently works, but when I create the corresponding DataLoader it no longer does.
This is the implementation of my Dataset:
import os
import torchvision
from imutils import paths  # paths.list_images comes from the imutils package
from PIL import Image
from torch.utils.data import Dataset

class ImageDataset(Dataset):
    def __init__(self, directory, transform=None):
        # All image paths under "directory", recursively
        self.paths = list(paths.list_images(directory))
        self.transform = transform
        # Class names and their mapping to integer indices, from the subfolder names
        self.classes, self.class_to_idx = torchvision.datasets.folder.find_classes(directory)

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        img = Image.open(self.paths[index])
        if self.transform:
            img = self.transform(img)
        # The label is the name of the image's parent folder
        label = os.path.basename(os.path.dirname(self.paths[index]))
        class_idx = self.class_to_idx[label]
        return img, class_idx

    def get_classes(self):
        return self.classes

    def get_name(self, index):
        return self.paths[index]
where "directory" is the folder that contains the images and "self.paths" is a list of paths to all the images inside it. As required, I implemented __len__ and __getitem__; in the latter I open the image with Image.open at the path corresponding to "index". I also implemented two other methods that I need for another function.
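For instance, the label lookup in __getitem__ just takes the name of the image's parent folder (the file name below is made up for illustration, following the DATASET/train/clean layout described above):

```python
import os

# Hypothetical path in the DATASET/train/clean layout described above
path = os.path.join("DATASET", "train", "clean", "img_000001.png")

# Same logic as in __getitem__: the label is the parent folder's name
label = os.path.basename(os.path.dirname(path))
print(label)  # -> clean
```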
This is how I get the train directory:
train_dir = Path('DATASET').joinpath('train')
If I try to create the Datasets for the training and the test, apparently it works:
dataset_to_split = ImageDataset(train_dir, transform=ToTensor())
test_dataset = ImageDataset(test_dir, transform=ToTensor())
print("Number of images in the whole train dataset: ", len(dataset_to_split))
print("Classes of the train dataset: ", dataset_to_split.classes, "\n")
print("Number of images in the test dataset: ", len(test_dataset))
print("Classes of the test dataset: ", test_dataset.classes)
I get the following output:
Number of images in the whole train dataset: 435404
Classes of the train dataset: ['clean']
Number of images in the test dataset: 15055
Classes of the test dataset: ['clean', 'dos11', 'dos53', 'scan11', 'scan44']
If then I try to print one element of the dataset like this:
print(dataset_to_split[0])
print("\n\n", test_dataset[15000])
In both cases I correctly get a tuple containing the tensor (i.e. the image) and the label (0 in the first case, 4 in the second, which is correct). That’s why I say that it apparently works.
But when I create the DataLoader (here with test_dataset, but the issue is exactly the same with the other one) and try to use it, nothing is printed and the kernel hangs indefinitely:
batch_size = 10
test_dataloader = DataLoader(test_dataset, batch_size, shuffle=False, num_workers=os.cpu_count())
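In case it matters, this is the pattern I mean, shown here with a throwaway dummy dataset instead of my real one (the sizes are made up; num_workers=0 would keep all loading in the main process, with no worker subprocesses):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Tiny stand-in for test_dataset: 15 fake 4-channel "images" with dummy labels
dummy = TensorDataset(torch.randn(15, 4, 8, 8), torch.zeros(15, dtype=torch.long))

# num_workers=0 loads batches in the main process, so no multiprocessing is involved
loader = DataLoader(dummy, batch_size=10, shuffle=False, num_workers=0)

for batch, label in loader:
    print(batch.shape, label.shape)
```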
And then if I try to print the labels (or anything else) like this, nothing happens:
for batch, label in test_dataloader:
    print(label)
Any idea? I really don’t understand what’s wrong in my code.
Thank you very much
EDIT: I forgot to say that if I print the length of test_dataloader:
print(len(test_dataloader))
I get 1506 as output, which is correct since the length of test_dataset is 15055 and the batch size is 10.
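That number matches the usual relation len(dataloader) = ceil(len(dataset) / batch_size) when the last, smaller batch is kept:

```python
import math

# Number of batches the DataLoader reports: ceil(dataset_size / batch_size)
num_batches = math.ceil(15055 / 10)
print(num_batches)  # -> 1506
```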