Data loader for a large dataset with multiple folders of images and labels

I am new to PyTorch and have a small issue with creating data loaders for huge datasets. I have a folder “/train” with two subfolders, “/images” and “/labels”. Each of these contains 1000 folders, and each of those holds 1000 images or 1000 labels. I am new to creating custom data loaders. After reading the PyTorch documentation I was able to write the class below, but since the dataset is huge (350 GB), my code will not work. Can someone please point me in the right direction?

import numpy as np
import os 
from torch.utils.data import Dataset
from PIL import Image

class CustomDataset(Dataset):
    def __init__(self, data_root):
        self.data_root = data_root
        self.images = []
        self.labels = []
        # This eagerly loads every sample into memory, which cannot work for 350 GB.
        self.filenames = os.listdir(self.data_root)
        for filename in self.filenames:
            data = np.load(os.path.join(self.data_root, filename))
            images = data["images"]
            labels = data["labels"]
            for image, label in zip(images, labels):
                self.images.append(Image.fromarray(image))
                self.labels.append(label)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        return self.images[index], self.labels[index]

You could lazily load the data by defining the paths etc. in the __init__ method and moving the actual loading to the __getitem__ method, using the already defined paths and the index.
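
For example, a minimal sketch of that idea, assuming each file under data_root is an .npz archive holding a fixed number of samples under the "images" and "labels" keys (as in the code above):

import os

import numpy as np
from PIL import Image
from torch.utils.data import Dataset


class LazyNpzDataset(Dataset):
    def __init__(self, data_root, samples_per_file=1000):
        self.data_root = data_root
        self.samples_per_file = samples_per_file
        # Only keep the file paths in memory; nothing is loaded yet.
        self.files = sorted(
            os.path.join(data_root, f) for f in os.listdir(data_root)
        )

    def __len__(self):
        return len(self.files) * self.samples_per_file

    def __getitem__(self, index):
        # Map the flat index to (which file, which sample inside that file).
        file_idx, sample_idx = divmod(index, self.samples_per_file)
        data = np.load(self.files[file_idx])
        image = Image.fromarray(data["images"][sample_idx])
        label = data["labels"][sample_idx]
        return image, label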

I imagine I would then need a for loop inside the __getitem__ method to go through all the folders that contain the images, or two index values: one to select a folder and one to select an image inside it.
I thought of another approach: run a script that creates a .txt file containing the full path to every image along with its label.
After that, I only need a single index value to fetch the image whose path is on line number index of the .txt file.
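
For the second approach, a script along these lines could build the index file (a sketch; the exact folder layout and the label file format are assumptions):

import os

# Assumed layout: train/images/<folder>/<name>.png and
# train/labels/<folder>/<name>.npy hold corresponding samples.
image_root = os.path.join("train", "images")
label_root = os.path.join("train", "labels")

with open("train_index.txt", "w") as f:
    for folder in sorted(os.listdir(image_root)):
        for fname in sorted(os.listdir(os.path.join(image_root, folder))):
            image_path = os.path.join(image_root, folder, fname)
            # Label naming is an assumption; adjust it to your data.
            stem = os.path.splitext(fname)[0]
            label_path = os.path.join(label_root, folder, stem + ".npy")
            f.write(image_path + "\t" + label_path + "\n")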


The second approach sounds like the right one and would be the usual workflow.
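
Building on the index file above, a lazy Dataset could then look roughly like this (again a sketch; loading the image with PIL and the label with np.load is an assumption about your file formats):

import numpy as np
from PIL import Image
from torch.utils.data import Dataset


class IndexFileDataset(Dataset):
    def __init__(self, index_file, transform=None):
        # Keep only the path pairs in memory; the data itself stays on disk.
        with open(index_file) as f:
            self.samples = [line.strip().split("\t") for line in f if line.strip()]
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        image_path, label_path = self.samples[index]
        image = Image.open(image_path).convert("RGB")
        label = np.load(label_path)
        if self.transform is not None:
            image = self.transform(image)
        return image, label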
