Dataset class loading multiple data files

The problem: I have images that I’ve loaded and stored as numpy arrays. The dataset is quite big, so I had to split it into several files and load one file at a time. I’ve tried to create my own dataset class as follows:

class my_Dataset(Dataset):
    # Characterizes a dataset for PyTorch
    def __init__(self, folder_dataset, transform=None):
        # xs, ys will be name of the files of the data
        self.xs = []
        self.ys = []
        self.transform = transform

        # Open and load text file including the whole training data
        with open(folder_dataset + 'data.txt') as f:
            for line in f:
                self.xs.append(folder_dataset+line.split()[0])
                self.ys.append(folder_dataset + line.split()[1])

        # pick a random of these (sub)files of the dataset
        file_ID = np.random.randint(1, len(self.xs))

        numpy_data = np.load('x_imgs_ID_' + str(file_ID) + '.npy')
        numpy_data = np.moveaxis(numpy_data, [3], [1])
        numpy_target = np.load('y_imgs_ID' + str(file_ID) + '.npy')

        # make numpy arrays to tensors
        self.data = torch.from_numpy(numpy_data).float()
        self.target = torch.from_numpy(numpy_target).long()

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        # Generates one sample of data, done with pytorch's own index?
        single_x = self.data[index]
        single_y = self.target[index]

        return single_x,single_y

I’m new to PyTorch and deep learning in general, so I’m trying to learn. Am I doing this correctly in the dataset class? The np.random part in particular feels wrong to me. I’ve read a bunch of posts and tutorials to understand how this works, but I still find it difficult to grasp.

Any feedback on this is very much appreciated!

/Dino

It looks like you are only loading a single sample in __init__, so the Dataset will only ever return this one randomly picked sample.

Since you have a lot of files (and each sample is stored in its own file), I would suggest lazily loading each sample in __getitem__.
This method should therefore contain the logic to load a single sample given the file paths and an index.
You don’t have to shuffle the file paths manually (or pick a sample randomly), as the DataLoader will do it for you.
The __init__ method of your Dataset then only needs to store all file paths.
Here is some dummy code:

class my_Dataset(Dataset):
    # Characterizes a dataset for PyTorch
    def __init__(self, data_paths, target_paths, transform=None):
        self.data_paths = data_paths
        self.target_paths = target_paths
        self.transform = transform

    def __len__(self):
        return len(self.data_paths)

    def __getitem__(self, index):
        x = torch.from_numpy(np.load(self.data_paths[index]))
        y = torch.from_numpy(np.load(self.target_paths[index]))
        if self.transform:
            x = self.transform(x)

        return x, y

In this example you would have to create the lists of data and target paths beforehand and pass them as arguments to the Dataset.
Once this is done, just wrap the Dataset in a DataLoader, which can create batches and, e.g., shuffle your data:

loader = DataLoader(
    dataset,
    batch_size=5,
    shuffle=True,
    num_workers=2
)
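
To make the pieces above concrete, here is a minimal end-to-end sketch. It assumes each sample really is stored in its own .npy file; the file naming scheme (x_0.npy, y_0.npy, ...), the temporary directory and the class name PerFileDataset are made up for illustration, and random arrays stand in for real images.

```python
import os
import tempfile

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class PerFileDataset(Dataset):
    """Lazily loads one sample per .npy file (hypothetical file layout)."""

    def __init__(self, data_paths, target_paths, transform=None):
        assert len(data_paths) == len(target_paths)
        self.data_paths = data_paths
        self.target_paths = target_paths
        self.transform = transform

    def __len__(self):
        return len(self.data_paths)

    def __getitem__(self, index):
        # Nothing is read from disk until a sample is requested.
        x = torch.from_numpy(np.load(self.data_paths[index])).float()
        y = torch.from_numpy(np.load(self.target_paths[index])).long()
        if self.transform:
            x = self.transform(x)
        return x, y


# Write a few dummy samples so the sketch runs on its own.
root = tempfile.mkdtemp()
num_samples = 12
for i in range(num_samples):
    image = np.random.rand(3, 32, 32).astype(np.float32)
    np.save(os.path.join(root, f"x_{i}.npy"), image)
    np.save(os.path.join(root, f"y_{i}.npy"), np.array(i % 10))

# Build the two path lists up front; index i must refer to the same sample in both.
data_paths = [os.path.join(root, f"x_{i}.npy") for i in range(num_samples)]
target_paths = [os.path.join(root, f"y_{i}.npy") for i in range(num_samples)]

dataset = PerFileDataset(data_paths, target_paths)
# num_workers=0 keeps the sketch runnable as a plain script on every platform.
loader = DataLoader(dataset, batch_size=5, shuffle=True, num_workers=0)

batch_shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in loader]
```

With 12 samples and batch_size=5 the loader yields three batches, the last one smaller than the others.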

Also, the Data Loading Tutorial is a good starting point for more information.

Thank you for your reply, ptrblck. I tried something similar to this at first but couldn’t get it to work. I have the dataset split into several ‘.npy’ files, and in __init__ I am trying to read an arbitrary one of these files. Each ‘.npy’ file holds about 5000 images, so its numpy array has shape (5000, 3, 32, 32): RGB images of 32x32 pixels.

When I tried something similar to your code, PyTorch produced batches of shape (mini_batches, 5000, 3, 32, 32). So each item in a mini-batch was an entire file, which already contains 5000 images.

Sorry for being unclear: __init__ loads one file, which is a subset of the dataset and contains about 5000 images.

Thanks for the information! I misunderstood the data loading and thought each file was saved separately.
Do you still have the large file containing all images?
If so, and if it’s saved as a binary file using numpy, you could load only certain parts of this file with np.memmap.
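
As a rough sketch of that idea: if the big array was written with np.save, then np.load(..., mmap_mode="r") returns an np.memmap, and only the slices you index are read from disk. The file name and shape below are placeholders, and a small array stands in for the large one. (np.memmap called directly expects a raw binary file, so there you would pass dtype and shape yourself.)

```python
import os
import tempfile

import numpy as np

# Stand-in for the large file; the name and shape are placeholders.
path = os.path.join(tempfile.mkdtemp(), "all_images.npy")
all_images = np.arange(100 * 3 * 32 * 32, dtype=np.float32).reshape(100, 3, 32, 32)
np.save(path, all_images)

# mmap_mode="r" maps the file read-only instead of loading it all into memory.
images = np.load(path, mmap_mode="r")

# Indexing touches only this slice; np.array(...) copies it into regular memory.
chunk = np.array(images[10:15])
```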

Thank you. I tried np.memmap and that works!

I also tried the perhaps more standard way from the tutorial you linked, which I had looked at before. Is this the most common way of loading data for large datasets: keeping all the images together with an annotation file, like a csv file, that you can load into memory?

This is what I ended up with when following that tutorial, but it still seems I might be missing something. It feels like there should be a cleaner way to write __getitem__. Perhaps you could give it a quick look and see if I’m on the right path:

class my_Dataset(Dataset):
    # Characterizes a dataset for PyTorch
    def __init__(self, root_dir, csv_file, transform=None):
        ''' Args:
                csv_file (string): Path to csv file with annotations.
                root_dir (string): Directory with all the images.
                transform (optional): transforms to be applied on a sample.
        '''
        self.root_dir = root_dir
        self.annotations = pd.read_csv(csv_file + '.csv')
        self.transform = transform

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, index):
        img_name = os.path.join(self.root_dir, self.annotations.iloc[index].name)
        image = io.imread(img_name)
        image = np.moveaxis(np.asarray(image), [2], [0])
        image = torch.from_numpy(image)
        y_label = torch.from_numpy(self.annotations.iloc[index, :].values)

        if self.transform:
            image = self.transform(image)

        return image, y_label

I’ve seen you answer a lot of questions and I just want to say that I appreciate you helping us all out. It really means a lot.

The code looks fine!
You could use PIL or PIL-SIMD to load the images, as the torchvision transformations (e.g. random cropping) are defined for PIL.Image objects.
If you don’t need image data augmentation, you are good to go.

I’m facing a similar situation, except that I do not have the larger file containing all the images. My data is split across several numpy files, and I cannot load and concatenate them because of memory issues. Can you suggest a solution or some pointers for this use case?

Thanks in advance

If each numpy file contains multiple images, the Dataset could open only a single numpy array and index it until all of its samples have been used. Once this array is exhausted, you would open the next one and assign it to a class attribute to keep it open.

Shuffling the order of the numpy arrays would work, but shuffling on the “image level” could yield bad performance, since your code would be opening and closing the numpy arrays many times.
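
One possible sketch of this logic, under a few assumptions: every data file has shape (N_i, C, H, W) with a matching target file of length N_i, and samples are read mostly in order. The class name MultiNpyDataset and the file names are hypothetical. A cumulative sum of the per-file sizes translates a global index into a file and a local index, and only the most recently used file is kept in memory.

```python
import bisect
import os
import tempfile

import numpy as np
import torch
from torch.utils.data import Dataset


class MultiNpyDataset(Dataset):
    """Serves samples from several .npy files, caching one file at a time."""

    def __init__(self, data_paths, target_paths):
        self.data_paths = data_paths
        self.target_paths = target_paths
        # Memory-mapping gives the shape without loading the array itself.
        sizes = [np.load(p, mmap_mode="r").shape[0] for p in data_paths]
        # cumulative[i] = total number of samples in files 0..i
        self.cumulative = np.cumsum(sizes).tolist()
        self.current_file = None
        self.data = None
        self.target = None

    def __len__(self):
        return self.cumulative[-1]

    def __getitem__(self, index):
        # Find the file whose index range contains the global index.
        file_idx = bisect.bisect_right(self.cumulative, index)
        start = 0 if file_idx == 0 else self.cumulative[file_idx - 1]
        if file_idx != self.current_file:
            # Replace the cached arrays so the previous file can be freed.
            self.data = np.load(self.data_paths[file_idx])
            self.target = np.load(self.target_paths[file_idx])
            self.current_file = file_idx
        local = index - start
        x = torch.from_numpy(self.data[local]).float()
        y = torch.tensor(self.target[local]).long()
        return x, y


# Dummy files with 4, 3 and 5 samples; each target equals its global index.
root = tempfile.mkdtemp()
data_paths, target_paths = [], []
offset = 0
for i, n in enumerate([4, 3, 5]):
    data_paths.append(os.path.join(root, f"x_{i}.npy"))
    target_paths.append(os.path.join(root, f"y_{i}.npy"))
    np.save(data_paths[-1], np.random.rand(n, 3, 32, 32).astype(np.float32))
    np.save(target_paths[-1], np.arange(offset, offset + n))
    offset += n

dataset = MultiNpyDataset(data_paths, target_paths)
```

With DataLoader(dataset, shuffle=False) each file is loaded once per epoch. With shuffle=True the cache would be replaced on almost every sample, which is the performance problem described above. Also note that with num_workers > 0 each worker holds its own copy of the Dataset and therefore its own cached file.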

I’m currently using something like this:

for epoch in range(1):
    # One Dataset and DataLoader per pair of .npy files
    for img_path, emb_path in zip(img_emb_path, text_emb_path):
        train_dataset = CustomDataset(img_path, emb_path)
        train_data = DataLoader(train_dataset, batch_size=64, shuffle=True)
        for i, (inputs, labels) in tqdm(enumerate(train_data)):
            a = a + 1
        print('single npy processed')

My memory seems to fill up after just two passes of the inner loop over the DataLoader. Any suggestions for this?