Dataset creation from multiple image files

Hi, I’m new to the PyTorch world, so I need some help.

I need to create a dataset that loads multiple files stored on disk (35 GB in total).

Each file contains a NumPy array of image pixels with shape (5000, 256, 256, 3).

How can I create a dataset that loads this data?

I’m sorry for the lack of information, but I really don’t know where to start.

I have read something about Dataset and DataLoader.

The data loading tutorial gives you a good overview of how to write a custom Dataset and use the DataLoader with it.
In the default use case, the Dataset loads and processes a single sample in the __getitem__ method using the passed index, and initializes e.g. the data paths in its __init__ method.
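As a minimal sketch of that default pattern (the path list and label source here are made-up placeholders, not something from your setup):

import imageio
import torch
from torch.utils.data import Dataset

class LazyImageDataset(Dataset):
    def __init__(self, image_paths, labels, transform=None):
        # Store only the file paths and labels; no image is loaded yet.
        self.image_paths = image_paths
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, index):
        # Load and process a single sample using the passed index.
        image = imageio.imread(self.image_paths[index])
        label = torch.tensor(self.labels[index])
        if self.transform:
            image = self.transform(image)
        return image, label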

Since your files already store multiple samples each, you could still use this lazy loading approach and preload the next file once the current one doesn’t contain enough samples anymore.
The workflow would thus be (see the sketch after this list):

  • load file0 with 5000 samples and keep it as an attribute
  • create batches of data from this file until it’s empty or the number of remaining samples is smaller than the batch size
  • load the next file and repeat until all files have been used.
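A rough sketch of this chunked loading, assuming the arrays were saved with np.save and every file holds the same number of samples (the class and its names are made up for illustration):

import numpy as np
import torch
from torch.utils.data import Dataset

class ChunkedNumpyDataset(Dataset):
    def __init__(self, file_paths, samples_per_file=5000):
        self.file_paths = file_paths
        self.samples_per_file = samples_per_file
        self.current_file_idx = None
        self.current_data = None

    def __len__(self):
        return len(self.file_paths) * self.samples_per_file

    def __getitem__(self, index):
        # Map the global index to a (file, sample-within-file) pair.
        file_idx, sample_idx = divmod(index, self.samples_per_file)
        # Only touch the disk when the index crosses a file boundary.
        if file_idx != self.current_file_idx:
            self.current_data = np.load(self.file_paths[file_idx])
            self.current_file_idx = file_idx
        return torch.from_numpy(self.current_data[sample_idx])

With shuffle=False in the DataLoader, each file would then be loaded from disk exactly once per epoch.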

The shortcoming of this approach is that you wouldn’t be able to shuffle the data easily.
I.e. if you are using the passed index to decide whether to load the next data file, a shuffled index could trigger constant file swaps, which would yield very bad performance. However, once a single file is loaded, you could create a lookup table with shuffled indices to at least shuffle the samples within each file.
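Inside the sketch above, that per-file shuffling could look roughly like this (again just an illustration; these methods would replace the sketch’s __getitem__):

    def _load_file(self, file_idx):
        self.current_data = np.load(self.file_paths[file_idx])
        self.current_file_idx = file_idx
        # Shuffle the sample order of the freshly loaded file only.
        self.lookup = np.random.permutation(len(self.current_data))

    def __getitem__(self, index):
        file_idx, sample_idx = divmod(index, self.samples_per_file)
        if file_idx != self.current_file_idx:
            self._load_file(file_idx)
        # Indirect through the per-file lookup table.
        return torch.from_numpy(self.current_data[self.lookup[sample_idx]])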

Thank you for your reply! I read the tutorial and I have a question.
I also have a CSV file containing, for each row, the path to the image and the related label.
In the tutorial they say

We will read the csv in __init__ but leave the reading of images to __getitem__. This is memory efficient because all the images are not stored in the memory at once but read as required.

Do you think it’s better (in terms of memory and performance) for me to use the tutorial’s approach instead of loading the large files?

The CSV approach would be easier to implement and would allow complete dataset shuffling, so I would recommend using it. :wink:


Yes!
Here’s my code:

import os

import imageio
import pandas as pd
import torch
from torch.utils.data import Dataset


class AffWild2Dataset(Dataset):
    def __init__(self, flag, transform=None):
        self.flag = flag

        # The base directory depends on the requested split.
        root = '/Volumes/Orsetto/TESI Sentiment Analysis/code/pre_processing_AffWild2/Aff-Wild2/Aff-Wild2_ready/faces_detected_OK'
        if flag == 'train':
            self.base_dir = os.path.join(root, 'train_set')
            csv_path = os.path.join(self.base_dir, 'train_set.csv')
        else:
            self.base_dir = os.path.join(root, 'validation_set')
            csv_path = os.path.join(self.base_dir, 'validation_set.csv')

        # Only the CSV is read here; the images are loaded lazily in __getitem__.
        self.emotion_frame = pd.read_csv(csv_path, sep=';', encoding='utf8')
        self.transform = transform

    def __len__(self):
        return len(self.emotion_frame)

    def __getitem__(self, index):
        # The first CSV column holds the image path relative to the split directory.
        img_path = self.emotion_frame.iloc[index, 0]
        fp = '%s%s' % (self.base_dir, img_path)

        # Load the image and scale the pixel values to [0, 1].
        img_array = imageio.imread(fp).astype(float)
        img_array /= 255

        y_label = torch.tensor(self.emotion_frame['label'].values[index])

        sample = {'face': img_array, 'label': y_label}

        if self.transform:
            sample = self.transform(sample)

        return sample
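For reference, the dataset could then be wrapped in a DataLoader like this (the batch size and worker count are arbitrary choices, not requirements):

from torch.utils.data import DataLoader

dataset = AffWild2Dataset(flag='train')
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=2)

for batch in loader:
    # The default collate function stacks the dict entries,
    # so faces has shape [batch_size, 256, 256, 3].
    faces, labels = batch['face'], batch['label']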