Unable to understand the process of generating data in parallel

Hi guys,
I am trying to generate data in parallel using this tutorial. I am using it for an image classification problem, so my image data is in JPEG format.
I have a very basic question: in the code below, the data is loaded with X = torch.load('data/' + ID + '.pt'). Is this some sort of pickling of the data, or is it the general notation for data of every kind?
How do I convert my image data to this format? (Also, is this a better approach or not?)

def __getitem__(self, index):
    'Generates one sample of data'
    # Select sample
    ID = self.list_IDs[index]
    # Load data and get label
    X = torch.load('data/' + ID + '.pt')
    y = self.labels[ID]

    return X, y

It depends a bit on your use case.
You could save some data loading time if you preload the data (e.g. images), preprocess all of it, and store it as binary data. However, since JPEG images are compressed, while binary data usually is not, the stored files could take up more space on your device.
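To make the conversion step concrete, here is a minimal sketch of how JPEG files could be decoded once and stored as tensors. The helper names (jpeg_to_tensor, convert_folder) and the folder layout are assumptions for illustration, not part of the tutorial. torch.save and torch.load serialize objects with Python's pickle by default, and .pt is only a conventional file extension.

```python
from pathlib import Path

import numpy as np
import torch
from PIL import Image


def jpeg_to_tensor(path):
    """Decode one image file into a uint8 tensor of shape (C, H, W)."""
    with Image.open(path) as img:
        # np.array makes a writable (H, W, 3) uint8 copy of the pixels.
        array = np.array(img.convert("RGB"))
    return torch.from_numpy(array).permute(2, 0, 1).contiguous()


def convert_folder(src_dir, dst_dir):
    """Save every *.jpg in src_dir as <stem>.pt in dst_dir (hypothetical layout).

    Returns the list of sample IDs (file stems) that were written.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    ids = []
    for jpg in sorted(src_dir.glob("*.jpg")):
        torch.save(jpeg_to_tensor(jpg), dst_dir / f"{jpg.stem}.pt")
        ids.append(jpg.stem)
    return ids
```

The returned IDs could then serve as list_IDs, so that torch.load('data/' + ID + '.pt') finds each file. Storing uint8 rather than float32 keeps the files four times smaller, although they will still usually be larger than the original JPEGs.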
Also, to augment the data on the fly during training with PIL.Image transforms, you would have to convert the samples back to images and then back to tensors again, so the initial benefit might no longer be visible.
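For comparison, the alternative is to keep the JPEG files and decode them lazily inside __getitem__, so that augmentation happens while the sample is still a PIL image. The following is only a sketch under stated assumptions: the class name JpegDataset, its arguments, and the choice of a random horizontal flip as the augmentation are all hypothetical.

```python
import random

import numpy as np
import torch
from PIL import Image, ImageOps
from torch.utils.data import Dataset


class JpegDataset(Dataset):
    """Hypothetical dataset that decodes one JPEG file per index."""

    def __init__(self, paths, labels, augment=True):
        self.paths = list(paths)    # file paths of the JPEG images
        self.labels = list(labels)  # one class label per path
        self.augment = augment

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        # Decode the file; convert() returns an independent in-memory image.
        with Image.open(self.paths[index]) as img:
            img = img.convert("RGB")
        # Augment while the sample is still a PIL image.
        if self.augment and random.random() < 0.5:
            img = ImageOps.mirror(img)  # horizontal flip
        # Convert to a float32 tensor of shape (C, H, W) scaled to [0, 1].
        array = np.array(img, dtype=np.float32) / 255.0
        X = torch.from_numpy(array).permute(2, 0, 1).contiguous()
        return X, self.labels[index]
```

Wrapping such a dataset in torch.utils.data.DataLoader with num_workers greater than 0 spreads the decoding and augmentation across worker processes. Note that the default batching expects all images to have the same size, so a resize step would be needed for mixed-size data.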
