Following the instructions of several tutorials on custom dataloaders for images, the standard way to do it looks like this: (source)
import torch
from torch.utils import data

class Dataset(data.Dataset):
    def __init__(self, list_IDs, labels, transform=None):
        self.labels = labels
        self.list_IDs = list_IDs
        self.transform = transform

    def __len__(self):
        return len(self.list_IDs)

    def __getitem__(self, index):
        # select sample
        ID = self.list_IDs[index]
        # load data and get label
        X = torch.load(f"path{ID}.pt")
        if self.transform:
            X = self.transform(X)
        # torch.tensor, not torch.Tensor: Tensor(scalar) would allocate an
        # uninitialized tensor of that size instead of wrapping the value
        y = torch.tensor(self.labels[ID])
        return X, y
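For reference, I consume this Dataset through a standard DataLoader, roughly like this (a minimal sketch; the batch size and num_workers values are placeholders, not necessarily my exact settings):

training_set = Dataset(list_IDs, labels, transform=transform)
training_loader = data.DataLoader(
    training_set,
    batch_size=64,   # placeholder value
    shuffle=True,    # reshuffles the sample order every epoch
    num_workers=4,   # placeholder; on Windows this loop needs an if __name__ == "__main__": guard
)

for X, y in training_loader:
    X, y = X.to(device), y.to(device)
    # training step here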
When I implement it this way, my training loop is extremely slow: my CPU usage sits at 100% while my GPU usage is ~2% (that's bad).
So I created a new way of loading the data, where I save the batches beforehand and load one whole batch at a time in the training loop (avoiding torch's Dataset and DataLoader):
for train_file in train_files:  # train_file is a whole batch of 64 images saved as a tensor
    # the second argument of torch.load is map_location, so the batch lands on the GPU directly
    X = torch.load(image_path + f"\\{train_file}", device)
    X = transform(X)
    y = torch.load(label_path + f"\\{train_file}", device)
    # do the training stuff here
Doing it the second way is ~7x faster than the first, common way (and it doesn't destroy my CPU).
Disadvantage: I can't shuffle every epoch. I shuffled all data points once, when I created and saved these batches, but at train time it is not possible to shuffle individual data points; it is only possible to rearrange the order of the batches, as sketched below (at the moment I don't know how much this affects performance).
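Concretely, the best I can do right now is shuffle the order of the batch files each epoch and permute the samples within each loaded batch; a sketch of what I mean (num_epochs is a placeholder, and note this only reorders, it never mixes samples across batches):

import random

for epoch in range(num_epochs):
    random.shuffle(train_files)  # rearrange the order of the pre-saved batches
    for train_file in train_files:
        X = torch.load(image_path + f"\\{train_file}", device)
        X = transform(X)
        y = torch.load(label_path + f"\\{train_file}", device)
        perm = torch.randperm(X.size(0), device=device)  # permute within the batch
        X, y = X[perm], y[perm]
        # training step here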
I'm now a bit confused: is my new approach okay, or do I have a bug in the previous, usual approach that is causing the long data-loading times? How bad is the lack of shuffling before each epoch? Of course, it would be more elegant to use dataloaders, but I have not yet found a way to make them run anywhere near as fast as my new, second approach.
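One idea I have considered but not properly tested (so treat this as a sketch; BatchDataset is a hypothetical helper, not something from the tutorials): wrap the pre-saved batch files in a Dataset that returns one whole batch per item and pass batch_size=None to the DataLoader, which disables automatic batching, so worker processes could prefetch whole batches in the background:

class BatchDataset(data.Dataset):  # hypothetical helper
    def __init__(self, train_files):
        self.train_files = train_files

    def __len__(self):
        return len(self.train_files)

    def __getitem__(self, index):
        name = self.train_files[index]
        # load on the CPU here; worker processes should not touch the GPU
        X = torch.load(image_path + f"\\{name}")
        y = torch.load(label_path + f"\\{name}")
        return X, y

# batch_size=None disables automatic batching: each item already is a batch
batch_loader = data.DataLoader(BatchDataset(train_files), batch_size=None,
                               shuffle=True, num_workers=2)

for X, y in batch_loader:
    X, y = X.to(device), y.to(device)
    X = transform(X)
    # training step here

This would still only shuffle whole batches, not individual samples, but it would at least keep the fast batch-level loading inside the usual DataLoader machinery.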