Unusual approach to loading data leads to significantly better performance

Following the instructions of multiple tutorials on custom dataloaders for images, the way to do it is the following: (source)

import torch
from torch.utils import data


class Dataset(data.Dataset):
    def __init__(self, list_IDs, labels, transform=None):
        self.labels = labels
        self.list_IDs = list_IDs
        self.transform = transform

    def __len__(self):
        return len(self.list_IDs)

    def __getitem__(self, index):
        # Select sample
        ID = self.list_IDs[index]
        # Load data and get label
        X = torch.load(f"path{ID}.pt")
        if self.transform:
            X = self.transform(X)
        y = torch.Tensor(self.labels[ID])
        return X, y
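
For completeness, this is roughly how the class gets wrapped in a DataLoader in my training script (list_IDs, labels and transform are placeholders for what I actually pass in):

training_set = Dataset(list_IDs, labels, transform=transform)
training_generator = data.DataLoader(training_set, batch_size=64, shuffle=True)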

When I implement it this way, my training loop is extremely slow: my CPU usage is at 100% and my GPU usage at ~2% (that's bad).
So I created a new way of loading the data, where I save the batches beforehand and load one whole batch at a time in the training loop (avoiding torch's Dataset and DataLoader):

for train_file in train_files:  # train_file is a whole batch of 64 images saved as a tensor
    X = torch.load(image_path + f"\\{train_file}", device)  # second positional arg is map_location
    X = transform(X)
    y = torch.load(r"label_path" + f"\\{train_file}", device)

    # Do the training stuff here

Doing it the second way is ~7x faster than the first, common way (and it doesn't destroy my CPU).
Disadvantage: I can't shuffle every epoch. I shuffled all data points when I created these batches and saved them locally, but at train time it is not possible to shuffle individual data points; I can only rearrange the order of the batches (at the moment I don't know how this affects performance).
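
Roughly, the only shuffling that is still possible looks like this (num_epochs is just a placeholder; the loop body is the one from above):

import random

for epoch in range(num_epochs):
    random.shuffle(train_files)  # reorders the saved batch files, not the samples inside them
    for train_file in train_files:
        # load X and y and do the training stuff as above
        ...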

I’m now a bit confused: is my new approach okay, or do I have a bug in the previous, usual approach that is causing the long data-loading times? How bad is the lack of shuffling before each epoch? Of course, it would be more elegant to use DataLoaders; however, I have not yet found a way to make them run anywhere near as fast as my new, second approach.

The transform seems to be running on the GPU in your second method, which could explain the better speed. You could implement a buffer that loads more than one batch and shuffles among them. It's not equivalent to a true permutation over the dataset, but it's better than nothing (that's what TensorFlow does with the shuffle buffer in tf.data.Dataset).
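
A rough sketch of such a buffer, reusing the train_files, image_path, label_path and device names from your second snippet (buffer_size is a made-up knob):

import random
import torch

def buffered_batches(train_files, buffer_size=4, batch_size=64):
    # Pool several pre-saved batch files, shuffle the pooled samples,
    # then re-split them into batches.
    random.shuffle(train_files)  # shuffle the file order first
    for i in range(0, len(train_files), buffer_size):
        chunk = train_files[i:i + buffer_size]
        X = torch.cat([torch.load(image_path + f"\\{f}", device) for f in chunk])
        y = torch.cat([torch.load(r"label_path" + f"\\{f}", device) for f in chunk])
        perm = torch.randperm(X.size(0), device=X.device)  # shuffle within the buffer
        X, y = X[perm], y[perm]
        for j in range(0, X.size(0), batch_size):
            yield X[j:j + batch_size], y[j:j + batch_size]

The larger the buffer, the closer this gets to a full shuffle, at the cost of memory.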

How many dataloader workers are you using with the first method?

Not really, I transfer the data to the GPU in both methods (snippet of the first method's training loop):

for batch, labels in training_generator:
    # Transfer to GPU
    batch, labels = batch.to(device), labels.to(device)

You mean num_workers, right? I use the default (AFAIK that is 0); otherwise it doesn't load at all.

num_workers=0 means your training stops and waits for the next data batch to be fetched, preprocessed, etc. before proceeding further. This is obviously slow and reserved for debugging.
Using workers should work without trouble; you should investigate that.
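
As a rough sketch (training_set is your Dataset instance; batch_size and num_workers are example values, and pin_memory/non_blocking are optional extras rather than something from your snippets):

from torch.utils import data

training_generator = data.DataLoader(training_set,
                                     batch_size=64,
                                     shuffle=True,
                                     num_workers=4,    # fetch and preprocess batches in background processes
                                     pin_memory=True)  # faster host-to-GPU copies

for batch, labels in training_generator:
    # non_blocking only pays off together with pin_memory=True
    batch, labels = batch.to(device, non_blocking=True), labels.to(device, non_blocking=True)

Also, the backslashes in your paths suggest you are on Windows: there, num_workers > 0 requires the training code to run under an if __name__ == "__main__": guard, which is a common reason why the loader appears not to load at all.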