Sorry if the title is unclear; formulating a short one for this question is a bit tricky.
I’m working on an existing project, and I’m trying to figure out whether the data loading actually runs in parallel or whether the workers sit idle.
Let’s assume the data is loaded the following way:
```python
class Dataset(torch.utils.data.Dataset):
    def __init__(self, ...):
        self.minibatches = [
            ...  # Load list of minibatch indices with batch size 16
        ]

    def __getitem__(self, index):
        return self.minibatches[index]


def custom_collate_fn(minibatch):
    data = []
    for i in range(len(minibatch)):
        data.append(Load(minibatch[i]))
    ...  # code to pad and convert to tensor
    return data


train_dataset = Dataset(...)
training_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                              batch_size=1,
                                              num_workers=8,
                                              collate_fn=custom_collate_fn)
```
Because of that for loop, the loading within a single minibatch can’t run in parallel, right? Even if it feels fast at a batch size of 16, aren’t we losing potential speedup over a whole epoch on a big dataset?
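For what it’s worth, one way I found to check where the collate loop actually executes is to call `torch.utils.data.get_worker_info()` inside the collate function. This is a minimal, self-contained sketch of the same setup (names like `IndexDataset` and the `float()` stand-in for `Load()` are my own placeholders, not the project’s code):

```python
import torch
from torch.utils.data import Dataset, DataLoader, get_worker_info


class IndexDataset(Dataset):
    """Hypothetical stand-in: each item is a 'minibatch' of 16 indices."""

    def __init__(self, n_batches=4, batch_size=16):
        self.minibatches = [list(range(i * batch_size, (i + 1) * batch_size))
                            for i in range(n_batches)]

    def __len__(self):
        return len(self.minibatches)

    def __getitem__(self, index):
        return self.minibatches[index]


def collate_with_worker_id(batch):
    # With batch_size=1, `batch` is a list holding one minibatch of indices.
    info = get_worker_info()                     # None in the main process
    worker_id = info.id if info is not None else None
    minibatch = batch[0]
    # Sequential loop, as in the original collate_fn; Load() replaced by float().
    data = [float(i) for i in minibatch]
    return torch.tensor(data), worker_id


def seen_worker_ids(num_workers=2):
    """Return the sorted set of worker ids that executed the collate_fn."""
    loader = DataLoader(IndexDataset(), batch_size=1,
                        num_workers=num_workers,
                        collate_fn=collate_with_worker_id)
    return sorted({worker_id for _, worker_id in loader})


if __name__ == "__main__":
    # With num_workers > 0, collation happens inside the worker processes,
    # so distinct worker ids show up here.
    print(seen_worker_ids(2))
```

If the ids printed are the workers’ (not `None`), that would confirm the collate function, loop included, runs inside each worker process, so different minibatches are at least loaded concurrently even though the items within one minibatch are loaded sequentially.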