Parallelization when using a batch size of 1 with each item being a list

Hello,

Sorry if the title is unclear; formulating a short one for my question is a bit tricky.
I’m working on an existing project and I’m trying to figure out whether the data loading part actually runs in parallel or whether some workers sit idle.
Let’s assume the data is loaded the following way:

class Dataset(torch.utils.data.Dataset):
    def __init__(...):
        self.minibatches = [
            ...  # Load list of minibatch indices with batch size 16
        ]

    def __len__(self):
        # Needed by the DataLoader's default sampler
        return len(self.minibatches)

    def __getitem__(self, index):
        return self.minibatches[index]
		
def custom_collate_fn(batch):
    # With batch_size=1, `batch` is a list holding a single minibatch (the list of 16 indices)
    minibatch = batch[0]
    data = []
    for i in range(len(minibatch)):
        data.append(Load(minibatch[i]))
        ...  # code to pad and convert to tensor ...
    return data
  
train_dataset = Dataset(...)
training_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=1, num_workers=8, collate_fn=custom_collate_fn)

Data loading can’t possibly run in parallel because of that for loop, right? Even though it seems fast with a batch size of 16, we’re losing potential speedup over a whole epoch on a big dataset, right?

Turns out I had a wrong understanding of how PyTorch handles the workers: they run in parallel, with each worker building a separate batch; they don’t fetch samples in parallel to fill a single batch. So with batch_size=1 and num_workers=8, eight minibatches (including the for loop inside the collate_fn) are prepared concurrently in separate worker processes.
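
For anyone who wants to check this themselves, here is a minimal sketch (the toy dataset and the stand-in for Load() are hypothetical) that prints which worker builds each batch via torch.utils.data.get_worker_info(). Each printed line corresponds to one whole minibatch being assembled inside a single worker, and different batches land on different workers.

import torch
from torch.utils.data import Dataset, DataLoader, get_worker_info

class IndexListDataset(Dataset):
    # Toy dataset: each item is a pre-built list of 16 sample indices
    def __init__(self, num_minibatches=8):
        self.minibatches = [list(range(i * 16, (i + 1) * 16)) for i in range(num_minibatches)]

    def __len__(self):
        return len(self.minibatches)

    def __getitem__(self, index):
        return self.minibatches[index]

def logging_collate_fn(batch):
    # With batch_size=1, `batch` holds a single minibatch; this whole loop runs in one worker
    minibatch = batch[0]
    info = get_worker_info()
    worker_id = info.id if info is not None else "main process"
    print(f"worker {worker_id} is building a batch from {len(minibatch)} samples")
    # Stand-in for Load() + padding: just turn the indices into a tensor
    return torch.tensor(minibatch)

if __name__ == "__main__":
    loader = DataLoader(IndexListDataset(), batch_size=1, num_workers=4,
                        collate_fn=logging_collate_fn)
    for batch in loader:
        pass  # the printed worker ids show separate workers preparing separate batches

The loop inside the collate_fn is still sequential within one worker, but with several workers, several of those loops run at the same time in separate processes.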