Hi, I have a custom dataloader in which I explicitly use Python's multiprocessing (a Pool with 8 worker processes) to parallelize data preprocessing. I wanted to know how that affects my torch.utils.data.DataLoader call: should the num_workers argument be set to 8, or can I leave it at 0?
My custom loader looks like this:
    from multiprocessing import Pool

    def processParams(params):
        # <some operations on params>
        return params

    def processParamsParallel(params, pool):
        results = pool.map(processParams, params)
        return results

    class DataLoader(object):
        def __init__(self, params, maxId):
            self.params = params
            self.id = 0
            self.maxId = maxId
            self.pool = Pool(processes=8)

        def __iter__(self):
            self.id = 0  # reset at the start of each epoch
            while self.id < self.maxId:
                results = processParamsParallel(self.params, self.pool)
                self.id += 1
                yield results
It’s a very rough example of what I am trying to do.
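To make the pattern concrete, here is a minimal self-contained version that actually runs. The squaring step and the params list are hypothetical stand-ins for my real preprocessing; the point is just that pool.map fans the work out across worker processes and preserves input order:

```python
from multiprocessing import Pool

def process_params(p):
    # hypothetical stand-in for <some operations on params>
    return p * p

class PoolLoader:
    def __init__(self, params, max_id, workers=8):
        self.params = params
        self.max_id = max_id
        self.workers = workers

    def __iter__(self):
        # one Pool per epoch; the worker processes do the preprocessing
        with Pool(processes=self.workers) as pool:
            for _ in range(self.max_id):
                # pool.map preserves the order of the inputs
                yield pool.map(process_params, self.params)

if __name__ == "__main__":
    loader = PoolLoader([1, 2, 3], max_id=2, workers=2)
    print(list(loader))  # [[1, 4, 9], [1, 4, 9]]
```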
Now, in the torch call:

    dl = DataLoader(params, 50)
    dl_torch = torch.utils.data.DataLoader(dl, num_workers=<what_here?>, prefetch_factor=<what_here?>)
On a similar note, how will prefetch_factor be affected, given that the workers are spawned not by the torch call but by the custom loader itself?
From the torch.utils.data docs:

    prefetch_factor – Number of samples loaded in advance by each worker. 2 means there will be a total of 2 * num_workers samples prefetched across all workers. (default: 2)
Thank you in advance!