At the beginning of the DataLoader loop, all workers are spawned, each initializes its copy of the Dataset, and they start loading their batches. Depending on how expensive the Dataset.__init__ method is, this alone can already take some time (especially if you are loading and processing data there). Afterwards, each worker loads batch_size samples by calling Dataset.__getitem__ batch_size times and creates the batch via the collate_fn.
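To illustrate this point (a minimal sketch with dummy data; the sizes and worker counts are made up), each worker of a map-style DataLoader assembles a complete batch on its own. You can see this by tagging every sample with the worker id returned by torch.utils.data.get_worker_info():

```python
import torch
from torch.utils.data import Dataset, DataLoader, get_worker_info

class TaggedDataset(Dataset):
    """Dummy dataset that records which worker fetched each sample."""
    def __init__(self, size=64):
        self.data = torch.arange(size, dtype=torch.float32)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        info = get_worker_info()
        worker_id = info.id if info is not None else -1  # -1 in the main process
        return self.data[idx], worker_id

if __name__ == "__main__":
    loader = DataLoader(TaggedDataset(), batch_size=8, num_workers=2)
    for samples, worker_ids in loader:
        # every sample in a given batch was loaded by the same worker
        assert worker_ids.unique().numel() == 1
```

Each batch's worker_ids tensor contains a single value, confirming that one worker produces the whole batch rather than each worker contributing a slice of it.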
You should check where the bottleneck in your code is and which part takes the majority of the time.
E.g. loading data from a mounted network drive would most likely be the bottleneck, and you should consider copying the data to the local node. Without knowing more about your use case I can only speculate and would recommend profiling the code.
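A quick way to get this split (a rough sketch; the dataset and model here are placeholders for your own) is to time the wait for each batch separately from the forward/backward pass inside the loop:

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# placeholder data and model; substitute your own
dataset = TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,)))
loader = DataLoader(dataset, batch_size=32)
model = torch.nn.Linear(10, 2)

load_time, compute_time = 0.0, 0.0
t0 = time.perf_counter()
for inputs, targets in loader:
    t1 = time.perf_counter()
    load_time += t1 - t0          # time spent waiting for the next batch
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    loss.backward()
    t0 = time.perf_counter()
    compute_time += t0 - t1       # forward/backward time

print(f"loading: {load_time:.4f}s, compute: {compute_time:.4f}s")
```

Note that if the model runs on the GPU, you would need to call torch.cuda.synchronize() before taking each timestamp, since CUDA kernels are launched asynchronously.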
My dataset code loads the whole dataset into RAM during __init__ and then applies transformations in __getitem__. Since __getitem__ only runs when the DataLoader loop is actually executing (as opposed to when the DataLoader iterator is created), I believe the cause of my problem is the __init__ part of the dataset, which only loads the data and does basic normalization. I'll try moving the normalization into __getitem__ to reduce that time. Besides this, are there other ways to solve it? (For example, is there a way to spawn the workers and run Dataset.__init__ before the actual DataLoader iterator is created, to reduce the startup time?)
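The refactor described above could look like this (a sketch; the tensor shapes and statistics are made up): keep only the raw load and cheap scalar statistics in __init__, and normalize lazily per sample so the work is spread across the workers during iteration:

```python
import torch
from torch.utils.data import Dataset

class LazyNormDataset(Dataset):
    """Loads raw data once, normalizes per sample on access."""
    def __init__(self, data):
        self.data = data          # raw data kept in RAM
        self.mean = data.mean()   # cheap scalar statistics
        self.std = data.std()

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # normalization happens here, inside the workers,
        # instead of once up front in __init__
        return (self.data[idx] - self.mean) / self.std

raw = torch.randn(100, 3) * 5 + 2
ds = LazyNormDataset(raw)
sample = ds[0]
```

The total work is the same, but the expensive part now overlaps with training instead of blocking iterator creation.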
I also have some questions:
I thought that since the dataset is created before the DataLoader iterator, no time would be spent on the dataset during the creation of the iterator. Is there a way to create multiple copies of the same dataset in advance (before creating the DataLoader iterator, or while the current iteration is running, i.e. preparing for the next epoch's iterator) so that this time can be reduced?
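One existing option that addresses part of this (assuming you are on PyTorch 1.7 or newer, where the argument was added) is persistent_workers=True, which keeps the worker processes and their dataset copies alive between epochs, so the worker startup cost is paid only on the first epoch. A minimal sketch with dummy data:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(64, 10))

# workers are spawned on the first iteration and then reused for
# every following epoch instead of being recreated each time
loader = DataLoader(dataset, batch_size=16, num_workers=2,
                    persistent_workers=True)

if __name__ == "__main__":
    for epoch in range(3):
        for (batch,) in loader:
            pass  # training step here
```

Note that persistent_workers requires num_workers > 0, and that the very first iterator creation still pays the full spawn and __init__ cost.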
I thought that when I set num_workers to (for example) 4 with a batch size of 32, each worker would load 8 images so that 32 images in total are assembled by the collate_fn. However, your response seems to imply that each worker builds its own (whole) batch and then supplies it to the GPU for training. If so, am I right in thinking that increasing num_workers will at some point stop decreasing the data loading time, since eventually batches are produced faster than the GPU can consume them?