Creating an iterator for `DataLoader` takes too much time

I have noticed that considerable time is required when `enumerate(loader)` is called, as the code snippet and time points below show:

Code snippet:

    start_time = time.time()
    scaler = torch.cuda.amp.GradScaler()
    for epoch in range(start_epoch, args.epochs):
        print(f"rank {args.rank} set epoch start : {time.time() - start_time}")
        sampler.set_epoch(epoch)
        print(f"rank {args.rank} set epoch end : {time.time() - start_time}")
        for step, ((y1, y2), _) in enumerate(loader, start=epoch * len(loader)):
            print(f"rank {args.rank} step start : {time.time() - start_time}")
            y1 = y1.cuda(gpu, non_blocking=True)
            y2 = y2.cuda(gpu, non_blocking=True)

The time taken was:

    rank 0 set epoch start : 4.7206878662109375e-05
    rank 0 set epoch end : 8.702278137207031e-05
    rank 0 step start : 72.13791012763977
    {"epoch": 0, "step": 0, "lr_weights": 0.0, "lr_biases": 0.0, "loss": 13827.8291015625, "time": 82}
    rank 0 step start : 82.55937671661377
    {"epoch": 0, "step": 1, "lr_weights": 3.765060240963856e-06, "lr_biases": 9.036144578313253e-08, "loss": 12604.126953125, "time": 83}
    rank 0 step start : 83.55196452140808

As the time points above show, it takes about 70 seconds for the iterator to be created, while each iteration (i.e. each step) takes only about 1 second.

This delay in creating the iterator wastes about 30% of the total computation time, during which the GPU sits idle.

My questions are:

  1. Is this normal behavior?
  2. What could be the cause of this?
  3. Is there a way to remedy this?

P.S. One potentially pertinent piece of information: the batch size is 32, with DataLoader `num_workers = 8`.

Thank you in advance for any of your insights and replies :slight_smile:

At the beginning of the `DataLoader` loop, all workers will spawn, initialize their copies of the `Dataset`, and start loading their batches. Depending on how expensive the `Dataset.__init__` method is, this could already take some time (especially if you are loading and processing data there). Afterwards, each worker will load `batch_size` samples by calling into `Dataset.__getitem__` `batch_size` times and create the batch via the `collate_fn`.
You should check where the bottleneck in your code is and which part takes the majority of the time.
E.g. data loading from a mounted network drive would most likely be the bottleneck, and you should consider copying the data to the node. Without knowing more about your use case I can only speculate and would recommend profiling the code.
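
As a rough sketch (this reuses only the `loader` from your snippet; nothing else is assumed), you could time the iterator creation separately from the first batch fetch to see where the ~70 seconds are spent:

    import time

    t0 = time.time()
    it = iter(loader)   # worker processes are started here
    t1 = time.time()
    batch = next(it)    # blocks until the first batch has been produced
    t2 = time.time()

    print(f"iterator creation: {t1 - t0:.2f}s")  # worker startup / dataset setup
    print(f"first batch:       {t2 - t1:.2f}s")  # loading and collating the first batch

If the first number dominates, the worker startup (and whatever your dataset does when it is set up in the workers) is the issue; if the second dominates, the per-sample loading itself is slow.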

Thank you for the reply, @ptrblck!

My dataset code loads the whole dataset into RAM during `__init__` and then applies transformations in `__getitem__`. Since `__getitem__` only runs while the DataLoader loop is actually iterating (as opposed to when the DataLoader iterator is created), I believe the cause of my problem is the `__init__` part of the dataset, which only loads the data and does basic normalization. I'll try moving the normalization into `__getitem__` to reduce the time. Along with this, are there other ways to solve it, e.g. is there a way to spawn the workers and run the dataset's `__init__` before the DataLoader iterator is created, to reduce the time?
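
One idea I had: would keeping the worker processes alive across epochs with `persistent_workers` avoid paying the startup cost on every epoch after the first? A rough sketch of what I mean, reusing my existing `dataset` and `sampler` (the other arguments just mirror my current settings):

    from torch.utils.data import DataLoader

    loader = DataLoader(
        dataset,
        batch_size=32,
        num_workers=8,
        sampler=sampler,
        pin_memory=True,
        persistent_workers=True,  # keep workers (and their dataset copies) alive between epochs
        prefetch_factor=2,        # batches pre-loaded per worker (this is the default value)
    )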

I also have some questions:

  1. I thought that, since the dataset is created before the DataLoader iterator, no dataset-related time would be spent during the iterator's creation. Is there a way to create multiple copies of the same dataset in advance (before creating the DataLoader iterator, or while the current iteration is running, i.e. preparing for the next epoch's iterator), so that this time can be reduced?
  2. I thought that when I set num_workers to (for example) 4 with batch size 32, each worker would load 8 images so that 32 images in total are assembled in `collate_fn`. However, your response seems to imply that each worker builds its own whole batch and then supplies it for GPU training (see the sketch below). If so, am I right in thinking that increasing num_workers will at some point stop reducing the data loading time, namely once batches are produced faster than the GPU can consume them?
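
A toy sketch for question 2 (the `ToyDataset` is purely illustrative; it only shows which worker builds which batch, via `torch.utils.data.get_worker_info`):

    from torch.utils.data import DataLoader, Dataset, get_worker_info

    class ToyDataset(Dataset):
        # Illustrative dataset: each sample reports the worker that loaded it.
        def __len__(self):
            return 64

        def __getitem__(self, idx):
            info = get_worker_info()                      # None in the main process
            worker_id = -1 if info is None else info.id
            return idx, worker_id

    if __name__ == "__main__":
        loader = DataLoader(ToyDataset(), batch_size=8, num_workers=4)
        for indices, worker_ids in loader:
            # Every sample of a given batch should come from the same worker: each worker
            # calls __getitem__ batch_size times and runs collate_fn for its own batch.
            print(worker_ids.tolist(), indices.tolist())

If each printed batch shows a single worker id, that would confirm that every worker assembles complete batches on its own rather than contributing a fraction of each batch.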