Time overhead in mini-batch loading

I am training two models, D and C, simultaneously on a training dataset. However, instead of using a train loader to create all mini-batches in one go at the beginning of each epoch (let’s call it vanilla loading), I am fetching one mini-batch at each iteration. This is because I am using a WeightedRandomSampler whose sample weights are updated at each iteration.

The issue is that I am facing a huge time overhead compared to vanilla loading. I used torch.cuda.Event to time the function executions (How to measure time in PyTorch) but am not able to figure out where the overhead is coming from. Any help will be deeply appreciated.

    epoch_tic = torch.cuda.Event(enable_timing=True)
    epoch_toc = torch.cuda.Event(enable_timing=True)
    epoch_tic.record()

    for epoch in range(1, epochs+1):

        #Other Timer Initialisations 

        iterations = int(len(train_dataset)/batch_size)+1

        iterations_time_tic = torch.cuda.Event(enable_timing=True)
        iterations_time_toc = torch.cuda.Event(enable_timing=True)
        iterations_time_tic.record()

        for iter in range(iterations):

            tic = torch.cuda.Event(enable_timing=True)
            toc = torch.cuda.Event(enable_timing=True)
            tic.record()

            train_loader = torch.utils.data.DataLoader(
                train_dataset, batch_size=batch_size, num_workers=1, pin_memory=True,
                sampler=WeightedRandomSampler(sample_weights, batch_size, replacement=True))

            toc.record()
            torch.cuda.synchronize()

            dataload_Time += tic.elapsed_time(toc)/1000

            for batch_idx, (data, target, data_idx) in enumerate(train_loader):

                # Training Models D and C and time them
                # Update Sample Weights at each iteration

                ........

            .........

        iterations_time_toc.record()
        torch.cuda.synchronize()

        iterations_time.append(iterations_time_tic.elapsed_time(iterations_time_toc)/1000)

.............

    epoch_toc.record()
    torch.cuda.synchronize()

    epoch_time.append(epoch_tic.elapsed_time(epoch_toc)/1000)
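
Note that only the DataLoader construction is timed above; the actual batch fetch happens lazily inside the enumerate loop and is not captured separately. A rough sketch of how that fetch could be timed on its own (fetch_time here is a hypothetical accumulator, not something in my current code):

    import time

    fetch_time = 0.0
    t0 = time.perf_counter()
    for batch_idx, (data, target, data_idx) in enumerate(train_loader):
        # Time spent waiting for this batch to be loaded and collated
        fetch_time += time.perf_counter() - t0

        # ... train D and C on (data, target) as before ...

        t0 = time.perf_counter()  # reset before fetching the next batch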

And these are the execution time values I obtained.
Note 1. All the averaged values shown were averaged across the total number of iterations.
Note 2. If I revert to vanilla loading, all the numbers add up and the overhead in iterations_time seen here disappears.

Dataloader_time:  [0.0663442611694336]
Avg_Dataloader_time:  [0.00014145897903930404]
TrainD_time:  [8.034557580947876]
TrainC_time:  [13.494966745376587]
Avg_TrainD_time:  [0.01713125283784195]
Avg_TrainC_time:  [0.028773916301442617]
Validation_time:  [8.723237037658691]
update_weights_time:  [0.00391077995300293]
Avg_update_weights_time:  [8.338550006402836e-06]
before_train_time:  [0.00018644332885742188]
after_train_before_validation:  [0.0005524158477783203]
after_validation_before_test:  [0.0006937980651855469]
iterations_time:  [74.6968047618866]
iterations_print_time:  [1.430511474609375e-06]
Test_time:  []
epoch_time:  [83.42318558692932]

By recreating the DataLoader in each iteration you are resetting the workers and would need to load the new batch(es) from the beginning. I would assume you should see the same time using the main process to load the data (num_workers=0), as the overhead of the DataLoader creation should be small.
However, depending on the size of your dataset, the creation of the WeightedRandomSampler might add some overhead, so you could profile it for your use case.
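
For example, something along these lines could separate the sampler creation cost from the DataLoader construction (a rough sketch; both calls run on the CPU, so a wall-clock timer such as time.perf_counter is sufficient, and the accumulator names are just placeholders):

    import time
    from torch.utils.data import DataLoader, WeightedRandomSampler

    sampler_create_time = 0.0
    loader_create_time = 0.0

    # Time the sampler creation on its own ...
    t0 = time.perf_counter()
    sampler = WeightedRandomSampler(sample_weights, batch_size, replacement=True)
    sampler_create_time += time.perf_counter() - t0

    # ... and the DataLoader construction separately.
    t0 = time.perf_counter()
    train_loader = DataLoader(train_dataset, batch_size=batch_size,
                              num_workers=0, pin_memory=True, sampler=sampler)
    loader_create_time += time.perf_counter() - t0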

Hi @ptrblck. Thank you very much for the reply. I did try out the changes you recommended and obtained the following results:

  1. Execution Times when number of workers = 1:
Dataloader_time:  [0.08958374404907224]
Avg_Dataloader_time:  [0.0001910101152432244]
TrainD_time:  [23.295507202148457]
TrainC_time:  [29.29996321105957]
Avg_TrainD_time:  [0.04967059104935705]
Avg_TrainC_time:  [0.062473269106736826]
Validation_time:  [12.500955078125]
update_weights_time:  [0.0004887039959430695]
Avg_update_weights_time:  [1.0420127845268006e-06]
before_train_time:  [0.0003477120101451874]
after_train_before_validation:  [0.0010117119550704956]
after_validation_before_test:  [0.0006152639985084533]
iterations_time:  [93.8879296875]
iterations_print_time:  [3.0880000442266465e-05]
Test_time:  []
epoch_time:  [106.3928203125]
  2. Execution Times when number of workers = 0:
Dataloader_time:  [0.041077375993132596]
Avg_Dataloader_time:  [8.758502343951514e-05]
TrainD_time:  [16.117475570678714]
TrainC_time:  [20.818791797637946]
Avg_TrainD_time:  [0.03436561955368596]
Avg_TrainC_time:  [0.04438974796937728]
Validation_time:  [12.2724638671875]
update_weights_time:  [0.00019359999895095825]
Avg_update_weights_time:  [4.127931747355187e-07]
before_train_time:  [0.0003356800079345703]
after_train_before_validation:  [0.0005851200222969055]
after_validation_before_test:  [0.0014738880395889283]
iterations_time:  [45.32078125]
iterations_print_time:  [1.2032000347971917e-05]
Test_time:  []
epoch_time:  [57.59690625]

a. There is a remarkable improvement compared to the earlier settings. However, there is still a visible gap between iterations_time and Dataloader_time + TrainD_time + TrainC_time. Can you help identify what I might be missing here?

b. I am not sure what you mean by profiling the WeightedRandomSampler calls. As evident from the code, its execution time should already be included in Dataloader_time, no?

c. I also had one other query. The only reason I am using torch.utils.data.DataLoader is to fetch one mini-batch of samples at every iteration. Is there a more efficient way to do this (for example, something along the lines of the sketch below)?
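
For concreteness, a rough sketch of the kind of alternative I have in mind, assuming sample_weights is a 1-D float tensor and the dataset returns (data, target, data_idx) tuples as in my code above (default_collate is exposed as torch.utils.data.default_collate only in recent PyTorch releases):

    import torch
    from torch.utils.data import default_collate

    # Draw a weighted mini-batch of indices directly (this is what
    # WeightedRandomSampler does internally), then index the dataset
    # and collate the samples manually.
    idx = torch.multinomial(sample_weights, batch_size, replacement=True)
    batch = [train_dataset[i] for i in idx.tolist()]
    data, target, data_idx = default_collate(batch)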

I’m a bit confused by the current results.
It seems you are not only seeing a speedup using the main process to load the data (instead of a separate worker process), but the time to update the weights is also reduced?
Did you change anything else in the code, which might explain the runtime changes?

@ptrblck No, I did not make any other change except setting num_workers = 0.