Multiple DataLoaders: why is the first DataLoader slower than the other?

I have two datasets, each containing 60k RGB images (320x320) saved as .pt files, and two DataLoaders (one per dataset). Both DataLoaders use the same parameters: batch_size=256, num_workers=6, shuffle=True, drop_last=True. In each epoch, I get batches from the DataLoaders as in this pseudocode:

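import torch
from torch.utils import data

data_generator1 = data.DataLoader(...)
data_generator2 = data.DataLoader(...)

num_batches = len(data_generator1)  # note: both generators have the same length

# CUDA events for measuring execution times (elapsed_time returns milliseconds)
start_batch1 = torch.cuda.Event(enable_timing=True)
end_batch1 = torch.cuda.Event(enable_timing=True)
start_batch2 = torch.cuda.Event(enable_timing=True)
end_batch2 = torch.cuda.Event(enable_timing=True)

for epoch in range(epochs):
    t_batch1 = []
    t_batch2 = []
    generator_iterator1 = iter(data_generator1)
    generator_iterator2 = iter(data_generator2)
    for i in range(num_batches):
        try:
            # Get a batch from the first dataloader and time the fetch
            start_batch1.record()
            batch1, labels1 = next(generator_iterator1)
            end_batch1.record()
            torch.cuda.synchronize()  # wait until both events have completed
            t_batch1.append(start_batch1.elapsed_time(end_batch1))

            # Get a batch from the second dataloader and time the fetch
            start_batch2.record()
            batch2, labels2 = next(generator_iterator2)
            end_batch2.record()
            torch.cuda.synchronize()
            t_batch2.append(start_batch2.elapsed_time(end_batch2))
        except StopIteration:
            # dataloaders are empty
            break

        # Train model....

    print("Avg time [ms] batch1: ", sum(t_batch1) / len(t_batch1))
    print("Avg time [ms] batch2: ", sum(t_batch2) / len(t_batch2))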

The problem is that batch1, labels1 = next(generator_iterator1) is on average 5x slower than batch2, labels2 = next(generator_iterator2): about 1500 milliseconds vs 300 milliseconds per batch. With 60000/256 = 234 batches per epoch, that adds up to 5.85 minutes vs 1.17 minutes per epoch. Of course, I would like both to take as little time as possible in order to speed up training.

At first I thought the problem was in the data (maybe dataset1 is “heavier” than dataset2, or maybe I saved them in the .pt files with different data types). However, when I swapped the order of the two calls, so that batch2, labels2 = next(generator_iterator2) comes first and batch1, labels1 = next(generator_iterator1) second, generator_iterator2 became the one that is 5x slower. So the cause is clearly not the data itself, but which DataLoader is fetched first: whichever one I fetch from first is always the slowest to return its batches. Does anybody know why this is happening?

EDIT: I forgot to say that I am using pytorch on Ubuntu.

How did you measure the loading time?
Note that both DataLoaders use multiple workers, which load the data in the background.
If you just stopped the time after batch1 was returned and then again after batch2, you have to consider that generator_iterator2 was working the whole time in the background.
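To illustrate the effect, here is a minimal simulation using only the standard library. It does not use DataLoader at all: two background threads stand in for the worker processes, and the names start_prefetcher and measure_waits are made up for this sketch. Both "loaders" produce batches at exactly the same speed, yet the one that is fetched first tends to absorb almost all of the measured waiting time.

```python
import queue
import threading
import time


def start_prefetcher(num_batches, seconds_per_batch, prefetch=2):
    """Produce batches in a background thread, like a worker filling a queue."""
    q = queue.Queue(maxsize=prefetch)

    def produce():
        for i in range(num_batches):
            time.sleep(seconds_per_batch)  # stand-in for loading one batch
            q.put(i)

    threading.Thread(target=produce, daemon=True).start()
    return q


def measure_waits(num_batches=10, seconds_per_batch=0.05):
    """Fetch from two equally fast prefetchers in a fixed order and time each wait."""
    q1 = start_prefetcher(num_batches, seconds_per_batch)
    q2 = start_prefetcher(num_batches, seconds_per_batch)
    wait1 = wait2 = 0.0
    for _ in range(num_batches):
        t0 = time.perf_counter()
        q1.get()  # blocks until the first loader has a batch ready
        t1 = time.perf_counter()
        q2.get()  # the second loader kept working during the wait above
        t2 = time.perf_counter()
        wait1 += t1 - t0
        wait2 += t2 - t1
    # Average wait per batch, in seconds
    return wait1 / num_batches, wait2 / num_batches


if __name__ == "__main__":
    avg1, avg2 = measure_waits()
    print(f"avg wait, fetched first:  {avg1 * 1000:.1f} ms")
    print(f"avg wait, fetched second: {avg2 * 1000:.1f} ms")
```

If the same thing happens with the real DataLoaders, the per-loader numbers are not very meaningful on their own; the sum of both waits per iteration is the more useful figure to compare.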

Depending on your system, you might also want to play around with the total number of workers, since 12 worker processes (6 per DataLoader) seems quite high.
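As a starting point for that experiment, a rule of thumb (an assumption on my side, not something PyTorch prescribes) is to keep the total number of workers at or below the number of CPU cores. The helper name workers_per_loader below is hypothetical and only sketches that idea:

```python
import os


def workers_per_loader(num_loaders, available_cores=None):
    """Split the available CPU cores evenly across loaders (at least 1 each).

    This is only a heuristic starting value for num_workers; the best
    setting still has to be found by measuring on the actual machine.
    """
    if available_cores is None:
        # os.cpu_count() may return None if the count cannot be determined
        available_cores = os.cpu_count() or 1
    return max(1, available_cores // num_loaders)


if __name__ == "__main__":
    print("suggested num_workers per DataLoader:", workers_per_loader(2))
```

The returned value would then be passed as num_workers to each DataLoader and adjusted up or down depending on the measured loading times.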

I edited my question to show how I measure the loading time (see the pseudocode above). I also tried different numbers of workers (2, 3, 4, 6, 12); in each case the first dataloader is always the slowest one to return batches. However, this is not the case with num_workers = 1, where the first dataloader is not always the slowest. For example, with num_workers = 1 I get these results:

#Epoch 0
Avg time [ms] batch1:  1455.2
Avg time [ms] batch2:  274.23

#Epoch 1
Avg time [ms] batch1:  35.30
Avg time [ms] batch2:  1379.09

#Epoch 2
Avg time [ms] batch1:  973.86
Avg time [ms] batch2:  716.08

#Epoch 3
Avg time [ms] batch1:  8.59
Avg time [ms] batch2:  1817.66

#Epoch 4
Avg time [ms] batch1:  1077.48
Avg time [ms] batch2:  787.06

#Epoch 5
Avg time [ms] batch1:  25.82
Avg time [ms] batch2:  1840.94

Thus, with num_workers = 1 one of the two dataloaders is often much slower than the other, though the slower one is not always the first. I should also mention that I am running on Google Compute Engine (a Google Cloud machine with just 2 CPUs and a Tesla K80 GPU). I don’t understand why the loading times of the two dataloaders differ so much. Ideally, I would like them to take the same amount of time (and as little as possible). What can I do to achieve that? If I merged both datasets into a single dataloader, would that make any difference?
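For reference, this is roughly what I mean by merging: a minimal sketch of a wrapper that returns one sample from each dataset per index, so that a single DataLoader (with a single pool of workers) serves both. The class name PairedDataset is made up, it assumes both datasets have the same length and return (image, label) pairs, and I have not measured whether it actually changes the timings.

```python
class PairedDataset:
    """Map-style dataset that returns one sample from each of two datasets.

    It only implements __len__ and __getitem__; subclassing
    torch.utils.data.Dataset would be the conventional way to write it.
    """

    def __init__(self, dataset1, dataset2):
        if len(dataset1) != len(dataset2):
            raise ValueError("datasets must have the same length")
        self.dataset1 = dataset1
        self.dataset2 = dataset2

    def __len__(self):
        return len(self.dataset1)

    def __getitem__(self, index):
        image1, label1 = self.dataset1[index]
        image2, label2 = self.dataset2[index]
        return image1, label1, image2, label2


# Usage sketch (requires PyTorch; dataset1 and dataset2 are the two existing datasets):
# loader = torch.utils.data.DataLoader(PairedDataset(dataset1, dataset2),
#                                      batch_size=256, num_workers=2,
#                                      shuffle=True, drop_last=True)
# for batch1, labels1, batch2, labels2 in loader:
#     ...  # train
```

One difference from two separate DataLoaders: with shuffle=True the same shuffled index is used for both datasets, so sample i of dataset1 is always paired with sample i of dataset2, whereas two independent loaders shuffle independently.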