DataLoader slows down exponentially

pbjarterot · November 18, 2021, 12:01pm

I have two different dataloaders for the same dataset. The first one runs through the model order-agnostically (shuffled batches of 2048), but the second takes 315 consecutive images from a csv file as each batch.

The iteration is basically identical except for one layer in the model. Measuring only the time for this code snippet, the data loading time increases seemingly exponentially:

data_loading_time = time.time()
for batch_idx, (data2, _) in enumerate(train_loader2):
    data_loading_time2 = time.time()
    print("data loading time", data_loading_time2  - data_loading_time)
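One thing worth checking in the snippet above: `data_loading_time` is set once before the loop and never reset, so each printed value is the total time elapsed since the loop started, not the time spent on that batch. Below is a minimal sketch of per-batch timing, assuming the goal is to measure only the wait for each batch. `timed_batches` is a hypothetical helper (not part of PyTorch) and should work with any iterable, including a DataLoader.

```python
import time

def timed_batches(loader):
    """Yield (batch_idx, batch, seconds spent waiting for this batch)."""
    start = time.perf_counter()
    for batch_idx, batch in enumerate(loader):
        wait = time.perf_counter() - start
        yield batch_idx, batch, wait
        # Restart the clock only after the consumer has finished with the
        # batch, so the loop body's own work is not counted as loading time.
        start = time.perf_counter()

# Hypothetical usage with the loader from this post:
# for batch_idx, (data2, _), wait in timed_batches(train_loader2):
#     print("data loading time", wait)
```

If the per-batch numbers stay flat while the original print keeps growing, the growth is just the running total rather than a slowdown in the DataLoader.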

With num_workers=0 the loop uses more of the CPU (~60%) than with num_workers=8, where it only uses 12%.
(In the second epoch with num_workers=0, CPU usage was at 30%.)
How do I structure my dataloader so that the data loading time does not increase exponentially?

The code for the dataloaders is here:

train_set, test_set = torch.utils.data.random_split(dataset, [int(0.8*len(dataset)), int(0.2*len(dataset))])
train_loader = DataLoader(dataset=train_set, batch_size=2048, shuffle=True, **kwargs)
test_loader = DataLoader(dataset=test_set, batch_size=2048, shuffle=True, **kwargs)

train_set2 = torch.utils.data.Subset(dataset, list(range(0, 504000)))
test_set2 = torch.utils.data.Subset(dataset, list(range(504000, 630000)))
train_loader2 = DataLoader(dataset=train_set2, batch_size=315, shuffle=False, num_workers=0, pin_memory=True)
test_loader2 = DataLoader(dataset=test_set2, batch_size=315, shuffle=False, num_workers=0, pin_memory=True)
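
For reference, with shuffle=False and batch_size=315 the default sampler walks each subset in index order, so batch k covers subset positions 315*k through 315*k + 314. The sketch below reproduces that arithmetic in plain Python for the sizes used above. `contiguous_batches` is a hypothetical helper for illustration only, and it assumes each run of 315 images starts at a multiple of 315 in the csv file.

```python
def contiguous_batches(num_samples, batch_size):
    """Return (start, stop) index pairs, one per batch, with stop exclusive."""
    return [
        (start, min(start + batch_size, num_samples))
        for start in range(0, num_samples, batch_size)
    ]

# Sizes taken from the Subset ranges in this post.
train_batches = contiguous_batches(504000, 315)
test_batches = contiguous_batches(126000, 315)

print(len(train_batches), train_batches[0], train_batches[-1])
# 1600 (0, 315) (503685, 504000)
print(len(test_batches))
# 400
```

Both splits divide evenly by 315, so neither loader produces a partial batch at the end. Positions in test_set2 are relative to the subset, so its position 0 corresponds to index 504000 of the original dataset.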