I have two different dataloaders for the same dataset. The first one feeds the model in no particular order, while the second takes 315 consecutive images (as listed in a CSV file) as each batch.
The training loops are basically identical except for one layer in the model, and measuring only the time for this snippet, the data loading time seems to increase exponentially:
```python
data_loading_time = time.time()
for batch_idx, (data2, _) in enumerate(train_loader2):
    data_loading_time2 = time.time()
    print("data loading time", data_loading_time2 - data_loading_time)
```
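For reference, what I am trying to measure is the per-batch loading cost. A self-contained sketch of a timing helper that resets the clock on each iteration, so every delta covers only one batch rather than accumulating (the `timed_batches` helper and the dummy loader are illustrative stand-ins, not my actual code):

```python
import time

def timed_batches(loader):
    """Yield (batch, seconds) pairs, restarting the timer each iteration
    so each measurement covers only that batch's loading time."""
    start = time.perf_counter()  # perf_counter: monotonic, good for intervals
    for batch in loader:
        elapsed = time.perf_counter() - start
        yield batch, elapsed
        start = time.perf_counter()  # reset so the next delta is per-batch

# hypothetical stand-in for train_loader2: any iterable of batches works
dummy_loader = ([i] * 4 for i in range(3))
per_batch = [dt >= 0 for _, dt in timed_batches(dummy_loader)]
```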
As an aside, the loop seems to use more CPU (~60%) with num_workers=0 than with num_workers=8, where it only uses 12% (the second epoch with num_workers=0 had CPU usage at 30%).

How do I structure my dataloader so that the time to load data does not increase exponentially?

The code for the dataloaders is here:
```python
train_set, test_set = torch.utils.data.random_split(
    dataset, [int(0.8 * len(dataset)), int(0.2 * len(dataset))])
train_loader = DataLoader(dataset=train_set, batch_size=2048, shuffle=True, **kwargs)
test_loader = DataLoader(dataset=test_set, batch_size=2048, shuffle=True, **kwargs)

train_set2 = torch.utils.data.Subset(dataset, [i for i in range(0, 504000)])
test_set2 = torch.utils.data.Subset(dataset, [i for i in range(504000, 630000)])
train_loader2 = DataLoader(dataset=train_set2, batch_size=315, shuffle=False,
                           num_workers=0, pin_memory=True)
test_loader2 = DataLoader(dataset=test_set2, batch_size=315, shuffle=False,
                          num_workers=0, pin_memory=True)
```
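For comparison, here is a minimal runnable sketch of the second pair of loaders with worker processes kept alive across epochs; the toy `TensorDataset` and its sizes are arbitrary placeholders for my real dataset, and `persistent_workers` is a standard `DataLoader` flag (available since PyTorch 1.7):

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

# Toy stand-in for the real dataset; shapes and sizes here are arbitrary.
dataset = TensorDataset(torch.randn(1260, 3), torch.zeros(1260))

# A range() can be passed to Subset directly; no list comprehension needed.
train_set2 = Subset(dataset, range(0, 945))
test_set2 = Subset(dataset, range(945, 1260))

# persistent_workers keeps worker processes alive between epochs, avoiding
# the per-epoch worker startup cost (it requires num_workers > 0).
train_loader2 = DataLoader(train_set2, batch_size=315, shuffle=False,
                           num_workers=2, persistent_workers=True,
                           pin_memory=True)
```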