Looping over dataloader takes too long

Hello everyone, I have a validation dataset of approximately 1800 images, and I’ve built this dataset based on the CocoDataset. I’ve loaded this dataset into a PyTorch DataLoader using the following code.

batch_size = 32
num_workers = 6
sampler = torch.utils.data.SequentialSampler(data_val)

data_loader = DataLoader(data_val,
                         batch_size=batch_size,
                         sampler=sampler,
                         collate_fn=utils.collate_fn,
                         num_workers=num_workers)

I want to loop through this data for training, but when I try to print each batch, it's very slow.

for obj in data_loader:
    print(obj)

What could be causing the iterations to be so slow, and what should I do? I'm using the collate_fn from here.

Thank you very much

You could profile the data loading pipeline and could try to narrow down where the bottleneck is. E.g. the actual loading from drive could be slow, especially if you are using a spinning HDD, or the data processing if your CPU is not fast enough.
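Before reaching for the full profiler, a crude per-batch timing loop is often enough to narrow things down (the helper `time_loader` name is mine, and it works on any iterable, so you can pass your `data_loader` directly):

```python
import time

def time_loader(loader, max_batches=20):
    """Measure the wall-clock time between consecutive batches.

    A large first gap usually points at worker startup; consistently
    large gaps point at slow disk reads or CPU-side preprocessing.
    """
    times = []
    start = time.perf_counter()
    for i, batch in enumerate(loader):
        now = time.perf_counter()
        times.append(now - start)
        start = now
        if i + 1 >= max_batches:
            break
    return times
```

For example, `time_loader(data_loader, max_batches=10)` returns the seconds spent producing each of the first ten batches.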

How can data loading be profiled? I tried the code below, but it never gets past the first iteration, so I can't see any results.

from torch.profiler import profile, record_function

with profile(record_shapes=True, profile_memory=True, use_cuda=True) as prof:
    with record_function("dataloader"):
        for obj in data_loader:
            print(obj)

Is your script hanging or is it just slow?

Sorry, it turns out my code is still stuck.
Here's my data reader code:

class PneumoDetection(data.Dataset):
  def __init__(self, root, annotation, transform=None):
    self.root = root
    self.annotation = annotation
    self.transform = transform
    self.coco = COCO(annotation)
    self.ids = list(sorted(self.coco.imgs.keys()))

  def __len__(self):
    return len(self.ids)

  def __getitem__(self, index):
    coco = self.coco
    img_id = self.ids[index]
    ann_ids = coco.getAnnIds(imgIds=img_id)

    coco_annotation = coco.loadAnns(ann_ids)
    path = coco.loadImgs(img_id)[0]['file_name']
    img = Image.open(f"{self.root}/{path}")

    obj_len = len(coco_annotation)

    # Convert COCO [x, y, w, h] boxes to [xmin, ymin, xmax, ymax]
    boxes = []
    areas = []
    for i in range(obj_len):
      xmin = coco_annotation[i]['bbox'][0]
      ymin = coco_annotation[i]['bbox'][1]
      xmax = xmin + coco_annotation[i]['bbox'][2]
      ymax = ymin + coco_annotation[i]['bbox'][3]
      boxes.append([xmin, ymin, xmax, ymax])
      areas.append(coco_annotation[i]['area'])

    boxes = torch.as_tensor(boxes, dtype=torch.float32)
    areas = torch.as_tensor(areas, dtype=torch.float32)

    labels = torch.ones((obj_len,), dtype=torch.int64)
    img_id = torch.tensor([img_id])
    iscrowd = torch.zeros((obj_len,), dtype=torch.int64)

    result_annotation = {
        'boxes': boxes,
        'labels': labels,
        'iscrowd': iscrowd,
        'areas': areas,
        'img_id': img_id,
    }

    if self.transform is not None:
      img = self.transform(img)

    return img, result_annotation

    
def get_transform():
  custom_transform = []
  custom_transform.append(torchvision.transforms.ToTensor())
  return torchvision.transforms.Compose(custom_transform)

def collate_fn(batch):
  return tuple(zip(*batch))
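For reference, `tuple(zip(*batch))` just transposes the batch, turning a list of `(img, target)` pairs into a tuple of images and a tuple of targets, so variable-sized detection targets are never stacked into one tensor. A toy example with dummy placeholder values:

```python
def collate_fn(batch):
    # Transpose a list of (img, target) samples into
    # (tuple_of_imgs, tuple_of_targets).
    return tuple(zip(*batch))

# Dummy stand-ins for images and annotation dicts
batch = [("img0", {"boxes": [1, 2, 3, 4]}),
         ("img1", {"boxes": [5, 6, 7, 8]})]
imgs, targets = collate_fn(batch)
# imgs == ("img0", "img1"); targets is the pair of annotation dicts
```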

and here's my initialization

train_path = r"Dataset\rsna\processed_train"
val_path = r"Dataset\rsna\processed_test"
ann_train = 'processed_train.json'
ann_val = 'processed_test.json'

data_train = PneumoDetection(train_path, ann_train, transform=get_transform())
data_val = PneumoDetection(val_path, ann_val, transform=get_transform())

and this is my data loader code

batch_size = 2
num_workers = 6

sampler_train = torch.utils.data.RandomSampler(data_train)
sampler_val = torch.utils.data.SequentialSampler(data_val)

batch_sampler_train = torch.utils.data.BatchSampler(
    sampler_train, batch_size, drop_last=True)

data_loader_train = DataLoader(data_train, batch_sampler=batch_sampler_train,
                                collate_fn=collate_fn, num_workers=num_workers)
data_loader_val = DataLoader(data_val, batch_size, sampler=sampler_val,
                                drop_last=False, collate_fn=collate_fn, num_workers=num_workers)

When I try to iterate as in the code below, it just hangs:

chek = next(iter(data_loader_train))
chek

Can you please tell me where my code is wrong? Thank you.

I don’t know what’s causing the hang, but you could try to set the number of workers to zero to check if this would work.

It really does! Thank you so much. Any idea why the number of workers impacts the iteration?

In your setup the impact is a hang and should not happen. However, I don’t know what’s causing it as I was never able to reproduce the issue and it might be system-specific.
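In case it helps others who hit this: one common system-specific cause of exactly this symptom (and the Windows-style paths above suggest it may apply here) is a missing `if __name__ == "__main__":` guard. On Windows, `num_workers > 0` makes the DataLoader spawn worker processes that re-import the main script, so any unguarded loader creation runs again in every worker and the iteration can hang. A minimal sketch with a toy dataset:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    # Tiny stand-in dataset: item i is the 1-element tensor [i]
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return torch.tensor([idx])

def first_batch(num_workers):
    loader = DataLoader(ToyDataset(), batch_size=2, num_workers=num_workers)
    return next(iter(loader))

if __name__ == "__main__":
    # Without this guard, spawned worker processes re-import the script,
    # hit the DataLoader creation again, and the loop can hang on Windows.
    print(first_batch(num_workers=2).shape)
```

This is only a guess at the root cause, but it would also explain why `num_workers=0` (no worker processes at all) makes the hang disappear.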