How to speed up training for a large dataset

Hello, everyone.
I have a large dataset of 3 million images for a face-related project.
I have been using the torchvision DataLoader for training, but training is very slow.
So I plan to convert the dataset to HDF5 or LMDB format.
Which of the two would you recommend?

Hi @alexcruz0202,

Could you share the code you are using to load the data and train the model, so that people in the community can suggest ways to make it more efficient and faster?

Generally speaking, training might be slow because:

  1. The data loading / preprocessing pipeline is not optimized (see the sketch after this list)
  2. The model isn’t being trained on efficient hardware such as a GPU

If you could give more info about your overall pipeline, you would get more targeted help from the community.
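
In the meantime, here is a minimal sketch of the usual quick wins on the loading side (the dataset and model below are just placeholders for illustration, not your code): make sure the model and batches are on the GPU, and let the DataLoader shuffle and load with multiple worker processes and pinned memory.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset and model purely for illustration -- swap in your own.
train_dataset = TensorDataset(torch.randn(1000, 3, 128, 128),
                              torch.zeros(1000, dtype=torch.long))
model = torch.nn.Sequential(torch.nn.Flatten(),
                            torch.nn.Linear(3 * 128 * 128, 2))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)  # make sure the model actually runs on the GPU

loader = DataLoader(
    train_dataset,
    batch_size=64,
    shuffle=True,      # let the DataLoader handle shuffling
    num_workers=8,     # parallel worker processes for loading / augmentation
    pin_memory=True,   # faster host-to-GPU copies
)

for imgs, targets in loader:
    imgs = imgs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    # ... forward / backward / optimizer step ...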

@faizan_shaikh Sorry for the late response.
This is my code snippet for the dataloader.

import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms


class TrainDataset(Dataset):
    def __init__(self, root="", live_sub_dirs=[], fake_sub_dirs=[]):
        self.live_sub_dirs = live_sub_dirs
        self.fake_sub_dirs = fake_sub_dirs

        self.pos_filelist = get_total_list(root, live_sub_dirs)
        self.neg_filelist = get_total_list(root, fake_sub_dirs)

        # Oversample the live (positive) images so the two classes are roughly balanced
        count = len(self.neg_filelist) // len(self.pos_filelist)
        self.total_filelist = np.asarray(self.pos_filelist * count + self.neg_filelist, dtype=str)
        np.random.shuffle(self.total_filelist)

        self.transform = transforms.Compose([
            transforms.RandomCrop(128),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor()
        ])
        self.aug = strong_aug(0.5)

    def __getitem__(self, idx):
        img_path = self.total_filelist[idx]
        img = Image.open(img_path).convert("RGB")
        # augmentation on a numpy array...
        img = self.aug(image=np.array(img))["image"]
        # ...then back to PIL for the torchvision transforms
        img = self.transform(Image.fromarray(img))

        # Label 0 for fake (spoof) images, 1 for live images
        target = 0 if "fake" in img_path.lower() else 1

        return img, torch.tensor(target, dtype=torch.long)

    def __len__(self):
        return len(self.total_filelist)

Out of the overall code, this part seems the most computationally expensive, and it could be made more efficient. If you don’t mind, could you clarify a few things?

  1. Is it necessary to convert the images to RGB? If so, you could probably do it once beforehand instead of on every access.
  2. What does the strong_aug function do? Can’t it be incorporated into self.transform? Alternatively, you could use the “albumentations” library for faster on-the-fly data augmentation (see the sketch below).
  3. Converting the PIL image to a numpy array and back seems counterintuitive if both steps are part of the same data augmentation pipeline. You could make it more seamless.

Along with this, the random shuffling in __init__ isn’t required if you rely on the PyTorch DataLoader’s “shuffle” argument.
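
For points 2 and 3, a rough sketch of what that could look like (this assumes strong_aug is an albumentations pipeline and that reading images with OpenCV instead of PIL is acceptable; the transform list below is illustrative, not your actual strong_aug):

import cv2
import torch
import albumentations as A
from albumentations.pytorch import ToTensorV2

# Sketch only: one albumentations pipeline replacing both strong_aug and the
# torchvision Compose, operating directly on numpy arrays (no PIL round trip).
train_transform = A.Compose([
    A.RandomCrop(128, 128),
    A.HorizontalFlip(p=0.5),
    # ... the rest of your strong_aug transforms here ...
    ToTensorV2(),  # HWC uint8 numpy -> CHW uint8 torch tensor (no scaling)
])

def load_and_augment(img_path):
    img = cv2.imread(img_path)                   # OpenCV reads BGR
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)   # convert once, up front
    out = train_transform(image=img)["image"]
    return out.float() / 255.0                   # scale to [0, 1] like ToTensor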

Hope this helps

I’m going through every line of my dataloader now.
Thank you for your suggestions.