How to speed up the data loader

In my experience, the best approach is to first build an HDF5 file containing all your images, which is easy to do by following the h5py documentation at http://docs.h5py.org/en/latest/. During training, build a class inheriting from Dataset that returns your images. Something along these lines:

import h5py
import torch

class dataset_h5(torch.utils.data.Dataset):
    def __init__(self, in_file):
        super(dataset_h5, self).__init__()

        self.file = h5py.File(in_file, 'r')
        self.n_images, self.nx, self.ny = self.file['images'].shape

    def __getitem__(self, index):
        # read a single image from the HDF5 dataset on demand
        image = self.file['images'][index, :, :]
        return image.astype('float32')

    def __len__(self):
        return self.n_images

Then you can build your loader with:

self.train_loader = torch.utils.data.DataLoader(dataset_h5(train_file),
                                                batch_size=16, shuffle=True)
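
Consuming the loader then works like any other PyTorch training loop. A minimal sketch (the random tensors here are just a stand-in for the HDF5-backed dataset, so it runs on its own; dataset_h5 plugs in the same way):

```python
import torch

# stand-in data: 64 images of 32x32, in place of the HDF5-backed dataset
images = torch.randn(64, 32, 32)
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(images), batch_size=16, shuffle=True)

for (batch,) in loader:
    # each batch is a float tensor of shape (16, 32, 32);
    # this is where the forward/backward pass would go
    break
```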

My experience with this approach has been quite positive, with the GPU always at 100% even though I’m loading pretty heavy images.
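
For completeness, building the HDF5 file itself is only a few lines. A sketch with made-up file and image sizes (the dataset name 'images' has to match what the Dataset class reads):

```python
import h5py
import numpy as np

# stand-in for your real images: 100 images of 64x64
images = np.random.rand(100, 64, 64).astype('float32')

# write all images into a single HDF5 dataset named 'images'
with h5py.File('train_images.h5', 'w') as f:
    f.create_dataset('images', data=images)

# quick sanity check that the file reads back with the expected shape
with h5py.File('train_images.h5', 'r') as f:
    print(f['images'].shape)  # (100, 64, 64)
```

For large image collections you may also want to pass `chunks=True` (or an explicit chunk shape) to `create_dataset`, so that reading one image at a time in `__getitem__` doesn't pull in the whole array.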
