How to make my custom Dataloader faster?

I have stored my input images in HDF5 files, with training, evaluation, and testing each in a separate group. Each group contains the datasets 'inputs' and 'labels'. I do this because I cannot load my whole dataset into memory. The dataset class looks like the following:

import h5py
import torch
from torchvision import transforms


class dataset_h5(torch.utils.data.Dataset):
    """
    Reads in a dataset from an HDF5 file.
    """
    def __init__(self, in_file, mode='training'):
        super(dataset_h5, self).__init__()
        self.cuda = torch.device('cuda:0')

        # Open the requested group ('training', 'evaluation' or 'testing')
        self.file = h5py.File(in_file, 'r')[mode]
        self.n_images, self.channels, self.nx, self.ny = self.file['inputs'].shape
        self.n_images_check, self.nfeatures = self.file['labels'].shape

        if self.n_images != self.n_images_check:
            print('Number of input samples does not match number of label samples!')

        # Normalization statistics live in a separate group of the same file
        norm_set = h5py.File(in_file, 'r')['normalization']
        self.data_mean = norm_set['mean'][0]
        self.data_std = norm_set['std'][0]
        self.transform = transforms.Normalize((self.data_mean,), (self.data_std,))

    def __getitem__(self, index):
        input = self.transform(torch.tensor(self.file['inputs'][index].astype('float32')))
        labels = torch.tensor(self.file['labels'][index].astype('float32'))
        # Transform to tensor to move to GPU
        return input.to(self.cuda), labels.to(self.cuda)

    def __len__(self):
        return self.n_images
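
For reference, here is a hypothetical sketch of the file layout this class expects (the group and dataset names match my file; the shapes are made up for illustration):

    import h5py
    import numpy as np

    with h5py.File('data.h5', 'w') as f:
        # One group per split, each with 'inputs' and 'labels' datasets
        for mode in ('training', 'evaluation', 'testing'):
            g = f.create_group(mode)
            g.create_dataset('inputs', data=np.zeros((10, 1, 440, 400), dtype='float32'))
            g.create_dataset('labels', data=np.zeros((10, 3), dtype='float32'))
        # Normalization statistics in their own group
        norm = f.create_group('normalization')
        norm.create_dataset('mean', data=np.array([0.5], dtype='float32'))
        norm.create_dataset('std', data=np.array([0.2], dtype='float32'))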

Then I load the data and create a DataLoader like this:

train = util.dataset_h5(datafile_name, mode='training')
train_loader = torch.utils.data.DataLoader(train, batch_size=batch_size, shuffle=False, num_workers=0)

But the process is insanely slow. I don't know why, but loading a batch from the dataset takes as long as training the network on it. Does anybody know how I could improve my dataset class?
Thanks!

Setting num_workers to 0 means that the main process does the data loading when needed.
What machine are you running this on? Based on the number of cores, set num_workers to 2 * num_cores. That way, num_workers subprocesses will be launched to load the images in parallel.
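
For example, a minimal sketch of that change applied to your loader above (os.cpu_count() is one way to get the core count):

    import os

    num_cores = os.cpu_count()
    train_loader = torch.utils.data.DataLoader(
        train,                       # the dataset_h5 instance from above
        batch_size=batch_size,
        shuffle=False,
        num_workers=2 * num_cores,   # subprocesses that load batches in parallel
    )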

Thank you! I changed my dataset to output CPU tensors to support multiprocessing. Then I move each batch to the GPU after loading. It is still not very fast, but already better than before.
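
Concretely, the pattern now looks roughly like this (the loop body is abbreviated, not my actual training code):

    device = torch.device('cuda:0')
    for inputs, labels in train_loader:
        # The dataset returns CPU tensors; move each batch to the GPU explicitly
        inputs = inputs.to(device)
        labels = labels.to(device)
        # ... forward / backward pass ...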

We can make it faster by tuning the batch size. Could you give a breakdown of the time taken for data loading + forward pass + backward pass per batch?

I was actually able to speed up the dataloader. Using pin_memory=True helped a lot, and I figured out that I was storing some tensors still attached to the computation graph in my early-stopping function, which made the process slow down over time.
Right now I am using batches of 64 samples; the time to grab a batch is around 0.4 s, and the overhead for the forward + backward pass is around 22 s. I am using 440x400 images since the features are very small.
P.S.: Times were measured using time.time(), which is probably misleading for the data loading, since I am using num_workers = 8.
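
In case it is useful, roughly the loader settings I ended up with; non_blocking=True is an extra I added here, which pairs with pinned memory to overlap host-to-device copies with compute:

    train_loader = torch.utils.data.DataLoader(
        train, batch_size=64, shuffle=False,
        num_workers=8, pin_memory=True,  # pinned memory speeds up CPU-to-GPU copies
    )
    for inputs, labels in train_loader:
        inputs = inputs.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)

The early-stopping fix was simply to keep plain floats instead of loss tensors, e.g. val_losses.append(loss.item()) rather than val_losses.append(loss) (val_losses is a hypothetical name), so the computation graph is not kept alive between iterations.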


How did you manage to use num_workers > 0 with your dataloader using HDF5? That should cause several issues related to the HDF5 format, which can be avoided by opening the file in the __getitem__ method instead of the __init__ method.


Yes, you are right about the issues. Here is the updated Dataset class:

class dataset_h5(torch.utils.data.Dataset):
    """
    Reads in a dataset from an HDF5 file, opening the file lazily in each
    worker process so that num_workers > 0 works.
    """
    def __init__(self, in_file, mode='training'):
        super(dataset_h5, self).__init__()
        self.file_path = in_file
        self.dataset_mode = mode
        self.dataset = None

        # Open the file only briefly to read the shapes and the normalization
        # statistics; keep no handle around for the worker processes to inherit
        with h5py.File(self.file_path, 'r') as file:
            self.n_images, self.channels, self.nx, self.ny = file[self.dataset_mode]['inputs'].shape
            self.n_images_check, self.nfeatures = file[self.dataset_mode]['labels'].shape
            self.data_mean = file['normalization']['mean'][0]
            self.data_std = file['normalization']['std'][0]

        if self.n_images != self.n_images_check:
            print('Number of input samples does not match number of label samples!')

        self.transform = transforms.Normalize((self.data_mean,), (self.data_std,))

    def __getitem__(self, index):
        # Each worker opens its own file handle on first access
        if self.dataset is None:
            self.dataset = h5py.File(self.file_path, 'r')[self.dataset_mode]
        input = self.transform(torch.tensor(self.dataset['inputs'][index].astype('float32')))
        labels = torch.tensor(self.dataset['labels'][index].astype('float32'))

        return input, labels

    def __len__(self):
        return self.n_images

Thanks for the update, James.

Did you, by any chance, try zarr instead of hdf5? It is supposed to circumvent this very issue, but I did not have time to dig deeper into it, particularly since zarr does not work with hdf5 files directly but requires the data to be stored with zarr as well.

Also, what does your GPU utilization look like with your code above? Do you spot any drops in workload after iterations?
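
Just to illustrate what I have in mind with zarr, a hypothetical sketch assuming the data were re-stored in zarr with the same group layout (the path and shapes are made up, and I have not tested this):

    import zarr

    root = zarr.open('data.zarr', mode='r')   # placeholder path
    inputs = root['training/inputs']          # lazy, chunked array access
    x = inputs[0]                             # reads only the needed chunks;
                                              # safe to use from multiple workers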

No worries. I have not done any work with zarr before, and I am not planning to, since speed is not an issue anymore at the moment. GPU utilization stays the same over the whole training. Weirdly, I had some issues before, depending on the PyTorch version, as mentioned above, but using pin_memory=True seemed to solve them.

Update: While this code was running flawlessly under version '1.0.1', it somehow fails under '1.0.1.post2'. The following error appears after varying amounts of time:

    KeyError: 'Traceback (most recent call last):
      File "/home/catrueeb/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
        samples = collate_fn([dataset[i] for i in batch_indices])
      File "/home/catrueeb/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 138, in <listcomp>
        samples = collate_fn([dataset[i] for i in batch_indices])
      File "/home/catrueeb/utils/cnn_utils.py", line 45, in __getitem__
        input = self.transform(torch.tensor(self.dataset[\'inputs\'][index, :, :].astype(\'float32\')))
      File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
      File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
      File "/home/catrueeb/.local/lib/python3.7/site-packages/h5py/_hl/group.py", line 262, in __getitem__
        oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
      File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
      File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
      File "h5py/h5o.pyx", line 190, in h5py.h5o.open
    KeyError: \'Unable to open object (wrong B-tree signature)\''

Edit:

    torch.multiprocessing.set_start_method('spawn')

solved it.
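
For completeness, a minimal sketch of where that call would go (the surrounding script is assumed):

    import torch

    if __name__ == '__main__':
        # Must run once in the main process, before any workers are spawned
        torch.multiprocessing.set_start_method('spawn')

        train = util.dataset_h5(datafile_name, mode='training')
        train_loader = torch.utils.data.DataLoader(
            train, batch_size=64, num_workers=8, pin_memory=True)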
