DataLoader, when num_workers > 0, there is a bug

So I investigated it further, and indeed opening the HDF5 file introduces huge overhead. I tested it on this code: https://github.com/piojanu/World-Models (my implementation of the World Models (WM) paper; the memory training is written in PyTorch). Note: the code I link here doesn’t have multiprocessing data preloading capabilities; I tested that in a private repo.
I used the Pyflame profiler to profile the WM memory module training for 30 s, sampling every 1 ms, on this hardware: Intel® Core™ i7-7700 CPU @ 3.60GHz with a GeForce GTX 1060 6GB.

Experiments:

  1. With data loading in the main process (DataLoader’s num_workers = 0) and opening the HDF5 file on every __getitem__ call:
    • Batches per second: ~0.18
    • Most of the time is spent loading data: above 70% of the profiling time.
    • Opening the HDF5 file takes 20% of the profiling time!
    • Data preprocessing and memory copies take the last 10% of the profiling time.
    • Training a one-layer LSTM on the GPU is so fast that the profiler didn’t catch it.
  2. With data loading in the main process (DataLoader’s num_workers = 0) and opening the HDF5 file only once, on the first __getitem__ call:
    • Batches per second: ~2
    • Data loading still dominates, at ~90% of the profiling time.
    • There is no overhead from opening the HDF5 file any more, which is why a larger proportion of the time goes to loading the data.
    • The profiler caught a couple of samples of LSTM training, but that is still below 1% of the profiling time.
  3. With data loading in worker processes (DataLoader’s num_workers = 4) and opening the HDF5 file only once, on the first __getitem__ call:
    • Batches per second: ~5.1
    • There is no overhead from opening the HDF5 file, and data loading is successfully overlapped with GPU execution. DataLoader’s __next__ operation (getting the next batch) in the main process takes below 1% of the profiling time, and we have full utilisation of the GTX 1060! Win ;)
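
If you want to reproduce the "batches per second" comparison without a profiler, a rough timing helper like the one below is enough. This is only a sketch: `batches_per_second`, `max_batches` and `step` are names I made up for illustration, and it is not the code I used to get the numbers above. It works with any iterable, so you can pass a DataLoader directly.

```python
import time
from itertools import islice


def batches_per_second(loader, max_batches=100, step=None):
    """Consume up to `max_batches` batches and return the observed rate.

    `loader` is any iterable of batches (e.g. a DataLoader).
    `step` is an optional callable run on every batch (e.g. a training step),
    so that its cost is included in the measurement.
    """
    start = time.perf_counter()
    count = 0
    for batch in islice(loader, max_batches):
        if step is not None:
            step(batch)
        count += 1
    elapsed = time.perf_counter() - start
    # Guard against a zero interval on very fast, tiny inputs.
    return count / elapsed if elapsed > 0 else float("inf")
```

Without `step` this measures pure data loading throughput; pass your training step as `step` to get a number comparable to the full training loop.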

My recommendations:

  • Use HDF5 in version 1.10 (better multiprocessing handling).
  • An open HDF5 file isn’t picklable, and the Dataset has to be serialised with pickle to be sent to the worker processes, so you can’t keep the HDF5 file open from __init__. Open it on the first __getitem__ call and store the handle on the dataset object, like a singleton. Do not open it on every call, as that introduces huge overhead.
  • Use DataLoader with num_workers > 0 (reading from HDF5, i.e. from the hard drive, is slow) and with a batch_sampler (random access to HDF5, i.e. to the hard drive, is slow).
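
One possible reading of the batch_sampler recommendation is to make every batch a run of consecutive indices, so that each batch touches one contiguous region of the file, and to shuffle only the order of the batches. The sketch below assumes exactly that; `ContiguousBatchSampler` and its parameters are hypothetical names, not something from my repo. DataLoader accepts any iterable of index lists as `batch_sampler`, so plain Python is enough here.

```python
import random


class ContiguousBatchSampler:
    """Yield batches of consecutive indices, optionally in shuffled batch order."""

    def __init__(self, dataset_len, batch_size, shuffle=True, seed=None):
        if batch_size <= 0:
            raise ValueError("batch_size must be positive")
        self.dataset_len = dataset_len
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.rng = random.Random(seed)

    def __iter__(self):
        # Start offset of every batch; the last batch may be shorter.
        starts = list(range(0, self.dataset_len, self.batch_size))
        if self.shuffle:
            self.rng.shuffle(starts)
        for start in starts:
            stop = min(start + self.batch_size, self.dataset_len)
            yield list(range(start, stop))

    def __len__(self):
        # Number of batches, rounding up for the final partial batch.
        return (self.dataset_len + self.batch_size - 1) // self.batch_size
```

You would then build the loader as `DataLoader(dataset, batch_sampler=ContiguousBatchSampler(len(dataset), 32), num_workers=4)`; note that `batch_sampler` can’t be combined with `batch_size`, `shuffle`, `sampler` or `drop_last`. The trade-off is that the same samples always end up in the same batch, so this gives up some randomness in exchange for sequential reads.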

Sample code:

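import h5py
import torch


class H5Dataset(torch.utils.data.Dataset):
    def __init__(self, path):
        self.file_path = path
        # Keep this None here: an open h5py object can't be pickled for the workers.
        self.dataset = None
        # Open the file only briefly to read the length, then close it again.
        with h5py.File(self.file_path, 'r') as file:
            self.dataset_len = len(file["dataset"])

    def __getitem__(self, index):
        # Lazily open the file once per process (i.e. once per worker) and reuse it.
        if self.dataset is None:
            self.dataset = h5py.File(self.file_path, 'r')["dataset"]
        return self.dataset[index]

    def __len__(self):
        return self.dataset_len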
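
To try the pattern end to end, something like the following should work. It is a sketch under assumptions: the toy file, its shape, and the `demo` helper are made up for illustration, and the class is repeated only so that the snippet runs on its own. It checks that the dataset is still picklable before the first __getitem__ call and then iterates it through a DataLoader.

```python
import os
import pickle
import tempfile

import h5py
import numpy as np
from torch.utils.data import DataLoader, Dataset


class H5Dataset(Dataset):
    def __init__(self, path):
        self.file_path = path
        self.dataset = None  # opened lazily, once per process
        with h5py.File(self.file_path, "r") as file:
            self.dataset_len = len(file["dataset"])

    def __getitem__(self, index):
        if self.dataset is None:
            self.dataset = h5py.File(self.file_path, "r")["dataset"]
        return self.dataset[index]

    def __len__(self):
        return self.dataset_len


def demo(num_workers=0):
    """Write a toy HDF5 file (8 rows of 4 floats) and return the batch shapes."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy.h5")
        with h5py.File(path, "w") as f:
            data = np.arange(32, dtype=np.float32).reshape(8, 4)
            f.create_dataset("dataset", data=data)

        dataset = H5Dataset(path)
        # No handle is open yet, so the dataset can be pickled for the workers.
        pickle.dumps(dataset)

        loader = DataLoader(dataset, batch_size=4, num_workers=num_workers)
        shapes = [tuple(batch.shape) for batch in loader]

        # With num_workers=0 the main process opened the file; close it
        # before the temporary directory is removed.
        if dataset.dataset is not None:
            dataset.dataset.file.close()
    return shapes
```

Call `demo(num_workers=4)` to exercise the worker path; on platforms that start workers with spawn (Windows, macOS), do that under an `if __name__ == "__main__":` guard.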