DataLoader, when num_workers > 0, there is a bug

piojanu · February 5, 2019, 11:42am

So I investigated it further, and indeed, opening the HDF5 file introduces huge overhead. I tested it on this code: https://github.com/piojanu/World-Models (my implementation of the World Models paper, WM from here on; the memory training is written in PyTorch). Note: the code linked here doesn’t have multiprocessing data preloading; I tested that in a private repo.
I used the Pyflame profiler to profile the WM memory module training for 30 s, sampling every 1 ms, on this hardware: Intel® Core™ i7-7700 CPU @ 3.60GHz with a GeForce GTX 1060 6GB.

Experiments:

With data loading in the main process (DataLoader’s num_workers = 0) and opening the HDF5 file on every __getitem__ call:
- Batches per second: ~0.18
- Most of the time goes to loading data: above 70% of the profiling time.
- Opening the HDF5 file takes 20% of the profiling time!
- Data preprocessing and memory copies take the remaining 10% of the profiling time.
- Training a one-layer LSTM on the GPU is so fast that the profiler didn’t catch it.
With data loading in the main process (DataLoader’s num_workers = 0) and opening the HDF5 file once, on the first __getitem__ call:
- Batches per second: ~2
- Most of the time still goes to loading data: ~90% of the profiling time.
- There is no overhead from opening the HDF5 file, of course, which is why a larger proportion of the time went to loading the data.
- The profiler caught a couple of samples of LSTM training, still below 1% of the profiling time.
With data loading in worker processes (DataLoader’s num_workers = 4) and opening the HDF5 file once, on the first __getitem__ call:
- Batches per second: ~5.1
- There is no overhead from opening the HDF5 file, and data loading is successfully overlapped with GPU execution. DataLoader’s __next__ operation (getting the next batch) in the main process takes below 1% of the profiling time and we have full utilisation of the GTX 1060! Win!
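
For anyone who wants to compare their own setup against the batches-per-second figures above, here is a minimal sketch of a throughput counter. It is not the code used for the measurements in this post; the name batches_per_second and the max_seconds parameter are made up for illustration. It works on any iterable, so a DataLoader can be passed in directly.

```python
import time


def batches_per_second(batches, max_seconds=30.0):
    """Iterate over `batches` for up to `max_seconds` and return batches/second.

    `batches` can be any iterable, e.g. a torch.utils.data.DataLoader.
    """
    start = time.perf_counter()
    count = 0
    for _ in batches:
        count += 1
        if time.perf_counter() - start >= max_seconds:
            break
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else 0.0


if __name__ == "__main__":
    # Stand-in for a slow loader: each "batch" takes about 10 ms to produce.
    def slow_batches(n, delay=0.01):
        for i in range(n):
            time.sleep(delay)
            yield i

    print(f"{batches_per_second(slow_batches(20)):.1f} batches/s")
```

This only counts how fast batches arrive. To see where the time goes (file open, read, preprocessing), a sampling profiler is still needed.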

My recommendations:

Use HDF5 version 1.10 (better multiprocessing handling).
An opened HDF5 file isn’t pickleable, and the Dataset has to be serialised with pickle to be sent to the worker processes, so you can’t open the HDF5 file in __init__ and keep it open. Open it in __getitem__ on first use and store it as a singleton! Do not open it on every call, as that introduces huge overhead.
Use DataLoader with num_workers > 0 (reading from HDF5, i.e. the hard drive, is slow) and a batch_sampler (random access to HDF5, i.e. the hard drive, is slow).
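
One way to act on the batch_sampler recommendation is to make every batch a contiguous, sorted run of indices and shuffle only the order of the batches, so each batch touches one region of the file. The sketch below is an assumption about how this could look, not code from the repository above; ContiguousBatchSampler is a hypothetical name. DataLoader’s batch_sampler argument accepts any iterable that yields lists of indices, so the class needs nothing from PyTorch itself.

```python
import random


class ContiguousBatchSampler:
    """Yield batches of consecutive indices, visiting the batches in random order.

    Hypothetical helper: shuffling happens between batches, not inside them.
    """

    def __init__(self, dataset_len, batch_size, shuffle=True, seed=None):
        self.dataset_len = dataset_len
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.rng = random.Random(seed)

    def __len__(self):
        # Number of batches, counting a final partial one.
        return (self.dataset_len + self.batch_size - 1) // self.batch_size

    def __iter__(self):
        starts = list(range(0, self.dataset_len, self.batch_size))
        if self.shuffle:
            self.rng.shuffle(starts)
        for start in starts:
            stop = min(start + self.batch_size, self.dataset_len)
            yield list(range(start, stop))


if __name__ == "__main__":
    sampler = ContiguousBatchSampler(dataset_len=10, batch_size=4, shuffle=False)
    print(list(sampler))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

It would be passed as DataLoader(dataset, batch_sampler=ContiguousBatchSampler(len(dataset), 32), num_workers=4). By default the DataLoader still calls __getitem__ once per index; the gain is that those reads are adjacent. The trade-off is weaker shuffling, since samples that are neighbours in the file always land in the same batch.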

Sample code:

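import h5py
import torch


class H5Dataset(torch.utils.data.Dataset):
    def __init__(self, path):
        self.file_path = path
        # Opened lazily in __getitem__, so the Dataset stays pickleable.
        self.dataset = None
        # Open the file only briefly here to read the length.
        with h5py.File(self.file_path, 'r') as file:
            self.dataset_len = len(file["dataset"])

    def __getitem__(self, index):
        # Open once per process on first access, then reuse the handle.
        if self.dataset is None:
            self.dataset = h5py.File(self.file_path, 'r')["dataset"]
        return self.dataset[index]

    def __len__(self):
        return self.dataset_len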
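
Why the lazy open matters can be shown without HDF5 at all. The sketch below is a stand-in, not the code above: LazyFileDataset is a hypothetical class that uses an ordinary binary file in place of h5py.File, since an open Python file object also refuses to be pickled. It checks that the object pickles while the handle is still None and stops pickling once the handle has been opened.

```python
import os
import pickle
import tempfile


class LazyFileDataset:
    """Stand-in for H5Dataset: one byte per item, file opened on first access."""

    def __init__(self, path):
        self.file_path = path
        self.handle = None  # no open handle yet, so the object can be pickled
        self.length = os.path.getsize(path)

    def __getitem__(self, index):
        if self.handle is None:
            # Opened once per process, then reused.
            self.handle = open(self.file_path, "rb")
        self.handle.seek(index)
        return self.handle.read(1)[0]

    def __len__(self):
        return self.length


def is_pickleable(obj):
    try:
        pickle.dumps(obj)
        return True
    except TypeError:
        return False


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(bytes([10, 20, 30]))
    dataset = LazyFileDataset(tmp.name)
    print(is_pickleable(dataset))  # True: nothing is open yet
    print(dataset[1])              # 20, and the file is now open
    print(is_pickleable(dataset))  # False: the open handle can't be pickled
    dataset.handle.close()
    os.remove(tmp.name)
```

The practical consequence is to avoid indexing the dataset in the main process before handing it to a DataLoader with num_workers > 0, so that each worker opens its own handle.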