So I investigated it further and indeed, opening the HDF5 file introduces huge overhead. I tested it on this code: https://github.com/piojanu/World-Models (my implementation of the World Models paper, further WM; the memory training is written in PyTorch). Note: the code I link here doesn't have multiprocessing data-preloading capabilities; I tested that in a private repo.
I used the Pyflame profiler to profile the WM's memory module training for 30 s, sampling every 1 ms, on this hardware: Intel® Core™ i7-7700 CPU @ 3.60GHz with a GeForce GTX 1060 6GB.
Experiments:
- With data loading in the main process (DataLoader's `num_workers = 0`) and opening the HDF5 file each time in `__getitem__`:
  - Batches per second: ~0.18
  - Most of the time the data is being loaded: above 70% of the profiling time.
  - Opening the HDF5 file alone takes 20% of the profiling time! (See the sketch after this list for this access pattern.)
  - Data preprocessing and memory copies account for the last 10% of the profiling time.
  - Training the one-layer LSTM on the GPU is so fast that the profiler didn't catch it at all.
- With data loading in the main process (DataLoader's `num_workers = 0`) and opening the HDF5 file once in `__getitem__`:
  - Batches per second: ~2
  - Most of the time the data is still being loaded: ~90% of the profiling time.
  - There is no overhead from opening the HDF5 file, of course, which is why a larger proportion of the time went to loading the data.
  - The profiler was able to catch a couple of samples of LSTM training, still below 1% of the profiling time.
- With data loading in worker processes (DataLoader's `num_workers = 4`) and opening the HDF5 file once in `__getitem__`:
  - Batches per second: ~5.1
  - There is no overhead from opening the HDF5 file, and data loading is successfully overlapped with GPU execution. DataLoader's `__next__` operation (getting the next batch) in the main process takes below 1% of the profiling time, and we have full utilisation of the GTX 1060! Win
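To make the difference between the first two experiments concrete, here is a minimal sketch of the "open on every call" pattern from experiment 1. The class name and dataset key are placeholders, not the exact code from the repo; the recommended "open once" pattern is the sample code at the end of this post.

```python
import h5py
import torch


class NaiveH5Dataset(torch.utils.data.Dataset):
    """Sketch of the experiment-1 pattern: re-open the HDF5 file on every
    __getitem__ call. This per-item open is what accounted for ~20% of the
    profiling time."""

    def __init__(self, path, key="dataset"):
        self.path, self.key = path, key
        with h5py.File(path, "r") as f:       # open briefly just to read the length
            self.length = len(f[key])

    def __getitem__(self, index):
        with h5py.File(self.path, "r") as f:  # opened (and closed) once per item -- slow
            return f[self.key][index]

    def __len__(self):
        return self.length
```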
My recommendations:
- Use HDF5 in version 1.10 (better multiprocessing handling); you can check which version your h5py is linked against with the snippet below.
- Because an opened HDF5 file isn't pickleable, and the Dataset has to be serialised with pickle to be sent to the worker processes, you can't open the HDF5 file in `__init__`. Open it lazily in `__getitem__` and store the handle so it is opened only once. Do not open it on each call, as that introduces huge overhead.
- Use `DataLoader` with `num_workers > 0` (reading from HDF5, i.e. from the hard drive, is slow) and with a `batch_sampler` (random access to HDF5, i.e. to the hard drive, is slow); see the usage example after the sample code.
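For the first point, h5py exposes the HDF5 library version it was built against, so you can verify it quickly (a minimal sketch; the exact version tuple will depend on your installation):

```python
import h5py

# The HDF5 library version h5py was built against, e.g. (1, 10, 4).
print(h5py.version.hdf5_version_tuple)
assert h5py.version.hdf5_version_tuple >= (1, 10), "HDF5 >= 1.10 is recommended for multiprocessing"
```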
Sample code:
import h5py
import torch


class H5Dataset(torch.utils.data.Dataset):
    def __init__(self, path):
        self.file_path = path
        self.dataset = None
        # Open the file briefly just to read the length; the Dataset stays pickleable.
        with h5py.File(self.file_path, 'r') as file:
            self.dataset_len = len(file["dataset"])

    def __getitem__(self, index):
        # Lazily open the file once per process, on first access.
        if self.dataset is None:
            self.dataset = h5py.File(self.file_path, 'r')["dataset"]
        return self.dataset[index]

    def __len__(self):
        return self.dataset_len
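Putting the recommendations together, here is roughly how the Dataset above can be combined with a DataLoader. The file path, batch size, worker count, and the choice of `SequentialSampler` are illustrative assumptions, not the exact WM training configuration:

```python
from torch.utils.data import DataLoader, BatchSampler, SequentialSampler

dataset = H5Dataset("dataset.h5")  # placeholder path

loader = DataLoader(
    dataset,
    batch_sampler=BatchSampler(
        SequentialSampler(dataset),  # contiguous indices within a batch keep the HDF5 reads mostly sequential
        batch_size=128,
        drop_last=False,
    ),
    num_workers=4,                   # load and preprocess batches in worker processes, overlapping with GPU work
)

for batch in loader:
    pass  # forward/backward pass on the GPU goes here
```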