HDF5 Multi-Threaded Alternative

We use HDF5 for our dataset, which consists of the following:
12x94x168 (12-channel image; three RGB images) byte tensor
128x23x41 (metadata input, an additional input to the net) binary tensor
1x20 (target data or “labels”) byte tensor (values really 0-100)

We have lots of data stored as numpy arrays inside HDF5 (2.8 TB), which we then load and convert in a PyTorch Dataset object. The problem we recently ran into is that HDF5 doesn’t support multi-threaded data access with num_workers > 1 in the data loader. Our GPUs are capable of processing these data points at 1 kHz, but this limits us to only 200 Hz. We are open to changing the data format, but need to do it quickly. I know this is an open-ended question, but it would be great if you could suggest some alternative options to speed up training.
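For concreteness, a Dataset over this kind of layout might look roughly like the sketch below. The file path and the dataset names ("images", "meta", "targets") are assumptions, not from the original post; substitute your own layout.

```python
import h5py
import numpy as np
import torch
from torch.utils.data import Dataset


class H5SampleDataset(Dataset):
    """Sketch of a Dataset over the layout described above.

    The path and the dataset keys are hypothetical; adapt them
    to the actual file structure.
    """

    def __init__(self, path):
        self.path = path
        # Open briefly just to read the number of samples.
        with h5py.File(path, "r") as f:
            self.length = f["images"].shape[0]

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        # Opening per item is safe with multiple workers, at the cost
        # of paying the file-open overhead on every access.
        with h5py.File(self.path, "r") as f:
            image = torch.from_numpy(f["images"][idx])    # e.g. 12x94x168 uint8
            meta = torch.from_numpy(f["meta"][idx])       # e.g. 128x23x41
            target = torch.from_numpy(f["targets"][idx])  # e.g. 1x20 uint8
        return image, meta, target
```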


This may help you:

"Concurrent access to one or more HDF5 file(s) from multiple threads in the same process will not work with a non-thread-safe build of the HDF5 library. The pre-built binaries that are available for download are not thread-safe.

Users are often surprised to learn that (1) concurrent access to different datasets in a single HDF5 file and (2) concurrent access to different HDF5 files both require a thread-safe version of the HDF5 library. Although each thread in these examples is accessing different data, the HDF5 library modifies global data structures that are independent of a particular HDF5 dataset or HDF5 file. HDF5 relies on a semaphore around the library API calls in the thread-safe version of the library to protect the data structure from corruption by simultaneous manipulation from different threads. Examples of HDF5 library global data structures that must be protected are the freespace manager and open file lists."

Also:
http://cyrille.rossant.net/moving-away-hdf5/

It appears that you are just reading the HDF5 files, in which case, “QuantScientist (Solomon K)”'s suggestion to use multi-processing rather than multi-threading should be all you need. HDF5 supports multiple readers already, and with HDF5-1.10, the SWMR (Single Writer Multiple Reader) feature was also introduced.

HDF Helpdesk

@hdfhelp

The usage pattern is:

  1. open hdf5
  2. fork processes
  3. read

However, the forked processes will corrupt the open state of the HDF5 file. Is there any way around this, e.g. synchronized file opening? Opening the HDF5 file in the forked processes is not an alternative, as it is too slow (there is a penalty for opening the file).

There may not be a good solution for you. You cannot do multi-threading, it seems, nor can you do multi-process opens, since opening the file is too expensive. It is unclear why the fork-then-use pattern is failing, but it is not surprising that there are issues.

High levels of concurrency are a problem that needs to be addressed soon.

Using Python 3 and adding this at the very top of your main script will help fix the forking issues:

import torch.multiprocessing as mp

if __name__ == "__main__":
    mp.set_start_method("spawn")

Thanks a lot for the help so far. After adding torch.multiprocessing.set_start_method("spawn"), new problems arise:

TypeError: can't pickle File objects

Somehow, file objects or group nodes (pytables reader) cannot be pickled. Is there a way around this?

I struggled with the same issue recently and managed to overcome the problem by moving the hdf5 opening code into Dataset.__getitem__(x) method of my custom class. It works and, according to my experiments (I compared data loading speed with the setup where I have multiple small hdf5 files, representing the objects from the big file), it works slightly faster. More important is the fact that it finally works with num_workers > 1 in the DataLoader.
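A minimal version of that pattern might look like the sketch below (the path and the dataset key are placeholders, not from the original post). Because each DataLoader worker is a separate process, the handle opened on first access belongs to that worker alone, so nothing is shared across forks:

```python
import h5py
import torch
from torch.utils.data import Dataset


class LazyH5Dataset(Dataset):
    """Opens the HDF5 file on first __getitem__ call, once per worker."""

    def __init__(self, path, key="data"):  # "data" is a placeholder key
        self.path = path
        self.key = key
        self.dset = None  # filled in lazily, inside the worker process
        # Open briefly here only to record the dataset length.
        with h5py.File(path, "r") as f:
            self.length = f[key].shape[0]

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        if self.dset is None:
            # First access in this process: open a private handle.
            self.dset = h5py.File(self.path, "r")[self.key]
        return torch.from_numpy(self.dset[idx])
```

With this, num_workers > 1 works because no open h5py handle is ever pickled or inherited by the workers.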

Have you found any other solution? Or maybe considered other options for data storage and access for the case of very large datasets?


@Regenerator

Thanks for sharing the solutions. I also followed @Regenerator by moving the h5py opening code into __getitem__. In my experience, it is not much slower than expected, even though the HDF5 file is opened in every iteration. But I hope to find a better solution.


It is not necessary to open the file in each iteration. Instead, you can keep the opened dataset in a class attribute and check at each iteration whether it has already been set.


Can you elaborate on this a bit?
So under __getitem__, make an if statement, something like:

def __getitem__(self, idx):
    if self.f_open == False:
        f = h5py.File("../filename.hdf5", 'r')["delta_HI"]
        self.f_open = True

    f[c['x'][0]:c['x'][1],
      c['y'][0]:c['y'][1],
      c['z'][0]:c['z'][1]]

Would something like this work better?

I have been experiencing really slow read times (when I call data.next() in my training loop) when I use Google Cloud Compute. However, my school’s HPC is 20x faster, and I couldn’t find the reason for this difference.

Any help is appreciated!

EDIT: The code snippet above doesn’t work. Had to change it to:

f = h5py.File("../filename.hdf5", 'r')["delta_HI"]

f[c['x'][0]:c['x'][1],
  c['y'][0]:c['y'][1],
  c['z'][0]:c['z'][1]]

Although immediate differences may not be visible when not reopening files in __getitem__, there is a high probability of getting spurious NaN values in the retrieved content, which will definitely cause problems when using HDF5 datasets with PyTorch. For reference, see this Stack Overflow post. As a side note, I have not tried compiling HDF5 from source.

Although immediate differences may not be visible when not reopening files in __getitem__, there is a high probability of getting spurious NaN values in the retrieved content, which will definitely cause problems when using HDF5 datasets with PyTorch. For reference, see this Stack Overflow post. As a side note, I have not tried compiling HDF5 from source.

Is this still true? How can I check for these subtle problems so that I can rely on the data not being compromised?

HDF5 supports multiple readers already, and with HDF5-1.10, the SWMR (Single Writer Multiple Reader) feature was also introduced.

Once you set your HDF5 file to SWMR mode, however, you can never turn it off again. That means you can never add a new dataset (only extend existing ones).

So I figured out that the part where I use with open inside the __init__ method seems to be the issue; when I don’t do that, it works. Basically, as long as I only open the HDF5 file within the __getitem__ and __len__ methods, it works.