HDF5 Multi-Threaded Alternative

We use HDF5 for our dataset, which consists of the following:
12x94x168 (12-channel image; three RGB images) byte tensor
128x23x41 (metadata input, an additional input to the net) binary tensor
1x20 (target data or “labels”) byte tensor (values really 0-100)

We have lots of data stored as numpy arrays inside HDF5 (2.8 TB), which we then load and convert in a PyTorch Dataset object. The problem we recently ran into is that HDF5 doesn’t support multi-threaded data access with num_workers > 1 in the data loader. Our GPUs are capable of processing these data points at 1 kHz, but this limits us to only 200 Hz. We are open to changing the data format, but need to do it quickly. I know this is an open-ended question, but it would be great if you could suggest some alternative options to speed up training.
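For concreteness, a Dataset over this kind of layout might look roughly like the sketch below. The file path and the dataset names ("images", "meta", "targets") are assumptions, not from the original post; substitute your own layout.

```python
import h5py
import numpy as np
import torch
from torch.utils.data import Dataset


class H5SampleDataset(Dataset):
    """Sketch of a Dataset over the layout described above.

    The path and the dataset keys are hypothetical; adapt them
    to the actual file structure.
    """

    def __init__(self, path):
        self.path = path
        # Open briefly just to read the number of samples.
        with h5py.File(path, "r") as f:
            self.length = f["images"].shape[0]

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        # Opening per item is safe with multiple workers, at the cost
        # of paying the file-open overhead on every access.
        with h5py.File(self.path, "r") as f:
            image = torch.from_numpy(f["images"][idx])    # e.g. 12x94x168 uint8
            meta = torch.from_numpy(f["meta"][idx])       # e.g. 128x23x41
            target = torch.from_numpy(f["targets"][idx])  # e.g. 1x20 uint8
        return image, meta, target
```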


This may help you:

"Concurrent access to one or more HDF5 file(s) from multiple threads in the same process will not work with a non-thread-safe build of the HDF5 library. The pre-built binaries that are available for download are not thread-safe.

Users are often surprised to learn that (1) concurrent access to different datasets in a single HDF5 file and (2) concurrent access to different HDF5 files both require a thread-safe version of the HDF5 library. Although each thread in these examples is accessing different data, the HDF5 library modifies global data structures that are independent of a particular HDF5 dataset or HDF5 file. HDF5 relies on a semaphore around the library API calls in the thread-safe version of the library to protect the data structure from corruption by simultaneous manipulation from different threads. Examples of HDF5 library global data structures that must be protected are the freespace manager and open file lists."

Also:
http://cyrille.rossant.net/moving-away-hdf5/

It appears that you are just reading the HDF5 files, in which case, “QuantScientist (Solomon K)”'s suggestion to use multi-processing rather than multi-threading should be all you need. HDF5 supports multiple readers already, and with HDF5-1.10, the SWMR (Single Writer Multiple Reader) feature was also introduced.

HDF Helpdesk

@hdfhelp

The usage pattern is:

  1. open hdf5
  2. fork processes
  3. read

However, the forked processes will corrupt the open state of the HDF5 file. Is there any way around this, e.g. synchronized file opening? Opening the HDF5 file in the forked processes is not an alternative, as it is too slow (there is a penalty for opening the file).

There may not be a good solution for you. You cannot do multi-threading, it seems, nor can you do multi-process opens, since opening the file is too expensive. It is unclear why the fork-then-use pattern is failing, but it is not surprising that there are issues.

High levels of concurrency are a problem that needs to be addressed soon.

Using Python 3 and adding this at the very top of your main script will help fix the forking issues:

import torch.multiprocessing as mp

if __name__ == "__main__":
    mp.set_start_method("spawn")

Thanks a lot for the help so far. After adding torch.multiprocessing.set_start_method("spawn"), new problems arise:

TypeError: can't pickle File objects

Somehow, file objects or group nodes (pytables reader) cannot be pickled. Is there a way around this?

I struggled with the same issue recently and managed to overcome the problem by moving the hdf5 opening code into Dataset.__getitem__(x) method of my custom class. It works and, according to my experiments (I compared data loading speed with the setup where I have multiple small hdf5 files, representing the objects from the big file), it works slightly faster. More important is the fact that it finally works with num_workers > 1 in the DataLoader.
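A minimal version of that pattern might look like the sketch below (the path and the dataset key are placeholders, not from the original post). Because each DataLoader worker is a separate process, the handle opened on first access belongs to that worker alone, so nothing is shared across forks:

```python
import h5py
import torch
from torch.utils.data import Dataset


class LazyH5Dataset(Dataset):
    """Opens the HDF5 file on first __getitem__ call, once per worker."""

    def __init__(self, path, key="data"):  # "data" is a placeholder key
        self.path = path
        self.key = key
        self.dset = None  # filled in lazily, inside the worker process
        # Open briefly here only to record the dataset length.
        with h5py.File(path, "r") as f:
            self.length = f[key].shape[0]

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        if self.dset is None:
            # First access in this process: open a private handle.
            self.dset = h5py.File(self.path, "r")[self.key]
        return torch.from_numpy(self.dset[idx])
```

With this, num_workers > 1 works because no open h5py handle is ever pickled or inherited by the workers.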

Have you found any other solution? Or maybe considered other options for data storage and access for the case of very large datasets?


@Regenerator

Thanks for sharing the solutions. I also followed @Regenerator by moving the h5py opening code into __getitem__. In my experience, it is not much slower than expected, even though the HDF5 file is opened in every iteration. But I hope to find a better solution.


It is not necessary to open the file in each iteration. Instead, you can keep the opened dataset in a class attribute and check at each iteration whether it has already been set.


Can you elaborate on this a bit?
So under __getitem__, make an if statement, something like:

def __getitem__(self, idx):
    if self.f_open == False:
        f = h5py.File("../filename.hdf5", 'r')["delta_HI"]
        self.f_open = True

    f[c['x'][0]:c['x'][1],
      c['y'][0]:c['y'][1],
      c['z'][0]:c['z'][1]]

Would something like this work better?

I have been experiencing really slow read times (when I call data.next() in my training loop) when I use Google Cloud Compute. However, my school’s HPC is 20x faster, and I couldn’t find the reason for this difference.

Any help is appreciated!

EDIT: The code snippet above doesn’t work. Had to change it to:

f = h5py.File("../filename.hdf5", 'r')["delta_HI"]

f[c['x'][0]:c['x'][1],
  c['y'][0]:c['y'][1],
  c['z'][0]:c['z'][1]]

Although immediate differences may not be visible when not reopening files in __getitem__, there is a high probability of getting spurious NaN values in the retrieved content, which will definitely cause problems when using HDF5 datasets with PyTorch. For reference, see this Stack Overflow post. As a side note, I have not tried compiling HDF5 from source.

Although immediate differences may not be visible when not reopening files in __getitem__, there is a high probability of getting spurious NaN values in the retrieved content, which will definitely cause problems when using HDF5 datasets with PyTorch. For reference, see this Stack Overflow post. As a side note, I have not tried compiling HDF5 from source.

Is this still true? How can I check for these subtle problems so that I can rely on the data not being compromised?

HDF5 supports multiple readers already, and with HDF5-1.10, the SWMR (Single Writer Multiple Reader) feature was also introduced.

Once you set your HDF5 file to SWMR mode, however, you can never turn it off again. That means you can never add a new dataset (only extend existing ones).

So I figured out that the part where I use with open inside the __init__ method seems to be the issue; when I don’t do that, it works. Basically, as long as I only open the HDF5 file within the __getitem__ and __len__ methods, it works.