DataLoader: when num_workers > 0, there is a bug

import h5py
import numpy as np

import torch
from torch.utils.data import Dataset, DataLoader

class H5Dataset(Dataset):
    def __init__(self, h5_path):
        self.h5_path = h5_path
        self.h5_file = h5py.File(h5_path, 'r')
        self.length = len(h5py.File(h5_path, 'r'))
    
    def __getitem__(self, index):
        record = self.h5_file[str(index)]
        return (
            record['data'].value,
            record['target'].value,
        )
        
    def __len__(self):
        return self.length

# --
# Make data

f = h5py.File('test.h5')
for i in range(256):
    f['%s/data' % i] = np.random.uniform(0, 1, (1024, 1024))
    f['%s/target' % i] = np.random.choice(1000)


# Runs correctly
dataloader = torch.utils.data.DataLoader(
    H5Dataset('test.h5'),
    batch_size=32,
    num_workers=0,
    shuffle=True
)

for i,(data,target) in enumerate(dataloader):
    print(data.shape)
    if i > 10:
        break

# Throws error (sometimes, may have to restart python)
dataloader = torch.utils.data.DataLoader(
    H5Dataset('test.h5'),
    batch_size=32,
    num_workers=8,
    shuffle=True
)

for i,(data,target) in enumerate(dataloader):
    print(data.shape)
    if i > 10:
        break

# KeyError: 'Traceback (most recent call last):
#   File "/home/bjohnson/.anaconda/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 55, in _worker_loop
#     samples = collate_fn([dataset[i] for i in batch_indices])
#   File "<stdin>", line 11, in __getitem__
#   File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
#   File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
#   File "/home/bjohnson/.anaconda/lib/python2.7/site-packages/h5py/_hl/group.py", line 167, in __getitem__
#     oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
#   File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
#   File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
#   File "h5py/h5o.pyx", line 190, in h5py.h5o.open
#   KeyError: Unable to open object (bad object header version number)

The nicely formatted code can be found here:


Many people have encountered this error, but there is no solution available. I hope the PyTorch developers can help us solve the problem. Thanks in advance.

Which PyTorch and h5py versions are you using?
Your code runs fine on my machine and returns all 8 batches using both DataLoaders.

Thanks for your reply.
My PyTorch version is '0.4.1.post2' and the h5py version is '2.8.0'.
The error I encounter when num_workers > 1 is:

KeyError: 'Traceback (most recent call last):
  File "/home/bjohnson/.anaconda/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 55, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "<stdin>", line 11, in __getitem__
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/home/bjohnson/.anaconda/lib/python2.7/site-packages/h5py/_hl/group.py", line 167, in __getitem__
    oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 190, in h5py.h5o.open
KeyError: Unable to open object (bad object header version number)

On Stack Overflow, there is an answer to this error:

I encountered the very same issue, and after spending a day trying to marry PyTorch DataParallel loader wrapper with HDF5 via h5py, I discovered that it is crucial to open h5py.File inside the new process, rather than having it opened in the main process and hope it gets inherited by the underlying multiprocessing implementation.
Since PyTorch seems to adopt a lazy way of initializing workers, this means that the actual file opening has to happen inside the __getitem__ function of the Dataset wrapper.

So, I modified my code according to the answer and it runs well:

class H5Dataset(Dataset):  # class header added for completeness; the post showed only the methods
    def __init__(self, hdf5file, imgs_key='images', labels_key='labels',
                 transform=None):
        self.hdf5file = hdf5file
        self.imgs_key = imgs_key
        self.labels_key = labels_key
        self.transform = transform

    def __len__(self):
        # return len(self.db[self.labels_key])
        with h5py.File(self.hdf5file, 'r') as db:
            lens = len(db[self.labels_key])
        return lens

    def __getitem__(self, idx):
        with h5py.File(self.hdf5file, 'r') as db:
            image = db[self.imgs_key][idx]
            label = db[self.labels_key][idx]
        sample = {'images': image, 'labels': label}
        if self.transform:
            sample = self.transform(sample)
        return sample

The with statements are the main modification. In the original code, we do not close the file object returned by h5py.File. After the modification, the file object is closed in the __len__ and __getitem__ methods. So, I want to know: is closing the file object returned by h5py.File necessary?


Hi, I know why you could not reproduce the error. You can reproduce it by commenting out or deleting this piece of code and then running the rest of the script:

dataloader = torch.utils.data.DataLoader(
    H5Dataset('test.h5'),
    batch_size=32,
    num_workers=0,
    shuffle=True
)

for i, (data, target) in enumerate(dataloader):
    print(data.shape)
    if i > 10:
        break

If you do not delete or comment out that piece of code and instead run:

dataloader = torch.utils.data.DataLoader(
    H5Dataset('test.h5'),
    batch_size=32,
    num_workers=0,
    shuffle=True
)

count1=0
for i, (data, target) in enumerate(dataloader):
    # print(data.shape)
    count1+=target
print('count1 is equal to \n{}:'.format(count1))

There is no error reported, but the returned target values are not right. This can be illustrated with the following script:

import h5py
import numpy as np

import torch
from torch.utils.data import Dataset, DataLoader


class H5Dataset(Dataset):
    def __init__(self, h5_path):
        self.h5_path = h5_path
        self.h5_file = h5py.File(h5_path, 'r')
        self.length = len(h5py.File(h5_path, 'r'))

    def __getitem__(self, index):
        record = self.h5_file[str(index)]
        return (
            record['data'].value,
            record['target'].value,
        )

    def __len__(self):
        return self.length


# --
# Make data

# f = h5py.File('test.h5')
# for i in range(256):
#     f['%s/data' % i] = np.random.uniform(0, 1, (1024, 1024))
#     f['%s/target' % i] = np.random.choice(10)

# Runs correctly
dataloader = torch.utils.data.DataLoader(
    H5Dataset('test.h5'),
    batch_size=32,
    num_workers=0,
    shuffle=True
)

count1=0
for i, (data, target) in enumerate(dataloader):
    # print(data.shape)
    count1+=target
print('count1 is equal to \n{}:'.format(count1))
    # if i > 10:
    #     break

# Throws error (sometimes, may have to restart python)
dataloader = torch.utils.data.DataLoader(
    H5Dataset('test.h5'),
    batch_size=32,
    num_workers=24,
    shuffle=True
)

count2=0
for i, (data, target) in enumerate(dataloader):
    # print(data.shape)
    # print(target.shape)
    count2+=target
    # if i > 10:
    #     break
print('count2 is equal to :\n{}'.format(count2))

The output is as follows:

count1 is equal to
tensor([49, 24, 51, 44, 38, 37, 33, 26, 44, 37, 44, 39, 53, 46, 29, 28, 31, 35,
28, 42, 32, 34, 35, 28, 44, 22, 32, 39, 40, 34, 34, 30]):

count2 is equal to :
tensor([ -40011464938561591, 36, -4640496823984442177,
4596629407495505507, 9180213616633249665, -4690268498061449214,
4565213464737575129, -4677703075192396625, 4551168039923175676,
-50017473233541162, 4552094186216834730, 4556169682132194915,
9164131086913356889, -45353263880063689, -4707432270418525107,
-84909258148334466, 4567934797786953787, 4561996622320321486,
4567793662555240134, 9170571099677535639, 4549180305499652268,
9149618509619540126, -38544862363653458, -4641323993829215895,
-33862949871259449, 4560804812378475854, 4573418027288026608,
4559805757876447575, 4564365524252319916, 9136588953091962375,
-4645116466038797201, 9210419057565000560])
In fact, the totals of count1 and count2 should be equal, but the values in count2 are clearly garbage.
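
For what it's worth, a minimal way to check this (assuming the script above has just been run) is to compare the totals; since the 256 items split evenly into batches of 32, the sum over all batches should agree regardless of shuffling:

# The totals over all batches should match across the two runs even though the
# batch order differs; with num_workers=24 the second total comes out as garbage.
print(torch.sum(count1))
print(torch.sum(count2))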

I hope these reproductions are helpful for the developers! Thank you in advance.

Thanks for the debugging and pointing out strange issues.
I can reproduce the issues following your information.
It seems that multiprocessing doesn’t work well with HDF5/h5py.
Here is a similar issue with a link to the known problem.

Using the spawn method doesn’t solve the issue in this case. I’ll dig a bit deeper.
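
For reference, this is a minimal sketch of how the "spawn" start method is enabled (it has to run once, before any DataLoader workers are created); as noted above, it did not fix the error in this case:

import torch.multiprocessing as mp

if __name__ == '__main__':
    # Must be called at most once, before the DataLoader spawns its worker processes.
    mp.set_start_method('spawn')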

import h5py
import numpy as np

import torch
from torch.utils.data import Dataset, DataLoader


class H5Dataset(Dataset):
    def __init__(self, h5_path):
        self.h5_path = h5_path
        # self.h5_file = h5py.File(h5_path, 'r')
        # self.length = len(h5py.File(h5_path, 'r'))

    def __getitem__(self, index):
        with h5py.File( self.h5_path,'r') as record:

            data=record[str(index)]['data'].value
            target=record[str(index)]['target'].value
        return (data,target)

    def __len__(self):
        with h5py.File(self.h5_path,'r') as record:
            return len(record)


# --
# Make data

f = h5py.File('test.h5')
for i in range(256):
    f['%s/data' % i] = np.random.uniform(0, 1, (1024, 1024))
    f['%s/target' % i] = np.random.choice(10)

# Runs correctly
# dataloader = torch.utils.data.DataLoader(
#     H5Dataset('test.h5'),
#     batch_size=32,
#     num_workers=0,
#     shuffle=False
# )
#
# count1=0
# for i, (data, target) in enumerate(dataloader):
#     # print(data.shape)
#     count1+=target
# print('count1 is equal to \n{}:'.format(count1))
# print(torch.sum(count1))
    # if i > 10:
    #     break

# Throws error (sometimes, may have to restart python)
dataloader = torch.utils.data.DataLoader(
    H5Dataset('test.h5'),
    batch_size=32,
    num_workers=24,
    shuffle=False
)

count2=0
for i, (data, target) in enumerate(dataloader):
    # print(data.shape)
    # print(target.shape)
    count2+=target
    # if i > 10:
    #     break
print('count2 is equal to :\n{}'.format(count2))
print(torch.sum(count2))

If we modify the code like this, it works well. But in __getitem__, an h5py.File open is performed for every item, which should be an expensive operation. I hope this so-called right answer helps you debug further. Thank you in advance.
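
As a side note, another way to avoid the per-item open would be to open the file once per worker via the DataLoader's worker_init_fn. This is only a sketch (it is not what this thread ended up using, and torch.utils.data.get_worker_info needs a newer PyTorch than the 0.4.1 mentioned above):

import h5py
import torch
from torch.utils.data import Dataset, DataLoader, get_worker_info

class H5Dataset(Dataset):
    def __init__(self, h5_path):
        self.h5_path = h5_path
        self.h5_file = None  # opened lazily, once per process

    def __getitem__(self, index):
        if self.h5_file is None:  # also covers the num_workers=0 case
            self.h5_file = h5py.File(self.h5_path, 'r')
        record = self.h5_file[str(index)]
        return record['data'][()], record['target'][()]

    def __len__(self):
        with h5py.File(self.h5_path, 'r') as f:
            return len(f)

def worker_init_fn(worker_id):
    # Each worker gets its own copy of the dataset; open its own file handle here.
    dataset = get_worker_info().dataset
    dataset.h5_file = h5py.File(dataset.h5_path, 'r')

loader = DataLoader(H5Dataset('test.h5'), batch_size=32, num_workers=4,
                    shuffle=True, worker_init_fn=worker_init_fn)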

You could open the HDF5 file lazily and keep it open by wrapping it in a generator. Using it together with the “spawn” start method seems to do the trick:

import os
import h5py
import numpy as np

import torch
from torch.utils.data import Dataset, DataLoader


class H5Dataset(Dataset):
    def __init__(self, h5_path):
        self.h5_path = h5_path
        self._h5_gen = None

    def __getitem__(self, index):
        if self._h5_gen is None:
            self._h5_gen = self._get_generator()
            next(self._h5_gen)
        return self._h5_gen.send(index)

    def _get_generator(self):
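        # The `with` block below stays open for the generator's whole lifetime, so each
        # process that first calls __getitem__ (every worker, thanks to the lazy
        # initialization above) holds its own open file handle rather than one
        # inherited from the parent process.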
        with h5py.File( self.h5_path, 'r') as record:
            index = yield
            while True:
                data=record[str(index)]['data'].value
                target=record[str(index)]['target'].value
                index = yield data, target

    def __len__(self):
        with h5py.File(self.h5_path,'r') as record:
            return len(record)


if __name__ == '__main__':
    torch.multiprocessing.set_start_method('spawn')
    # # --
    # Make data if not there

    if not os.path.exists('test.h5'):
        print('making test.h5')
        f = h5py.File('test.h5')
        for i in range(256):
            f['%s/data' % i] = np.random.uniform(0, 1, (1024, 1024))
            f['%s/target' % i] = np.random.choice(10)
        f.close()
        print('done')

    # Runs correctly
    dataloader = torch.utils.data.DataLoader(
        H5Dataset('test.h5'),
        batch_size=32,
        num_workers=0,
        shuffle=True
    )

    count1=0
    for i, (data, target) in enumerate(dataloader):
        # print(data.shape)
        count1+=target
    print('count1 is equal to: \n{}'.format(count1.sum()))
        # if i > 10:
        #     break

    # Also runs correctly
    dataloader = torch.utils.data.DataLoader(
        H5Dataset('test.h5'),
        batch_size=32,
        num_workers=4,
        shuffle=True
    )

    count2=0
    for i, (data, target) in enumerate(dataloader):
        # print(data.shape)
        # print(target.shape)
        count2+=target
        # if i > 10:
        #     break
    print('count2 is equal to: \n{}'.format(count2.sum()))

Does this work in your case?


Hi!

@ptrblck any update on this? Any words of wisdom how to work with HDF5 files efficiently using multiple processes?

Thanks,
Piotr

I hope to see a better solution for working with HDF5 files.

Unfortunately I haven’t looked into this issue further, as the suggested solution from @sytrus-pytorch seems to run correctly. I’m not sure how large the overhead would be to reopen the HDF5 file continuously.
Did you try this approach?
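
A rough way to measure that overhead on the test.h5 file created earlier in this thread (just a sketch; the numbers will depend on the HDF5 version and the disk) would be:

import time
import h5py

n = 256

t0 = time.perf_counter()
for i in range(n):
    with h5py.File('test.h5', 'r') as f:   # reopen the file for every item
        _ = f[str(i)]['target'][()]        # read only the small target, to isolate the open cost
t1 = time.perf_counter()

f = h5py.File('test.h5', 'r')              # open once, reuse the handle
for i in range(n):
    _ = f[str(i)]['target'][()]
f.close()
t2 = time.perf_counter()

print('reopen each time: %.2f ms/item' % ((t1 - t0) / n * 1e3))
print('open once:        %.2f ms/item' % ((t2 - t1) / n * 1e3))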

Yeah, I've explored the topic a bit, and here is what I found:

  • With version 1.8 of the HDF5 library, working with HDF5 files and multiprocessing is a lot messier (not h5py! I mean the HDF5 library installed on your system: https://unix.stackexchange.com/questions/287974/how-to-check-if-hdf5-is-installed). I highly recommend updating the library to version 1.10, where multiprocessing works better. I was only able to get h5py to work with the “with” statement, and this seems to add huge overhead, but I didn't have time to investigate it properly:
class H5Dataset(Dataset):
    def __init__(self, h5_path):
        self.h5_path = h5_path

    def __getitem__(self, index):
        with h5py.File(self.h5_path, 'r') as file:
            # Do something with file and return data

    def __len__(self):
        with h5py.File(self.h5_path,'r') as file:
            return len(file["dataset"])
  • With version 1.10 of the HDF5 library, I was able to create the h5py.File once in __getitem__ and reuse it without errors:
class H5Dataset(Dataset):
    def __init__(self, h5_path):
        self.h5_path = h5_path
        self.file = None

    def __getitem__(self, index):
        if self.file is None:
            self.file = h5py.File(self.h5_path, 'r')
        # Do something with file and return data

    def __len__(self):
        with h5py.File(self.h5_path,'r') as file:
            return len(file["dataset"])

So it seems to be a problem relating to HDF5 in general in combination with multiprocessing rather than with PyTorch itself? If so, it would be interesting to see how this issue evolves when using the C++ frontend.

So I investigated it further, and indeed opening the HDF5 file introduces huge overhead. I've tested it on this code: https://github.com/piojanu/World-Models (my implementation of the World Models (further WM) paper; the memory training is written in PyTorch). Note: the code I link here doesn't have multiprocessing data-preloading capabilities; I test that in a private repo.
I used the Pyflame profiler to profile the WM memory module training for 30 s, sampling every 1 ms, on this hardware: Intel® Core™ i7-7700 CPU @ 3.60 GHz with a GeForce GTX 1060 6GB.

Experiments:

  1. With data loading in the main process (DataLoader's num_workers = 0) and opening the hdf5 file each time in __getitem__:
    • Batches per second: ~0.18
    • Most of the time data is being loaded, above 70% of the profiling time.
    • Opening the hdf5 file takes 20% of the profiling time!
    • Then we have data preprocessing and memory copies in the last 10% of the profiling time.
    • Training a one-layer LSTM on the GPU is so fast that the profiler didn't catch it.
  2. With data loading in the main process (DataLoader's num_workers = 0) and opening the hdf5 file once in __getitem__:
    • Batches per second: ~2
    • Still, most of the time data is being loaded, ~90% of the profiling time.
    • There is no overhead from opening the hdf5 file, of course, which is why a larger proportion of the time went to loading the data.
    • The profiler was able to catch a couple of samples of LSTM training, still below 1% of the profiling time.
  3. With data loading in worker processes (DataLoader's num_workers = 4) and opening the hdf5 file once in __getitem__:
    • Batches per second: ~5.1
    • There is no overhead from opening the hdf5 file, and loading data is successfully overlapped with GPU execution. The DataLoader's __next__ operation (getting the next batch) in the main process takes below 1% of the profiling time and we have full utilisation of the GTX 1060! Win :wink:

My recommendations:

  • Use HDF5 in version 1.10 (better multiprocessing handling).
  • Because an opened HDF5 file isn't pickleable, and the Dataset has to be serialised with pickle to be sent to the workers' processes, you can't open the HDF5 file in __init__. Open it in __getitem__ and store it as a singleton! Do not open it each time, as that introduces huge overhead.
  • Use DataLoader with num_workers > 0 (reading from hdf5, i.e. the hard drive, is slow) and with a batch_sampler (random access to hdf5, i.e. the hard drive, is slow); a sketch of such a sampler follows the sample code below.

Sample code:

class H5Dataset(torch.utils.data.Dataset):
    def __init__(self, path):
        self.file_path = path
        self.dataset = None
        with h5py.File(self.file_path, 'r') as file:
            self.dataset_len = len(file["dataset"])

    def __getitem__(self, index):
        if self.dataset is None:
            self.dataset = h5py.File(self.file_path, 'r')["dataset"]
        return self.dataset[index]

    def __len__(self):
        return self.dataset_len
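
And a minimal sketch of the batch_sampler idea, reusing the H5Dataset class just above: it yields contiguous index blocks and shuffles only the order of the blocks, so each batch reads a sequential slice of the HDF5 file. The sampler class and the 'data.h5' path are just illustrative names, not code from the post above:

import numpy as np
import torch
from torch.utils.data import DataLoader, Sampler

class ContiguousBatchSampler(Sampler):
    """Yield contiguous index blocks, shuffling only the order of the blocks."""
    def __init__(self, dataset_len, batch_size):
        self.dataset_len = dataset_len
        self.batch_size = batch_size

    def __iter__(self):
        starts = np.arange(0, self.dataset_len, self.batch_size)
        np.random.shuffle(starts)
        for start in starts:
            yield list(range(start, min(start + self.batch_size, self.dataset_len)))

    def __len__(self):
        return (self.dataset_len + self.batch_size - 1) // self.batch_size

dataset = H5Dataset('data.h5')  # an HDF5 file with a "dataset" array, as assumed above
loader = DataLoader(dataset, num_workers=4,
                    batch_sampler=ContiguousBatchSampler(len(dataset), batch_size=32))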

thank you mate! A working but cumbersome workaround


Thanks for the detailed profiling and analysis! The code looks great and I think it’ll be useful for a lot of users here in the board. :slight_smile:


A major problem, however, might be that the entire data set is loaded into RAM, which would prevent slicing and effective on-demand loading of data.

Also, is there a reason why my GPU workload (1080 Ti) regularly drops to 0% for a short period of time before going back up to 100% when using 4 workers, whereas with 0 workers I did not observe this problem?

I don't think the entire data set is loaded into RAM. An HDF5 file can hold TBs of data (certainly more than the RAM in your workstation), and this code should still work fine. HDF5 (through the h5py interface in Python) makes sure to keep in RAM only the data that is needed (the currently accessed slice).
Of course, if the batch size and the DataLoader's prefetch queue are too big (or some copy isn't freed/garbage collected), it might not fit into RAM. You can control the batch size via the DataLoader's batch_size parameter.

Is the drop happening before each new epoch? If yes, then I see it too. It is caused by DataLoader initialisation (forking processes/spawning workers, initialising queues, etc.). This overhead isn't observed with num_workers = 0, of course, since no workers need to be spawned.
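
As a side note, newer PyTorch releases (1.7+, so well after this thread) added a persistent_workers flag that keeps the workers alive between epochs and avoids paying this initialisation cost at every epoch; a minimal sketch:

loader = torch.utils.data.DataLoader(
    dataset,                      # any map-style Dataset, e.g. the H5Dataset above
    batch_size=32,
    num_workers=4,
    persistent_workers=True,      # keep workers alive between epochs (PyTorch 1.7+)
)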
