Complete code and bug report for an HDF5 dataset loader. How to fix?

Hello all, I want to report an issue with PyTorch and an HDF5 loader. The full source code and the bug are provided below.
The problem occurs when I run test_dataloader.py in two terminals at the same time. The file loads a custom HDF5 dataset (custom_h5_loader). To generate the h5 files, you first need to run convert_to_h5 to create 100 random h5 files.
To reproduce the error, please follow these steps:

Step 1: Generate the HDF5 files (convert_to_h5.py)

from __future__ import print_function
import h5py
import numpy as np
import random
import os

if not os.path.exists('./data_h5'):
    os.makedirs('./data_h5')

for index in range(100):
    data = np.random.uniform(0, 1, size=(3, 128, 128))
    data = data[None, ...]
    print(data.shape)
    with h5py.File('./data_h5/%d.h5' % index, 'w') as f:
        f['data'] = data
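To sanity-check the generated files, a quick read-back can be sketched as follows. Note that opening with mode 'r' takes only a shared lock, which becomes relevant for the bug below (the file name here is illustrative):

```python
import h5py
import numpy as np

# Write one file the same way as above, then read it back read-only.
# Mode 'r' takes only a shared lock, so any number of processes can
# read the same file at the same time.
data = np.random.uniform(0, 1, size=(3, 128, 128))[None, ...]
with h5py.File('check.h5', 'w') as f:
    f['data'] = data

with h5py.File('check.h5', 'r') as f:
    out = np.asarray(f['data'])

print(out.shape)  # (1, 3, 128, 128)
```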

Step 2: Create a Python file named dataloader.py (this is the module imported by test_dataloader.py below) and paste this code:

import h5py
import torch.utils.data as data
import glob
import torch
import numpy as np
import os
class custom_h5_loader(data.Dataset):

    def __init__(self, root_path):
        self.hdf5_list = [x for x in glob.glob(os.path.join(root_path, '*.h5'))]
        self.data_list = []
        for ind in range(len(self.hdf5_list)):
            # no mode is given, so h5py (before 3.0) opens read/write ('a' mode)
            self.h5_file = h5py.File(self.hdf5_list[ind])
            data_i = self.h5_file.get('data')
            self.data_list.append(data_i)

    def __getitem__(self, index):
        self.data = np.asarray(self.data_list[index])
        return torch.from_numpy(self.data).float()

    def __len__(self):
        return len(self.hdf5_list)

Step 3: Create a Python file named test_dataloader.py:

from dataloader import custom_h5_loader
import torch
import torchvision.datasets as dsets

train_h5_dataset = custom_h5_loader('./data_h5')
h5_loader = torch.utils.data.DataLoader(dataset=train_h5_dataset, batch_size=2, shuffle=True, num_workers=4)      
for epoch in range(100000):
    for i, data in enumerate(h5_loader):       
        print (data.shape)

Step 4: Open the first terminal and run (this works):

python test_dataloader.py

Step 5: Open a second terminal and run the same command (the error is reported below):

python test_dataloader.py

The error is

Traceback (most recent call last):
  File "/home/john/anaconda3/lib/python3.6/site-packages/h5py/_hl/files.py", line 162, in make_fid
    fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 78, in h5py.h5f.open
OSError: Unable to open file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/john/anaconda3/lib/python3.6/site-packages/h5py/_hl/files.py", line 165, in make_fid
    fid = h5f.open(name, h5f.ACC_RDONLY, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 78, in h5py.h5f.open
OSError: Unable to open file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test_dataloader.py", line 5, in <module>
    train_h5_dataset = custom_h5_loader('./data_h5')
  File "/home/john/test_hdf5/dataloader.py", line 13, in __init__
    self.h5_file = h5py.File(self.hdf5_list[ind])
  File "/home/john/anaconda3/lib/python3.6/site-packages/h5py/_hl/files.py", line 312, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
  File "/home/john/anaconda3/lib/python3.6/site-packages/h5py/_hl/files.py", line 167, in make_fid
    fid = h5f.create(name, h5f.ACC_EXCL, fapl=fapl, fcpl=fcpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 98, in h5py.h5f.create
OSError: Unable to create file (unable to open file: name = './data_h5/47.h5', errno = 17, error message = 'File exists', flags = 15, o_flags = c2)
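The three chained exceptions mirror what h5py does for the legacy default mode 'a': it first tries to open the file read/write (ACC_RDWR), then read-only (ACC_RDONLY), and finally tries to create it (ACC_EXCL), which fails with "File exists". The first two attempts fail because the first process already holds an exclusive lock. As a possible workaround (at the cost of losing the lock protection entirely), file locking in recent HDF5 1.10 releases can be disabled via an environment variable; it must be set before libhdf5 is loaded, for example:

```python
import os

# Must be set before h5py (and thus libhdf5) is first imported.
os.environ['HDF5_USE_FILE_LOCKING'] = 'FALSE'

import h5py  # imported after setting the variable on purpose
```

Equivalently, `export HDF5_USE_FILE_LOCKING=FALSE` in the shell before starting Python has the same effect.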

This is my configuration

HDF5 Version: 1.10.2
Configured on: Wed May  9 23:24:59 UTC 2018
Features:
---------
                  Parallel HDF5: no
             High-level library: yes
                   Threadsafety: yes
>>> print(torch.__version__)
1.0.0.dev20181227

How did you get the HDF5 version printed? Then I can check how it looks for me.

Try running h5cc -showconfig from the terminal:

Features:
---------
                  Parallel HDF5: no
             High-level library: yes
                   Threadsafety: yes
            Default API mapping: v18
 With deprecated public symbols: yes
         I/O filters (external): deflate(zlib),szip(encoder)
                            MPE: no
                     Direct VFD: no
                        dmalloc: no
 Packages w/ extra debug output: none
                    API tracing: no
           Using memory checker: no
Memory allocation sanity checks: no
            Metadata trace file: no
         Function stack tracing: no
      Strict file format checks: no
   Optimization instrumentation: no

I think your question is related to the way the HDF5 format is specified. There are multiple threads on this board about DataLoader issues in combination with HDF5.

With your HDF5 configuration, could you reproduce my error?

I don't have time to reproduce your errors right now, but did I understand correctly that you attempted to access the same HDF5 file from two different training sessions?

I am facing similar issues at the moment. One remedy that is supposed to work is enabling SWMR (single-writer/multiple-reader) in recent versions of libhdf5, which is said to resolve these concurrency issues. Honestly, this has left me in doubt about the suitability of this data format for scientific applications altogether.
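For reference, the SWMR pattern mentioned above looks roughly like this in h5py: a single writer keeps the file open and switches to SWMR mode, while readers opt in with swmr=True; libver='latest' is required on both sides (file name and shapes here are illustrative):

```python
import h5py
import numpy as np

# Writer side: create the file with the latest format, add all datasets,
# then switch to SWMR mode. A resizable (chunked) dataset is needed if the
# writer wants to append while readers watch.
writer = h5py.File('swmr_demo.h5', 'w', libver='latest')
dset = writer.create_dataset('data', shape=(1, 4), maxshape=(None, 4))
dset[0] = np.arange(4)
writer.swmr_mode = True
dset.flush()

# Reader side (normally a separate process): opt in with swmr=True while
# the writer still holds the file open.
reader = h5py.File('swmr_demo.h5', 'r', libver='latest', swmr=True)
row = np.asarray(reader['data'][0])

reader.close()
writer.close()
```

Note that SWMR only covers one writer with concurrent readers; it does not allow multiple writers.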

Right, I want to access the same HDF5 data in two training runs. The reason is that I convert my data to HDF5 and run two trainings: one for the baseline and one for my network. They must use the same dataset for a fair comparison. I could duplicate the HDF5 dataset folder, but that takes double the disk space.

Have a look at this discussion.

Of course, I tried all of those solutions, but none of them work. I am looking forward to your reproduction.