Defining an iterator on DataLoader is very slow

Thanks for the code snippet!
I think h5py.File() might just open the file handle, but not actually read the data from your disk; the actual read might only be performed during indexing.

I’ve created this dummy code snippet to play around with it a bit:

import time

import h5py
import numpy as np

# Setup
d1 = np.random.random(size=(1000, 1000, 100))
hf = h5py.File('data.h5', 'w')
hf.create_dataset('dataset_1', data=d1)
hf.close()

# Load
t0 = time.time()
hf = h5py.File('data.h5', 'r')
t1 = time.time()
print('open took {:.3f}ms'.format((t1-t0)*1000))

t0 = time.time()
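# get() should only return the h5py Dataset object, without reading the data yet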
n1 = hf.get('dataset_1')
t1 = time.time()
print('get took {:.3f}ms'.format((t1-t0)*1000))

t0 = time.time()
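# converting to np.array forces the full dataset to be read from disk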
n1 = np.array(n1)
t1 = time.time()
print('reading array took {:.3f}ms'.format((t1-t0)*1000))

t0 = time.time()
data = hf['dataset_1']
t1 = time.time()
print('get via [] took {:.3f}ms'.format((t1-t0)*1000))

nb_iters = 100
t0 = time.time()
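# each data[idx] access should read just that slice from disk on demand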
for idx in np.random.randint(0, 1000, (nb_iters,)):
    x = data[idx]
t1 = time.time()
print('random index takes {:.3f}ms per index'.format((t1 - t0)/nb_iters * 1000))

I’m no expert in using HDF5, but it seems that the indexing takes a lot more time than opening the file and getting the dataset, which might point to lazy loading.
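As a further check (this is my own addition, not part of the snippet above), you could time the same random indexing against an in-memory copy; if the lazy h5py dataset is much slower per index, the gap is most likely the on-demand read from disk:

# Assumes hf and nb_iters from the snippet above are still in scope
in_memory = hf['dataset_1'][:]  # [:] forces a full read into a numpy array

t0 = time.time()
for idx in np.random.randint(0, 1000, (nb_iters,)):
    x = in_memory[idx]
t1 = time.time()
print('random index on in-memory copy takes {:.3f}ms per index'.format((t1 - t0)/nb_iters * 1000))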

This seems to match @rasbt’s post.
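If that’s what is slowing down your DataLoader, the effect would show up in the Dataset’s __getitem__: each sample indexes the lazy h5py dataset and therefore hits the disk once per item. A minimal sketch of what I mean (H5Dataset is just a made-up name for illustration, not your code):

import h5py
import torch
from torch.utils.data import DataLoader, Dataset

class H5Dataset(Dataset):
    def __init__(self, path, key):
        self.file = h5py.File(path, 'r')  # cheap: just opens the handle
        self.data = self.file[key]        # lazy h5py Dataset, nothing read yet

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, index):
        x = self.data[index]              # this indexing reads from disk
        return torch.from_numpy(x)

loader = DataLoader(H5Dataset('data.h5', 'dataset_1'), batch_size=32)

If the data fits into RAM, reading it once in __init__ (e.g. via np.array(self.data)) would trade memory for much faster indexing.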