I have a large HDF5 database, and I have resolved the thread-safety problem by enabling the SWMR (single-writer/multiple-reader) feature of HDF5. However, loading my dataset with multiple workers still does not reach normal speed. Typically, I observe GPU utilization cyclically rising to 100%, then dropping to 1%.
Here is my dataset code (it seems very naive):
How can I solve this problem? What's the best practice for loading large HDF5 datasets in PyTorch? Or should I follow "What's the best way to load large data?" and migrate my data to LMDB?
Very late reply, as I seldom log in here. h5data is initialized only when a worker is initialized, which means this only works if you are actually using a PyTorch DataLoader as shown at the bottom of the example, by passing in "worker_init_fn=worker_init_fn". Some extra work could make this Dataset more flexible.
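For anyone landing here later, a minimal sketch of the pattern described above: the HDF5 file handle is not opened in `__init__` (which runs in the parent process), but lazily in each worker, forced via `worker_init_fn`. The file name "data.h5" and the key "images" are hypothetical placeholders, not from the original post.

```python
class H5Dataset:
    """HDF5-backed dataset that opens its file lazily, once per worker.

    Opening the file in __init__ would create the handle in the parent
    process; workers forked from it would then share that handle, which
    is what causes the stalls described above.
    """

    def __init__(self, path, key):
        self.path = path
        self.key = key
        self.h5file = None  # deliberately not opened here

    def _lazy_init(self):
        if self.h5file is None:
            import h5py  # deferred so only reading processes need it
            self.h5file = h5py.File(self.path, "r", swmr=True)
            self.data = self.h5file[self.key]

    def __len__(self):
        # open briefly just to read the length; no handle is kept
        import h5py
        with h5py.File(self.path, "r") as f:
            return len(f[self.key])

    def __getitem__(self, idx):
        self._lazy_init()  # fallback if worker_init_fn was not used
        return self.data[idx]


def worker_init_fn(worker_id):
    # get_worker_info().dataset is this worker's own copy of the dataset,
    # so calling _lazy_init here opens one file handle per worker
    import torch.utils.data
    torch.utils.data.get_worker_info().dataset._lazy_init()
```

Usage would then look like `DataLoader(H5Dataset("data.h5", "images"), num_workers=4, worker_init_fn=worker_init_fn)`. This is a sketch under those assumptions, not the original poster's code.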