We use HDF5 for our dataset, which consists of the following per sample:
12x94x168 (12-channel image: three stacked RGB images) byte tensor
128x23x41 (metadata input, an additional input to the net) binary tensor
1x20 (target data, or "labels") byte tensor (values actually range 0-100)
We have a large amount of data (2.8 TB) stored as NumPy arrays inside HDF5, which we load and convert inside a PyTorch Dataset object. The problem we recently ran into is that HDF5 doesn't support multi-process data access, so we can't use num_workers > 1 in the DataLoader. Our GPUs can consume these data points at about 1,000 samples/s, but the single-worker loading limits us to roughly 200 samples/s. We are open to changing the data format, but we need to do it quickly. I know this is an open-ended question, but it would be great if you all could suggest some alternative options for speeding up training.
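For context, here is a minimal sketch of the kind of Dataset we are using, with one common workaround applied: opening the HDF5 file lazily in __getitem__ rather than in __init__, so that each DataLoader worker process gets its own file handle after forking. The dataset key names ("images", "metadata", "targets") are placeholders, not our real schema:

```python
import h5py
import numpy as np
import torch
from torch.utils.data import Dataset


class LazyHDF5Dataset(Dataset):
    """Reads samples from an HDF5 file, deferring the h5py.File open
    until the first __getitem__ call in each worker process, so that
    file handles are never shared across processes."""

    def __init__(self, path):
        self.path = path
        self._file = None  # opened lazily, once per worker process
        # Open briefly just to record the dataset length.
        with h5py.File(path, "r") as f:
            self._len = len(f["images"])

    def __len__(self):
        return self._len

    def __getitem__(self, idx):
        if self._file is None:
            # First access in this process: open a private handle.
            self._file = h5py.File(self.path, "r")
        image = torch.from_numpy(self._file["images"][idx])
        meta = torch.from_numpy(self._file["metadata"][idx])
        target = torch.from_numpy(self._file["targets"][idx])
        return image, meta, target
```

With this pattern, a DataLoader constructed with num_workers > 1 should work, since each forked worker opens the file independently, though whether it actually lifts the throughput past ~200 samples/s depends on the storage and chunking.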