What is the best workflow to store large data sets and maximize GPU utilization while training?

Warrior · March 12, 2018, 5:06pm

I am going to generate and store about 4 TB of data (think of them as images) and am new to storage methods for data sets this large. Before I start writing the data to disk, I want to make sure I am not doing something wrong that would force me to regenerate the data, and that the data will be easy to read back when training models. My main requirement is to be able to load multiple batches into memory at the same time and transfer them to the GPU, so that GPU utilization stays near its peak most of the time.
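To make the loading requirement concrete, here is a minimal sketch of the read-ahead pattern described above, using only the Python standard library. The names `prefetch`, `fake_batches`, and `depth` are hypothetical and chosen for illustration; `fake_batches` stands in for whatever actually reads from disk, and the GPU transfer step is left out entirely.

```python
import queue
import threading


def prefetch(batch_iter, depth=4):
    """Yield batches from batch_iter while a background thread reads ahead.

    depth is the maximum number of batches held in memory at once.
    """
    q = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the stream

    def worker():
        try:
            for batch in batch_iter:
                q.put(batch)  # blocks when the queue already holds `depth` batches
        finally:
            q.put(done)  # always signal the end so the consumer does not block forever

    threading.Thread(target=worker, daemon=True).start()

    while True:
        item = q.get()
        if item is done:
            return
        yield item


def fake_batches(n_batches, batch_size):
    """Hypothetical stand-in for reading batches from disk."""
    for i in range(n_batches):
        yield [i * batch_size + j for j in range(batch_size)]


# The training loop would consume batches here while the worker keeps reading.
loaded = list(prefetch(fake_batches(3, 2)))
```

The bounded queue is what keeps several batches ready in memory without letting the reader run arbitrarily far ahead of the consumer. Note that in this sketch an error inside the reader ends the stream early instead of being re-raised in the consumer.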

As far as I know, there are two main approaches to storing data on disk: HDF5 files and Pickle. It seems that there is a 4 GB limit for HDF5 files stored on disk, but I’m not sure whether Pickle has such a limit. I only know that it is very easy to dump/load Pickle files. Can someone also describe the advantages and disadvantages of HDF5 and Pickle for training neural networks, given the efficiency requirement I mentioned above?
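For reference, a minimal sketch of the Pickle dump/load pattern mentioned above, splitting the data into several smaller files so that no single file has to hold everything. The helper names `write_shards` and `read_shards`, the file naming scheme, and the shard size are assumptions made for illustration, and the small lists stand in for real samples.

```python
import pickle
import tempfile
from pathlib import Path


def write_shards(samples, shard_size, out_dir):
    """Pickle samples into files of at most shard_size items each."""
    out_dir = Path(out_dir)
    paths = []
    for start in range(0, len(samples), shard_size):
        path = out_dir / f"shard_{start // shard_size:05d}.pkl"
        with open(path, "wb") as f:
            pickle.dump(samples[start:start + shard_size], f,
                        protocol=pickle.HIGHEST_PROTOCOL)
        paths.append(path)
    return paths


def read_shards(paths):
    """Yield one shard at a time so only one shard is in memory per step."""
    for path in paths:
        with open(path, "rb") as f:
            yield pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    samples = [[i, i + 1] for i in range(10)]  # stand-in for real data
    shard_paths = write_shards(samples, 4, tmp)
    restored = [s for shard in read_shards(shard_paths) for s in shard]
```

One trade-off visible even in this sketch is that `pickle.load` reads a whole file at once, so the shard size determines the smallest unit that can be loaded.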

Also, feel free to mention other tools if you think they are better than HDF5 or Pickle to store such large data sets and maximize GPU utilization.