RocksDB Dataset class for fast I/O (training)

Foivos_Diakogiannis · November 9, 2023, 3:44am

Dear all,

in my research on remote sensing semantic segmentation I developed a RocksDB-based dataset class (inspired by the great work on LMDB that was used in some Caffe2 tutorials) as an alternative to HDF5, and it has served me well: I/O during deep learning training is very fast. This matters because I/O was usually the bottleneck in the DL workflows I develop. In my workflows the data usually have the format input1, input2, …, inputn, labels1, labels2, …, labelsn, where each of these is an image-like numpy array (or a time series of them). That is the intended use, but the RocksDB framework is quite flexible, so there is no need to be constrained by it.
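To make the (inputs, labels) layout concrete, here is a minimal sketch of how such samples could be stored in a key-value store and read back through a map-style dataset. It is not the implementation from the tutorial: the names `make_key`, `write_sample` and `KVDataset` are hypothetical, a plain Python dict stands in for the RocksDB handle, and `pickle` is an assumed serializer (numpy arrays pickle fine, but a real setup may prefer a more compact encoding).

```python
import pickle


def make_key(index: int, name: str) -> bytes:
    """Build a byte key such as b'00000003/inputs' (hypothetical key scheme)."""
    return f"{index:08d}/{name}".encode("ascii")


def write_sample(store, index, inputs, labels):
    """Serialize one sample into the store as two byte values."""
    store[make_key(index, "inputs")] = pickle.dumps(inputs, protocol=pickle.HIGHEST_PROTOCOL)
    store[make_key(index, "labels")] = pickle.dumps(labels, protocol=pickle.HIGHEST_PROTOCOL)


class KVDataset:
    """Map-style dataset over any mapping of bytes -> bytes.

    It only defines __len__ and __getitem__, which is the interface a
    map-style dataset needs; here the store is a dict used as a stand-in
    for a key-value database.
    """

    def __init__(self, store, length):
        self._store = store
        self._length = length

    def __len__(self):
        return self._length

    def __getitem__(self, index):
        if not 0 <= index < self._length:
            raise IndexError(index)
        inputs = pickle.loads(self._store[make_key(index, "inputs")])
        labels = pickle.loads(self._store[make_key(index, "labels")])
        return inputs, labels


# Usage: nested lists stand in for image-like arrays.
store = {}
write_sample(store, 0, [[0.1, 0.2], [0.3, 0.4]], [[0, 1], [1, 0]])
write_sample(store, 1, [[0.5, 0.6], [0.7, 0.8]], [[1, 1], [0, 0]])
dataset = KVDataset(store, length=2)
inputs, labels = dataset[1]
```

The zero-padded index keeps keys in sample order under byte-wise sorting, which is the default ordering in stores like RocksDB and LMDB.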

You can find a short tutorial on the implementation and a basic notebook comparing it with HDF5. I settled on RocksDB because everything I read pointed to it as the fastest option (and indeed it proved to be).

Container environment to test it:
docker pull fdiakogiannis/trchprosthesis_requirements:23.10-py3

I hope this proves helpful to some of you.
Cheers