Memory efficient data streaming for larger-than-memory numpy arrays

The tutorials (such as this one) show how to use torch.utils.data.Dataset to efficiently load large image datasets (lazy loading or data streaming). This is easily applied to images because they usually exist as a folder containing separate files (each sample exists as its own file), and so it’s easy to load just a single image at a time (usually with a csv serving as a manifest that “points” to each image by storing the filename, instead of the pixel data).

However, for other types of data, sometimes we receive a dataset as a gigantic pandas dataframe (maybe stored in an HDF5 file) or as a large numpy .npy file. If these are also large (larger than my memory), how can I use torch.utils.data.Dataset to efficiently stream it?

Does it boil down to stripping out each row (a sample) in the dataframe/matrix and saving that as a separate file, and creating a csv manifest of these files?


You can use a memory-mapped numpy array from your .npy file: https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.memmap.html

That way the data stays on disk, and only the rows you ask for are loaded on the fly.
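A minimal sketch of that idea, assuming the .npy file holds a 2-D array with one sample per row. The class name `NpyMemmapDataset` and the tiny stand-in file are made up for illustration; `np.load(..., mmap_mode="r")` is the NumPy call that opens a .npy file as a memory map.

```python
import os
import tempfile

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class NpyMemmapDataset(Dataset):
    """Hypothetical map-style dataset that reads rows of a .npy file lazily."""

    def __init__(self, path):
        # mmap_mode="r" opens the file read-only without loading it into RAM.
        self.data = np.load(path, mmap_mode="r")

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, idx):
        # np.array(...) copies only this row into memory before conversion.
        return torch.from_numpy(np.array(self.data[idx]))


# Small stand-in file; a real dataset would be far larger than this.
path = os.path.join(tempfile.mkdtemp(), "example.npy")
np.save(path, np.arange(20, dtype=np.float32).reshape(5, 4))

dataset = NpyMemmapDataset(path)
loader = DataLoader(dataset, batch_size=2, shuffle=False)
first_batch = next(iter(loader))  # tensor holding rows 0 and 1
```

Only the rows requested by the DataLoader are read from disk, so the file size is not limited by available RAM.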


Memory-mapped numpy arrays have a size limit of around 2 to 2.5 GB, right? How do I deal with larger files, say 50 GB?

@delta That 2 GB size limit applies only to 32-bit systems. If you have a 64-bit system (most systems are 64-bit nowadays), you don’t have to worry about it.
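If you are not sure which case applies, a quick check like the one below reports the pointer width of the running Python interpreter. This is only a sketch: it tells you about the interpreter build, which is what matters for memory mapping, and the variable names are arbitrary.

```python
import struct
import sys

# Size of a C pointer in bits for this interpreter build (32 or 64).
pointer_bits = struct.calcsize("P") * 8

# sys.maxsize exceeds 2**32 only on 64-bit builds.
is_64bit = sys.maxsize > 2**32

print(f"{pointer_bits}-bit Python, 64-bit: {is_64bit}")
```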

I created a 127,545 x 1,234,620 array with np.memmap. If my math is right (127,545 * 1,234,620 * 4 / 8 / 1,000,000,000), my array should be 78.73 GB at 4 bits per cell. However, when I look at its properties, it is 586 GB. I’m definitely missing something here. Still, I’m just glad it worked and the array has been created. It took about 2 hours to create.
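One possible explanation for the gap, offered as a guess because the post does not say which dtype was used: NumPy dtypes are sized in whole bytes, not bits, so a 4-byte dtype such as float32 or int32 takes 4 bytes per cell. The arithmetic below compares the two readings. It also assumes the file properties dialog reports binary gigabytes (GiB).

```python
rows, cols = 127_545, 1_234_620
cells = rows * cols

# The estimate from the post: 4 bits per cell, in decimal gigabytes.
size_at_4_bits_gb = cells * 4 / 8 / 1e9

# Assumed alternative: 4 bytes per cell, in binary gigabytes (GiB).
size_at_4_bytes_gib = cells * 4 / 2**30

# For comparison: 1 byte per cell (e.g. uint8), in decimal gigabytes.
size_at_1_byte_gb = cells / 1e9

print(f"{size_at_4_bits_gb:.2f} GB at 4 bits per cell")
print(f"{size_at_4_bytes_gib:.2f} GiB at 4 bytes per cell")
print(f"{size_at_1_byte_gb:.2f} GB at 1 byte per cell")
```

Under those assumptions the 4-byte reading lands at roughly 586 GiB, which matches the reported size. For a matrix that only holds 0 and 1, a 1-byte dtype such as uint8 would cut the file to about a quarter of that.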

In any event, I need to populate it now, and I am deathly afraid of how long that will take. Basically, I have a unique integer that represents each row (so 127,545 integers) and a set of integers that represents each column (so 1,234,620 sets of integers). If the row integer is in the column's set of integers, that row/column cell gets a 1, otherwise 0.

Would love some advice on how to do this as efficiently as possible.
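One way to approach this, sketched on toy data: build a dictionary from row integer to row index once, then fill the array a block of columns at a time, setting only the cells that become 1. The names `row_ids`, `column_sets`, and `block`, and the values in them, are placeholders for illustration and not taken from the real dataset. uint8 is assumed as the dtype.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in data; the real problem has 127,545 rows
# and 1,234,620 columns.
row_ids = [10, 20, 30, 40]
column_sets = [{10, 30}, {99}, {20, 40, 10}, set(), {40}]

# Map each row integer to its row index once, so each lookup is O(1).
row_index = {rid: i for i, rid in enumerate(row_ids)}

path = os.path.join(tempfile.mkdtemp(), "membership.dat")
n_rows, n_cols = len(row_ids), len(column_sets)
out = np.memmap(path, dtype=np.uint8, mode="w+", shape=(n_rows, n_cols))

block = 2  # columns per block; pick a size whose buffer fits in RAM
for start in range(0, n_cols, block):
    stop = min(start + block, n_cols)
    # In-memory buffer for this block of columns, initially all zeros.
    buf = np.zeros((n_rows, stop - start), dtype=np.uint8)
    for j, members in enumerate(column_sets[start:stop]):
        # Integers that match no row are skipped.
        hits = [row_index[m] for m in members if m in row_index]
        if hits:
            buf[hits, j] = 1
    out[:, start:stop] = buf  # one write to the memmap per block

out.flush()
result = np.array(out)  # copied into RAM here only because the toy is tiny
```

The work done is proportional to the total number of integers across all column sets, not to the number of cells, and each block reaches the disk in a single assignment instead of one cell at a time.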