How to deal with large files efficiently?

Hi,
I have some problems dealing with tons of time series data.
Data: Each file contains a lot of samples. 2 GB/file * 2000 files = 4 TB.
Device: Available RAM = 60 GB, GPU memory = 20 GB.
**Sampling:** I need to pool as many samples as possible so that they can be shuffled (to approximate IID) and downsampled (to deal with the imbalanced class distribution).

My current strategy:

  1. Use multithreading and a queue to load files into RAM.
  2. When RAM is roughly half occupied, pool the loaded files, do the shuffling and downsampling, and start training.
  3. Continue loading the remaining files until RAM is fully occupied while the GPU is training.

As long as the training time for a chunk of data is longer than the I/O time, opening files shouldn't be a bottleneck in this process (a rough sketch of the loop is below). But is there a more elegant or efficient way to handle this situation? Thanks!
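
Roughly, the loop I have in mind looks like the sketch below (file names, queue size and the pooling threshold are just placeholders; the windowing, downsampling and actual training step are elided):

```python
import queue
import random
import threading

import torch

file_paths = [f"data/part_{i}.pth" for i in range(2000)]  # placeholder names
random.shuffle(file_paths)                                # shuffle at the file level too

buffer = queue.Queue(maxsize=16)  # bounds RAM usage: at most ~16 files in flight

def loader():
    for path in file_paths:
        buffer.put(torch.load(path))  # blocks while the queue is full
    buffer.put(None)                  # sentinel: no more files

threading.Thread(target=loader, daemon=True).start()

pool = []
while True:
    item = buffer.get()
    if item is None:
        break
    pool.append(item)
    if len(pool) >= 8:                        # "RAM roughly half occupied"
        chunk = torch.cat(pool, dim=0)        # pool the loaded files
        chunk = chunk[torch.randperm(chunk.size(0))]  # shuffle pooled samples
        # ... downsample the majority classes here, then run training steps on `chunk`
        pool = []
```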

Instead of loading the data at the beginning, could you work with the file paths and only load the data once you are done with shuffling and downsampling? To deal with downsampling, you could store the data in different directories according to the class and apply your sampling strategy there. To give you some concrete help one would probably need to look at the code, though. Hopefully I could help you anyway :slight_smile:
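
Something along these lines, for example (the directory layout and file pattern are just an assumption to illustrate the idea):

```python
import random
from pathlib import Path

root = Path("data")  # assumed layout: data/class_0/*.pth, data/class_1/*.pth, ...
paths_by_class = {d.name: list(d.glob("*.pth")) for d in root.iterdir() if d.is_dir()}

# Downsample at the path level: keep as many files per class as the rarest class has.
n_keep = min(len(paths) for paths in paths_by_class.values())
selected = [p for paths in paths_by_class.values() for p in random.sample(paths, n_keep)]
random.shuffle(selected)  # shuffle paths instead of the data itself

# Only now touch the disk, one file at a time:
# for path in selected:
#     data = torch.load(path)
#     ...
```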

Thank you for the quick reply. The tricky thing is that my data is time series rather than images. The data in a single file is an extremely long sequence with a shape like (10000, 1), and I create roughly 9900 samples from it with a lookback window of size 100 in an online manner, which means it is impossible to store the data in different directories according to the class beforehand.
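
To make it concrete, the windows are only materialized at access time, roughly like in this sketch (shapes and names are from my toy example, not the real data):

```python
import torch
from torch.utils.data import Dataset

class WindowDataset(Dataset):
    """Serve lookback windows from one long sequence without copying it."""
    def __init__(self, sequence: torch.Tensor, window: int = 100):
        self.sequence = sequence          # e.g. shape (10000, 1)
        self.window = window

    def __len__(self):
        return self.sequence.size(0) - self.window + 1   # ~9900 windows for 10000 steps

    def __getitem__(self, idx):
        # Each window is just a view into the base tensor, so nothing is
        # duplicated on disk or in RAM beyond the original sequence.
        return self.sequence[idx : idx + self.window]

ds = WindowDataset(torch.randn(10000, 1), window=100)
print(len(ds), ds[0].shape)   # 9901 torch.Size([100, 1])
```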

10,000 data points in float32 is 40 kB. How do you get such large memory requirements? Consider that an audio file contains 44.1k samples per second (16-bit, though) and weighs only a few MB for a full song.

Maybe if you tell me more about this I can come up with a more suitable solution :slight_smile:

Sorry for the confusion. I was just trying to give a numerical example.

The real data is an (XXXXX, 200+) ndarray (or tensor) in a single file. The files are created with torch.save, so effectively the “.pth” format. I know there are other options such as feather, HDF5, jay, pickle, parquet, npy or npz, but I find that “.pth” strikes a good balance between I/O speed and space efficiency: highly compressed formats naturally take more time to open, and formats designed for tabular data usually take extra time when converting to NumPy arrays or tensors.
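
For context, the files are simply written and read with torch.save / torch.load, e.g. (the shape and file name here are placeholders):

```python
import torch

# One file holds one long multichannel sequence, roughly 2 GB at float32.
data = torch.randn(2_500_000, 200)      # placeholder for the real (XXXXX, 200+) shape
torch.save(data, "part_0000.pth")       # fast to write, no extra compression step

loaded = torch.load("part_0000.pth")    # comes back as a tensor, no conversion needed
```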

The reason I don’t create samples in an offline manner is that they would take up much more storage space. Each sample has a shape like (1, 512, 200+) (my lookback window length is 512), and since almost every timestep would be duplicated into up to 512 overlapping windows, storing them would take roughly 512 times more disk space, which is unacceptable.

A numerical example of the online method:
Original data in a file: [1, 2, 3, 4, 5] with shape (5, 1). With window_size=3 I get three samples of shape (3, 1): [1, 2, 3], [2, 3, 4], [3, 4, 5]. Since the data is contiguous along the time dimension, the online method is tractable and space friendly.
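
In code, one way to get these windows as views (so nothing is duplicated) is torch.Tensor.unfold, e.g.:

```python
import torch

seq = torch.tensor([[1.], [2.], [3.], [4.], [5.]])   # shape (5, 1)
windows = seq.unfold(0, 3, 1)                        # window size 3, stride 1
print(windows.shape)       # torch.Size([3, 1, 3])
print(windows[:, 0, :])
# tensor([[1., 2., 3.],
#         [2., 3., 4.],
#         [3., 4., 5.]])
```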

Thank you for your patience and attention :grin: