Best way to save and load lots of tensors

wasabi · January 21, 2023, 6:22am

I want to preprocess ImageNet data (I cannot hold everything in memory) and store the results as tensors on disk, so that I can later load them with a single dataloader. I wonder what the best strategy for this is. I have several candidates in mind:

1. Store a batch of processed tensors in one file, say one tensor per class, so that I end up with 1000 tensors. This is ideal in terms of running time, but I don’t know how to load them later with a single dataloader in a good way. I can rewrite DatasetFolder (Torchvision main documentation, pytorch.org), but this only allows me to load a whole tensor as “1” data point, and I don’t know how to sample individual data points in the usual way.
2. Save each processed image as one tensor file. This is the easiest to implement, but calling torch.save() that many times is too slow. Is there any way to optimize this?
3. Save batches of tensors in one file as in (1), but later use TensorDataset to load them individually. I don’t want multiple dataloaders for the downstream tasks though. Is there a workaround?

Thanks!

ptrblck · January 21, 2023, 7:07am

Could you describe your use case a bit more and explain why you want to store all ImageNet images as tensors?
Note that the size would most likely increase (depending on the original image format) while the actual loading time would depend on the bandwidth vs. decoding performance of your system.
E.g. a JPEG image in the shape [800, 800, 3] uses ~107kB while loading and storing the same data as a tensor uses ~1.9MB.
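
The tensor size quoted above can be checked with a quick calculation. This is only a back-of-the-envelope sketch: it assumes the ~1.9MB figure corresponds to one byte per value (uint8), and the float32 line is added for comparison since that is the usual dtype after preprocessing.

```python
# Raw (uncompressed) size of an image stored as a tensor of shape [800, 800, 3]
height, width, channels = 800, 800, 3
num_values = height * width * channels

uint8_bytes = num_values * 1    # one byte per value
float32_bytes = num_values * 4  # four bytes per value

print(f"uint8:   {uint8_bytes / 1e6:.2f} MB")
print(f"float32: {float32_bytes / 1e6:.2f} MB")
```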

wasabi · January 21, 2023, 6:03pm

I’m not storing the actual images. I want to store f(x), where f is a neural net and x is the image, although the dimension of f(x) is still pretty high (nearly 30000). I tried my approach 1, and after processing a very small number of classes the storage already takes several TB, so that’s not doable for me.

I want to store the processed tensors for two reasons: 1. I think the inference f(x) itself may take too much time, and 2. I want to use the processed data with sklearn/xgboost, and I don’t know how they integrate with dataloaders. How about saving all processed tensors into one big csv, would that be more reasonable?

ptrblck · January 21, 2023, 7:18pm

No, I don’t think storing floating point values in csv files would give you a benefit.
Since you want to store the already processed output tensors (I misunderstood the actual use case previously) I guess you might not be able to work around the large storage requirement.
The approach to pick while storing the tensors would depend on the use case you want to apply afterwards. Since you want to use sklearn or xgboost to process the data further, I assume that you would be using numpy arrays and would thus also load the entire dataset at once?
If so, np.savez might work for you.
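
A minimal sketch of the np.savez route, assuming the processed outputs fit in memory when they are written and read. The array names, shapes, and file name are made up for illustration, and random data stands in for f(x).

```python
import os
import tempfile

import numpy as np

# Stand-ins for the processed outputs f(x) and their class labels (hypothetical sizes)
rng = np.random.default_rng(0)
features = rng.standard_normal((200, 32)).astype(np.float32)
labels = rng.integers(0, 10, size=200)

# Store several named arrays in a single .npz archive
path = os.path.join(tempfile.mkdtemp(), "features.npz")
np.savez(path, features=features, labels=labels)

# Load them back as plain numpy arrays
with np.load(path) as data:
    X = data["features"]
    y = data["labels"]
```

X and y are ordinary numpy arrays at this point, so they can be passed to anything that accepts arrays, for example the fit(X, y) method of an sklearn estimator.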

wasabi · January 27, 2023, 2:26am

Thanks! I have tried np.savez and it seems to be the best option. Is there any good way to sample batches from these saved arrays? I also want to use minibatch algorithms on them, but always sampling the same batches would introduce an undesirable inductive bias.

ptrblck · January 27, 2023, 2:29am

To avoid loading the entire numpy array you could use np.memmap which would allow you to load segments from the stored binary file. Is this what you were looking for?
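
A rough sketch of how this could look. Note that np.memmap reads a raw binary file rather than the .npz archive written by np.savez, so the sketch assumes the features are written into such a file with a known dtype and shape. The sizes, file name, and the iter_batches helper are hypothetical.

```python
import os
import tempfile

import numpy as np

n_samples, n_features = 1000, 64  # hypothetical sizes
path = os.path.join(tempfile.mkdtemp(), "features.dat")

# Writing: fill the on-disk array chunk by chunk, never holding all of it in memory
writer = np.memmap(path, dtype=np.float32, mode="w+", shape=(n_samples, n_features))
for start in range(0, n_samples, 100):
    chunk = np.full((100, n_features), start, dtype=np.float32)  # stand-in for f(x)
    writer[start:start + 100] = chunk
writer.flush()
del writer

# Reading: map the file without loading it; only the indexed rows are read
features = np.memmap(path, dtype=np.float32, mode="r", shape=(n_samples, n_features))
rng = np.random.default_rng(0)


def iter_batches(batch_size):
    """Yield (indices, rows) for one epoch, using a fresh random permutation each call."""
    perm = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        idx = np.sort(perm[start:start + batch_size])  # sorted indices read more sequentially
        yield idx, np.asarray(features[idx])


first_idx, first_batch = next(iter_batches(32))
```

Because a new permutation is drawn for every epoch, the batches differ from one pass to the next, which addresses the concern about always sampling the same batches.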

Hadi_Mohseni · March 19, 2024, 6:20pm

Same scenario: I wanted to store a lot of features extracted from wav2vec2 on disk (my dataset is roughly 40GB in size). After trying different approaches, I found the pickle library to be the best one. With it, I can simply dump, store, and append extracted features to the rest.

Here is how I tackled this:

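import pickle

# "ab" opens the file in append mode, so every dump is added at the end
with open("path_to_file", "ab") as f:
    # Store the number of features first; it makes loading easier later
    pickle.dump(number_of_features, f)
    for x in dataloader:
        # detach() is needed before numpy() if the output tracks gradients
        feature = model(x).detach().cpu().numpy()
        pickle.dump(feature, f)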

Afterwards, there is a single file containing all features. I store the number of features to facilitate loading. You can find more about it in:
how-to-get-number-of-objects-in-a-pickle
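
For the loading side, something along these lines should work. It is a sketch that assumes the stored number is the count of pickled arrays that follow it in the file. The file name and array shapes are invented, and a small write step is included only so the example runs on its own.

```python
import os
import pickle
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "features.pkl")

# Write side (same pattern as above): the count first, then one array per dump
batches = [np.full((4, 8), i, dtype=np.float32) for i in range(3)]  # stand-in features
with open(path, "ab") as f:
    pickle.dump(len(batches), f)
    for batch in batches:
        pickle.dump(batch, f)

# Read side: each pickle.load call returns the next object in the file
with open(path, "rb") as f:
    n = pickle.load(f)
    loaded = [pickle.load(f) for _ in range(n)]

all_features = np.concatenate(loaded, axis=0)
```

Keep in mind that a pickle stream can only be read sequentially, so reaching a particular array means loading everything stored before it.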

DentanJeremie · January 3, 2025, 10:18am

Hi, I have been facing a similar situation. The best solution for me was using H5PY, and I advise people in a similar situation to use it.

My constraints:

I need to store a great number of tensors (>100k) on a machine where I have a maximum of 150k inodes, so I should not store one file per tensor.
Great volume of data (several MB per file), for a total of about 1TB. So I need to save tensors on the fly; it is impossible to save all of them at the end.
The tensors will be processed by a data loader afterwards, so I need efficient random access when reading.

Why the solutions above (and another popular solution) did not work:

np.savez: fine for saving with string keys, but you need to load all the tensors at once, so it is incompatible with a data loader afterwards.
np.memmap: managing the offsets is complicated because it does not support string keys.
safetensors: very convenient for the reading phase, but impossible to add tensors on the fly unless you hack the library.

Why H5PY was the best:

Easy to store with string keys and optimized random access.
Single file, so OK for inode constraints.
Supports half precision
Contrary to LMDB, it does not need a libffi installation, which may be complicated on a cluster with limited rights.
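
To make the points above concrete, here is a minimal sketch of the pattern, not the exact code I used. The key names, sizes, and the append_tensor and H5Features helpers are made up for the example. The class only implements __len__ and __getitem__, which is the interface a map-style dataset needs, so it is assumed it can be handed to a data loader.

```python
import os
import tempfile

import h5py
import numpy as np


def append_tensor(path, key, array):
    """Add one array under a string key; mode "a" creates the file or appends to it."""
    with h5py.File(path, "a") as f:
        f.create_dataset(key, data=array.astype(np.float16))  # stored in half precision


class H5Features:
    """Map-style dataset with one HDF5 dataset per sample, looked up by string key."""

    def __init__(self, path):
        self.path = path
        with h5py.File(path, "r") as f:
            self.keys = sorted(f.keys())
        self._file = None  # opened lazily so each worker process gets its own handle

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, idx):
        if self._file is None:
            self._file = h5py.File(self.path, "r")
        return self._file[self.keys[idx]][()]  # reads only this sample from disk


path = os.path.join(tempfile.mkdtemp(), "features.h5")
rng = np.random.default_rng(0)

# Saving on the fly: one call per tensor, all in the same file
for i in range(5):
    append_tensor(path, f"sample_{i:05d}", rng.standard_normal(16, dtype=np.float32))

dataset = H5Features(path)
sample = dataset[3]  # random access by index
```

Reopening the file for every write keeps the sketch simple. For many small writes it is probably better to keep one file handle open for the whole saving loop.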