I want to preprocess ImageNet data (I cannot hold everything in memory) and store the results as tensors on disk. Later I want to load them using a single DataLoader. What is the best strategy for this? I have several candidates in mind:
1. Store a batch of processed tensors in one file, say one tensor per class, so that I end up with 1000 tensors. This is ideal in terms of running time, but I don’t know how to load them later with a single DataLoader in a good way. I could rewrite torchvision's DatasetFolder, but that only lets me load a whole tensor as one data point, and I don’t know how to sample individual data points in the usual way.
2. Save each processed image as its own tensor file. This is the easiest to implement, but calling torch.save() that many times is too slow. Is there any way to optimize this?
3. Save batches of tensors in one file as in (1), but later use TensorDataset to load them individually. I don’t want multiple DataLoaders for the downstream tasks though. Is there a workaround?
Could you describe your use case a bit more and explain why you want to store all ImageNet images as tensors?
Note that the size would most likely increase (depending on the original image format) while the actual loading time would depend on the bandwidth vs. decoding performance of your system.
E.g. a JPEG image in the shape [800, 800, 3] uses ~107kB while loading and storing the same data as a tensor uses ~1.9MB.
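As a rough check of that number, the uncompressed size is just the element count times the bytes per element. The sketch below assumes the tensor keeps the original uint8 pixel values, which is what gives ~1.9MB; a float32 tensor of the same shape would be four times larger.

```python
import numpy as np

# Uncompressed size of an [800, 800, 3] image stored as raw values.
height, width, channels = 800, 800, 3
num_values = height * width * channels  # 1,920,000 values

# Assumption: uint8 keeps the raw pixel values (1 byte each);
# float32 is what you get after a typical conversion to float (4 bytes each).
uint8_bytes = num_values * np.dtype(np.uint8).itemsize
float32_bytes = num_values * np.dtype(np.float32).itemsize

print(f"uint8:   {uint8_bytes / 1e6:.2f} MB")
print(f"float32: {float32_bytes / 1e6:.2f} MB")
```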
I’m not storing the actual images. I want to store f(x), where f is a neural net and x is the image, although the output dimension of f is still pretty high (nearly 30000). I tried my approach 1, and after processing only a small number of classes the storage already takes several TB, so that’s not doable for me.
The reasons I want to store the processed tensors are: 1. I think the inference f(x) itself may take too much time, and 2. I want to use the processed data with sklearn/xgboost, and I don’t know how they work with DataLoaders. How about saving all processed tensors into one big csv? Is that more reasonable?
No, I don’t think storing floating point values in csv files would give you a benefit.
Since you want to store the already processed output tensors (I misunderstood the actual use case previously) I guess you might not be able to work around the large storage requirement.
The approach to pick for storing the tensors would depend on the use case you want to apply afterwards. Since you want to use xgboost to process the data further, I assume that you would be using numpy arrays and would thus also load the entire dataset at once?
np.savez might work for you.
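A minimal sketch of what that could look like. The array names, shapes, and file name here are made up for illustration; in practice `features` would hold the f(x) outputs and `labels` the class indices.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-ins for the processed outputs f(x) and their labels.
rng = np.random.default_rng(0)
features = rng.standard_normal((8, 16)).astype(np.float32)
labels = rng.integers(0, 3, size=8)

# Store both arrays in a single .npz archive under explicit keys.
path = os.path.join(tempfile.mkdtemp(), "features.npz")
np.savez(path, features=features, labels=labels)

# Load them back; each array is read from the archive when its key is accessed.
with np.load(path) as data:
    X = data["features"]
    y = data["labels"]

print(X.shape, y.shape)
```

The loaded `X` and `y` are plain numpy arrays, so they can be passed directly to sklearn or xgboost estimators.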
Thanks! I have tried np.savez and it seems to be the best option. Is there any good way to sample batches from these saved arrays? I also want to run minibatch algorithms on them, but always sampling the same batches would introduce an undesirable bias.
To avoid loading the entire numpy array you could use np.memmap, which would allow you to load segments from the stored binary file. Is this what you were looking for?
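A minimal sketch of that idea, assuming the features are written to a raw binary file (np.memmap maps raw data, not an .npz archive) and that the shape and dtype are known when reopening it. The helper `iter_minibatches`, the sizes, and the file name are hypothetical. Drawing a fresh permutation every epoch means the batches differ from epoch to epoch.

```python
import os
import tempfile

import numpy as np

num_samples, feature_dim = 1000, 32  # illustrative sizes only
path = os.path.join(tempfile.mkdtemp(), "features.dat")

# Write the features chunk by chunk so the full array never sits in memory.
writer = np.memmap(path, dtype=np.float32, mode="w+",
                   shape=(num_samples, feature_dim))
chunk = 250
for start in range(0, num_samples, chunk):
    stop = min(start + chunk, num_samples)
    # Placeholder data: row i is filled with the value i.
    writer[start:stop] = np.arange(start, stop, dtype=np.float32)[:, None]
writer.flush()
del writer

# Reopen read-only; nothing is read from disk until rows are indexed.
features = np.memmap(path, dtype=np.float32, mode="r",
                     shape=(num_samples, feature_dim))


def iter_minibatches(data, batch_size, rng):
    """Yield (indices, batch) pairs covering the data once in random order."""
    order = rng.permutation(len(data))
    for start in range(0, len(order), batch_size):
        # Sorting the indices keeps the disk reads closer to sequential.
        idx = np.sort(order[start:start + batch_size])
        yield idx, np.asarray(data[idx])  # copies only these rows into memory


rng = np.random.default_rng(0)
for epoch in range(2):
    for idx, batch in iter_minibatches(features, 128, rng):
        pass  # e.g. feed `batch` to a partial_fit-style minibatch algorithm
```

Only the rows selected for a batch are copied into memory, so the peak memory use is governed by the batch size rather than the dataset size.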