Fixing serialization error: using NumPy arrays for an image dataset instead of PyTorch tensors?

I've run into a problem trying to run a high volume of images through a pipeline that uses PyTorch and is supposed to run in a distributed way. The pipeline is having trouble serializing and deserializing very large objects: the training dataset (trn_ds in the code) is very large, at 2.5 million images. I process the images in batches from the data loader, but the data loader relies on the training dataset, which holds all of the image tensors in memory. One thing I thought of trying is to switch from .pt (PyTorch) tensors to NumPy arrays, since NumPy arrays are apparently easier to serialize and deserialize. However, I'm having trouble finding documentation on how to build a dataset of NumPy image arrays and then feed it to a data loader. Is this possible?

You can always convert to/from NumPy arrays just before you need torch tensors with the utility functions, e.g. torch.from_numpy (https://pytorch.org/docs/stable/generated/torch.from_numpy.html) and your_tensor.numpy().
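For example, something along these lines (the shapes and class name are just illustrative) keeps the images as a NumPy array and only converts each sample to a torch tensor inside __getitem__:

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class NumpyImageDataset(Dataset):
    """Holds images as a NumPy array; converts one sample at a time to torch."""
    def __init__(self, images, labels):
        self.images = images      # e.g. shape (N, H, W, C), dtype uint8
        self.labels = labels      # e.g. shape (N,), integer class ids

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        # from_numpy shares memory with the array; .float() makes a copy anyway
        img = torch.from_numpy(self.images[idx]).permute(2, 0, 1).float() / 255.0
        return img, int(self.labels[idx])

# Dummy data just to show the wiring
images = np.random.randint(0, 255, size=(100, 32, 32, 3), dtype=np.uint8)
labels = np.random.randint(0, 10, size=100)
loader = DataLoader(NumpyImageDataset(images, labels), batch_size=8, shuffle=True)
xb, yb = next(iter(loader))       # xb: float tensor of shape (8, 3, 32, 32)

The DataLoader's default collate function stacks the per-sample tensors into batches, so nothing else needs to change downstream.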

Oh great! One other question – if I train several models, each on a smaller subset of the data, can I combine those small-subset models into one final model that is, for all intents and purposes, trained on the entire dataset?

You might be able to combine the smaller models into an ensemble, but in general I don’t think there is an easy way to get the benefit of training a single model on the entire dataset (especially with any guarantees on the final model’s performance).

Thank you. I agree it's a poor solution. Anything else you can suggest for training on the 2.5-million-image set? Thanks again for your advice.

In practice I don't think 2.5 million images is unreasonable: many models are trained on ImageNet (1.2 million images) these days, with even larger datasets floating around. However, ImageNet is usually distributed as an archive that gets extracted into many individual image files on the filesystem, so there is never a need to keep the entire dataset in memory at once. Is there any way you could do something similar with your dataset (e.g., convert the images to files on disk ahead of time)?
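As a rough sketch of what that could look like (the directory name, label handling, and transform here are just placeholders; if your files follow torchvision's ImageFolder layout you can use that class directly), each __getitem__ only touches one file, so the full 2.5 million images never sit in memory:

import os
import torch
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms

class DiskImageDataset(Dataset):
    """Reads one image file per sample instead of holding everything in RAM."""
    def __init__(self, image_dir, labels, transform=None):
        self.paths = sorted(
            os.path.join(image_dir, f) for f in os.listdir(image_dir)
        )
        self.labels = labels                      # assumed aligned with sorted paths
        self.transform = transform or transforms.ToTensor()

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        img = Image.open(self.paths[idx]).convert("RGB")
        return self.transform(img), self.labels[idx]

labels = [0] * len(os.listdir("train_images/"))   # placeholder labels, one per file
loader = DataLoader(DiskImageDataset("train_images/", labels),
                    batch_size=64, shuffle=True,
                    num_workers=8)                # workers hide the disk I/O latency

With num_workers > 0 the file reads happen in background processes, which usually keeps the GPU fed even though the images are loaded lazily.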

That’s a good idea. Let me see if I can put together a cache method, perhaps in the custom data loader code.

We're actually going to try DistributedDataParallel along with the caching solution you recommended and see how that works for us. Thanks for the input. It's good to know that 2.5 million images is reasonable.
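For anyone who finds this thread later, here is roughly the skeleton we're planning to try (the model, optimizer, and batch size are placeholders; it assumes launch via torchrun so RANK/WORLD_SIZE/LOCAL_RANK are set in the environment):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def train(trn_ds, model, num_epochs=10):
    dist.init_process_group(backend="nccl")       # env:// rendezvous by default
    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device("cuda", local_rank)

    model = DDP(model.to(device), device_ids=[local_rank])

    sampler = DistributedSampler(trn_ds)          # each rank gets a disjoint shard
    loader = DataLoader(trn_ds, batch_size=64, sampler=sampler, num_workers=4)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)                  # reshuffle shards each epoch
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()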