What is the best way to load a large numpy array as PyTorch dataset?

I want to use PyTorch to train a ConvNet, but my data is a single large NumPy .npy file with a shape like (100000, 5, 200, 200), rather than a folder of traditional image files. The first dimension is the sample size; the next three can be regarded as the channels, width, and height of the images.

Is there any “standard” way to load a large training/test dataset into PyTorch, especially when the input is one large .npy file? And do I need to preprocess the data and save it as another file first? Thank you!
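
To make the question concrete, here is a minimal sketch of the kind of `Dataset` I have in mind, assuming the file is called `data.npy`, the samples should be float32, and labels come from somewhere else (both are my assumptions, not part of the actual setup):

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class NpyDataset(Dataset):
    def __init__(self, npy_path):
        # mmap_mode='r' keeps the array on disk; slices are read lazily
        self.data = np.load(npy_path, mmap_mode='r')

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, idx):
        # Copy the indexed slice so the tensor owns writable memory
        sample = np.array(self.data[idx], dtype=np.float32)
        return torch.from_numpy(sample)

dataset = NpyDataset('data.npy')
loader = DataLoader(dataset, batch_size=32, shuffle=True)
```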

I could use mmap_mode='r' to load it lazily, but if I want to preprocess the data, such as normalization or a training/test split, it seems I would still need to load all of the data into memory. A sketch of the workaround I am considering is below.
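
Here is what I mean, assuming per-channel normalization: the statistics can be accumulated chunk by chunk over the memmap, and the split can be done on indices alone, so the full array never has to fit in memory at once. The chunk size, split ratio, and seed are arbitrary choices of mine:

```python
import numpy as np

data = np.load('data.npy', mmap_mode='r')  # assumed path; shape (100000, 5, 200, 200)
n, c, h, w = data.shape

# Accumulate per-channel sums over chunks instead of loading everything
chunk = 1000
total = np.zeros(c)
total_sq = np.zeros(c)
for start in range(0, n, chunk):
    block = np.asarray(data[start:start + chunk], dtype=np.float64)
    total += block.sum(axis=(0, 2, 3))
    total_sq += (block ** 2).sum(axis=(0, 2, 3))

count = n * h * w  # number of values per channel
mean = total / count
std = np.sqrt(total_sq / count - mean ** 2)

# Split by shuffled indices; no sample data is touched here
rng = np.random.default_rng(0)
perm = rng.permutation(n)
train_idx, test_idx = perm[:int(0.8 * n)], perm[int(0.8 * n):]
# These index arrays could wrap the Dataset from the sketch above, e.g.:
# train_set = torch.utils.data.Subset(dataset, train_idx.tolist())
```

With `mean` and `std` in hand, the normalization `(x - mean[:, None, None]) / std[:, None, None]` could then be applied per sample inside `__getitem__`, so no full pass over the data in memory is ever needed. Is something like this reasonable, or is there a more standard approach?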
