Efficient way to make datasets?

So I’m using a script to turn a directory of images (split across 5 subdirectories) into a single image tensor of size (730, 3, 256, 256) and a label tensor of size (730, 5) for 5 classes, and then using torch.utils.data (TensorDataset plus DataLoader) to build shuffled batches. Each batch is then moved to the GPU individually at every iteration through the dataset during training.
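For concreteness, the pipeline I’m describing is roughly this (a simplified sketch; the variable names and batch size are illustrative, and random tensors stand in for the actual loaded images):

```python
# Sketch of the current approach: the whole dataset is preloaded
# into two in-memory tensors before any batching happens.
import torch
from torch.utils.data import TensorDataset, DataLoader

images = torch.zeros(730, 3, 256, 256)               # stand-in for the loaded image tensor
labels = torch.eye(5)[torch.randint(0, 5, (730,))]   # stand-in one-hot labels, shape (730, 5)

dataset = TensorDataset(images, labels)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for x, y in loader:
    # only the current batch is moved to the GPU
    x, y = x.to(device), y.to(device)
    # ... training step ...
```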

However, this isn’t a tenable practice for a very large dataset. Is there a better way to do this that I’m not seeing in the docs? It seems like there should be a simpler way to read images from disk into shuffled batches rather than having to put the whole thing into two tensors in system memory.

You can use the ImageFolder dataset from torchvision for that. It expects the ImageNet-style layout: one subdirectory per class, with labels derived automatically from the subdirectory names, and it loads images from disk lazily as the DataLoader requests them.

Thanks! That’s pretty handy