Efficient way to make datasets?

So I’m using a script to turn a directory of images (split across 5 subdirectories) into a single image tensor of size (730, 3, 256, 256) and a label tensor of size (730, 5) for 5 classes, and then using torch.utils.data (TensorDataset plus DataLoader) to build shuffled batches. Each batch is then moved to the GPU individually at every iteration through the dataset during training.
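For concreteness, the pipeline I’m describing is roughly this (a simplified sketch; the variable names and batch size are illustrative, and random tensors stand in for the actual loaded images):

```python
# Sketch of the current approach: the whole dataset is preloaded
# into two in-memory tensors before any batching happens.
import torch
from torch.utils.data import TensorDataset, DataLoader

images = torch.zeros(730, 3, 256, 256)               # stand-in for the loaded image tensor
labels = torch.eye(5)[torch.randint(0, 5, (730,))]   # stand-in one-hot labels, shape (730, 5)

dataset = TensorDataset(images, labels)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for x, y in loader:
    # only the current batch is moved to the GPU
    x, y = x.to(device), y.to(device)
    # ... training step ...
```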

However, this isn’t a tenable practice for a very large dataset. Is there a better way to do this that I’m not seeing in the docs? It seems like there should be a simpler way to read images from disk into shuffled batches rather than having to put the whole thing into two tensors in system memory.

You can use the ImageFolder dataset from torchvision for that. It expects the ImageNet-style layout: one subdirectory per class, with labels derived automatically from the subdirectory names, and it loads images from disk lazily as the DataLoader requests them.

Thanks! That’s pretty handy