Working with a dataset that is too large

I have a dataset in the archive data.tar.gz. The archive contains two folders: one with female face images and one with male face images. The uncompressed size of this dataset is 150 GB (the archive itself is 1.9 GB). I work on Google Colab with Google Drive, so I can't store the whole dataset uncompressed. The task is binary classification. My question is: is it possible to use this archive in PyTorch, given that there are two folders of data inside it, and how can I assign a label to the data (one label per folder)?

Based on your description, you could most likely use torchvision.datasets.ImageFolder to lazily load the data.
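A minimal sketch of how that would look, assuming the archive has been extracted to a local ./data directory with subfolders named female and male (the folder names and image size are assumptions, not something stated in your post):

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Assumed layout after extracting data.tar.gz:
#   ./data/female/*.jpg
#   ./data/male/*.jpg
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# ImageFolder assigns one integer label per subfolder (e.g. female -> 0,
# male -> 1) and only opens an image file when the corresponding sample is
# requested, so the full 150 GB never has to be held in memory at once.
dataset = datasets.ImageFolder(root="./data", transform=transform)
print(dataset.classes)        # e.g. ['female', 'male']
print(dataset.class_to_idx)   # e.g. {'female': 0, 'male': 1}

loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=2)
images, labels = next(iter(loader))
print(images.shape, labels.shape)  # torch.Size([32, 3, 224, 224]), torch.Size([32])
```

The labels come for free from the directory structure, so you don't need to build a custom Dataset just to attach one label per folder.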
