I have a dataset of over 12 million RGB images of size 180x180, about 85GB in total. I’m training an image classifier on an AWS p2.8xlarge instance (8 GPUs, 32 vCPUs, 488GB RAM).

I have a custom Dataset object, which is basically an ImageFolder, except that I’m trying to load the images in advance. I.e., I only change lines 41-42 of the ImageFolder source, like this:

path = os.path.join(root, fname)
# decode the image right away instead of storing only its path
img = pil_loader(path)
item = (img, class_to_idx[target])

and then remove the loading (line 122) from the __getitem__ method.

Yet when I run the classifier, memory use is far higher than I expected: after loading roughly 10% of the dataset, it already uses over 100GB of RAM. Any idea what the problem may be, or what I should focus on?

I’m grateful for any ideas. Thank you in advance.

It looks like 10% of your dataset == 1.2M 180x180 RGB images, so 116,640MB is needed even if your images are in uint8 format.
Is the 85GB the total size of the files on disk, or the size in memory?
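
In case it helps, here is a rough sketch of that arithmetic in Python. The function name decoded_bytes and its parameters are made up for illustration. It assumes every image is fully decoded to 180x180x3 values and ignores any per-object overhead, so real usage will be somewhat higher.

```python
def decoded_bytes(num_images, height=180, width=180, channels=3, bytes_per_value=1):
    """Raw size in bytes of num_images decoded images (uint8 by default)."""
    return num_images * height * width * channels * bytes_per_value


# 10% of the dataset: roughly 1.2M images held as uint8.
ten_percent_uint8 = decoded_bytes(1_200_000)
print(f"{ten_percent_uint8 / 1e6:,.0f} MB")
```

The .jpg files are compressed, so their size on disk says little about how much memory the decoded pixels need.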

Thank you.

Yes, 85GB is the total size of the files in storage (as .jpg). The training set is circa 9,800,000 images, held in memory as float32. I did the math, now I see :smiley:
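
For anyone who finds this later, this is roughly the math I did, as a small sketch. The constant names are my own, and I’m treating the 488GB of RAM as decimal gigabytes, which is an assumption.

```python
# Size of one decoded 180x180 RGB image stored as float32 (4 bytes per value).
BYTES_PER_IMAGE_FLOAT32 = 180 * 180 * 3 * 4

TRAIN_IMAGES = 9_800_000
RAM_BYTES = 488 * 10**9  # assumption: 488GB read as decimal gigabytes

# Memory needed to hold the whole training set decoded in RAM.
train_bytes = TRAIN_IMAGES * BYTES_PER_IMAGE_FLOAT32

# Upper bound on how many decoded images fit, ignoring all other memory use.
images_that_fit = RAM_BYTES // BYTES_PER_IMAGE_FLOAT32

print(f"training set as float32: {train_bytes / 1e12:.2f} TB")
print(f"images that fit in RAM:  {images_that_fit:,}")
```

So preloading everything cannot work on this machine. Keeping only the paths in the dataset and decoding inside __getitem__, as the stock ImageFolder does, avoids holding the decoded images in memory.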