Dataset memory conventions

Let me start off this post by saying that I am a newbie with deep learning and PyTorch.

I am trying to train a DenseNet on a dataset of around 2500 images. Currently, I parse them from disk and load them into main memory. I have 64 GB of main memory and 8 GB of GPU memory. Since main memory is cheap, I figured it would be faster to load everything into main memory at construction time rather than perform an SSD access every time __getitem__ is called. However, when I load a single batch, the GPU suddenly allocates around 5-6 GB and I run out of GPU memory. When I instantiate my dataset object, I also notice that main memory usage goes up by around 6-7 GB. So I have the following questions.

  1. What is the common convention when making a dataset based on folders of images? Do you only create the tensor in __getitem__, even if the dataset fits in main memory?

  2. If I have a large 6-7 GB tensor, take a reference to a small subset of it, and call .to("cuda") on that subset, will the entire tensor be copied to the GPU?

EDIT: I have just modified my dataset to load images into memory only when __getitem__ is called, but I still see large GPU memory usage. I stepped through line by line with the debugger, and the memory usage spikes when I call model(x). Is DenseNet just a memory hog once an input has been passed through?

import glob
from os.path import join

import numpy
import torch
from PIL import Image

# GLASS, PAPER, CARDBOARD, PLASTIC, METAL and TRASH are integer class labels defined elsewhere.

class TrashNetDataset(torch.utils.data.Dataset):
    def __init__(self, basedir: str = ""):
        self.basedir = basedir
        # Eagerly load every image into main memory as a normalized numpy array.
        self.glass_list = [(x, GLASS) for x in self.parseImages("glass/*")]
        self.paper_list = [(x, PAPER) for x in self.parseImages("paper/*")]
        self.cardboard_list = [(x, CARDBOARD) for x in self.parseImages("cardboard/*")]
        self.plastic_list = [(x, PLASTIC) for x in self.parseImages("plastic/*")]
        self.metal_list = [(x, METAL) for x in self.parseImages("metal/*")]
        self.trash_list = [(x, TRASH) for x in self.parseImages("trash/*")]

        self.image_list = (self.glass_list + self.paper_list + self.cardboard_list
                           + self.plastic_list + self.metal_list + self.trash_list)

        self.data_len = len(self.image_list)

    def __len__(self):
        return self.data_len

    def __getitem__(self, index):
        image, label = self.image_list[index]
        # Convert the cached numpy array (float64, HWC) to a float32 tensor.
        return torch.from_numpy(image).float(), label

    def parseImages(self, path: str):
        # Read every image matching the glob pattern and scale pixel values to [0, 1].
        return [numpy.asarray(Image.open(x)) / 255.0 for x in glob.glob(join(self.basedir, path))]

  1. It depends on the use case and how the data is stored. Usually I load the data lazily, as that also allows for faster debugging (startup time is lower); see the sketch after this list.

  2. No, the entire tensor should not be copied to the device; calling .to() on a view only transfers that view's elements.
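
Here is a minimal sketch of both points, with some assumptions: the lazy dataset only stores file paths in __init__ and opens the image inside __getitem__ (CLASS_LABELS is a hypothetical mapping from subfolder name to integer label, not something from your code), and the last few lines check that moving a small view of a large CPU tensor to the GPU only allocates memory for that view.

import glob
from os.path import join

import torch
from PIL import Image
from torchvision import transforms

class LazyTrashNetDataset(torch.utils.data.Dataset):
    def __init__(self, basedir: str = ""):
        self.to_tensor = transforms.ToTensor()  # PIL image -> float32 CHW tensor in [0, 1]
        self.samples = []
        # CLASS_LABELS is a placeholder dict, e.g. {"glass": 0, "paper": 1, ...}
        for folder, label in CLASS_LABELS.items():
            self.samples += [(p, label) for p in glob.glob(join(basedir, folder, "*"))]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        path, label = self.samples[index]
        image = Image.open(path).convert("RGB")  # the disk read happens here, one sample at a time
        return self.to_tensor(image), label

# Point 2: only the sliced view is copied to the GPU, not the full tensor.
if torch.cuda.is_available():
    big = torch.randn(4096, 3, 64, 64)   # roughly 192 MB in main memory
    batch = big[:16]                      # a view that shares storage with `big`
    batch_gpu = batch.to("cuda")          # copies only those 16 samples
    print(torch.cuda.memory_allocated() / 1024**2, "MB allocated on the GPU")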

That might be the case. I'm not sure which GPUs are provided in Google Colab, but you might try to check the memory footprint there (if they provide devices with more than 8 GB).
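If you want to measure it directly, here is a rough sketch, assuming a torchvision densenet121 and a made-up batch shape rather than your exact setup:

import torch
import torchvision

device = torch.device("cuda")
model = torchvision.models.densenet121(num_classes=6).to(device)
x = torch.randn(8, 3, 224, 224, device=device)  # assumed batch shape

torch.cuda.reset_peak_memory_stats(device)
out = model(x)  # in training mode the activations are kept alive for the backward pass
print("peak GPU memory: %.1f MB" % (torch.cuda.max_memory_allocated(device) / 1024**2))

Most of that footprint comes from the intermediate activations DenseNet keeps around for the backward pass; if you only need inference, wrapping the call in torch.no_grad() should shrink it considerably.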
