Move whole Image Dataset to Memory (RAM/VRAM) at once

Hello guys! I’m fine-tuning a MobileNet_V3 for my use case, and to do that I’m using 2 datasets to perform a cross-testing validation.

Until now I was using my personal machine for the training/validation/testing process, and to speed it up I used a combination of a data prefetcher and a fast collate function in my DataLoader to read the images from my SSD on the fly.

Long story short, my past setup was:

  • Nvidia 1660 Super 6GB VRAM
  • CPU with 8 Cores
  • 16 GB RAM

And the data loading process consisted of:

  • One big ConcatDataset of ImageFolder datasets (read from the SSD on the fly)
  • DataLoaders with this config:
train_loader = torch.utils.data.DataLoader(train_ds, batch_size=64, shuffle=True, num_workers=6, collate_fn=collate_fn)
(val/test)_loader = torch.utils.data.DataLoader((val/test)_ds, batch_size=1000, shuffle=False, num_workers=6, collate_fn=collate_fn)

# train_ds, val_ds and test_ds are all ConcatDatasets of ImageFolder datasets.
# The fast collate function is responsible for converting the images to tensors.
# The only transformation applied with transforms is a resize to (128, 128); the conversion to tensors and the normalization are done by the collate_fn and the data prefetcher.
  • Wrapping the DataLoader, I used a handmade data prefetcher that I found in a “random” post here, responsible for loading the data onto my GPU quickly.

Obs: I can send the complete code here in the future if it’s necessary; a simplified sketch of the collate_fn + prefetcher is below.
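It is essentially the Apex-style prefetcher that circulates on the forums, so the exact names, the ImageNet normalization stats and the pin_memory=True assumption are illustrative, not my literal code:

import numpy as np
import torch

def fast_collate(batch):
    # batch is a list of (PIL image, target) pairs; images are already resized to (128, 128).
    # Stack everything into one uint8 tensor here and leave the float conversion +
    # normalization to the prefetcher, so the CPU side stays cheap.
    imgs = [sample[0] for sample in batch]
    targets = torch.tensor([sample[1] for sample in batch], dtype=torch.int64)
    w, h = imgs[0].size
    tensor = torch.zeros((len(imgs), 3, h, w), dtype=torch.uint8)
    for i, img in enumerate(imgs):
        array = np.array(img, dtype=np.uint8)
        if array.ndim < 3:
            array = np.expand_dims(array, axis=-1)
        tensor[i] = torch.from_numpy(np.rollaxis(array, 2))
    return tensor, targets

class data_prefetcher:
    def __init__(self, loader, device="cuda"):
        # the DataLoader should be created with pin_memory=True so the non_blocking copies are async
        self.loader = iter(loader)
        self.device = device
        self.stream = torch.cuda.Stream()
        # standard ImageNet stats scaled to the [0, 255] uint8 range
        self.mean = torch.tensor([0.485, 0.456, 0.406], device=device).view(1, 3, 1, 1) * 255
        self.std = torch.tensor([0.229, 0.224, 0.225], device=device).view(1, 3, 1, 1) * 255
        self.preload()

    def preload(self):
        try:
            self.next_input, self.next_target = next(self.loader)
        except StopIteration:
            self.next_input, self.next_target = None, None
            return
        with torch.cuda.stream(self.stream):
            # copy to the GPU and normalize on a side stream while the model is busy training
            self.next_input = self.next_input.to(self.device, non_blocking=True).float()
            self.next_target = self.next_target.to(self.device, non_blocking=True)
            self.next_input = self.next_input.sub_(self.mean).div_(self.std)

    def next(self):
        torch.cuda.current_stream().wait_stream(self.stream)
        batch, target = self.next_input, self.next_target
        self.preload()
        return batch, target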

Long story short again: with this config I was able to get 100% utilization out of my GPU, and the train/val/test process took ~16 hours (the training is repeated 10 times to get a cross-testing average).

Now, moving to my current situation:

Happily, I was blessed with a powerful server to run my experiments, and the setup is:

  • Two A5000 with 24GB VRAM each
  • CPU with 48 Cores
  • 125GB RAM

But, unfortunately, there is a huge problem: the file storage. The SSDs/HDDs there are pretty slow, and they became a huge bottleneck in my training process. Basically, with the same data loading process I mentioned before, I was only able to get ~40% GPU utilization, and the train/val/test process took about the same amount of time.

Talking with colleagues, I was told about the possibility of loading all the images into RAM and using them from there, to avoid reading from the SSD on the fly, and this is the topic of my question. Does anyone know a good way to do it?

I already tried loading all the images into a Python list, using cv2.imread() or PIL.Image.open() to read each image, and with this list (stored in RAM) I created a custom dataset like this one:

from torch.utils.data import Dataset

class memory_dataset(Dataset):
    """Dataset that serves images already decoded and held in RAM."""

    def __init__(self, data, targets, transform):
        self.data = data            # list of decoded images (numpy arrays / PIL images) kept in RAM
        self.targets = targets      # list of integer class labels, aligned with self.data
        self.transform = transform  # applied on the fly in __getitem__

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        img = self.transform(self.data[index])
        target = self.targets[index]
        return img, target

But the process of reading every image with OpenCV or PIL took hours (even with multiprocessing, using all the cores to load different image batches at once), and the performance gain wasn’t great.
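To be concrete, the loading step looked roughly like this (a simplified sketch: the dataset path, the file extension and the folder-name-to-class mapping are placeholders for my real layout, and here I just apply a normal transforms pipeline instead of the collate_fn/prefetcher combo to keep it short):

import multiprocessing as mp
from pathlib import Path

import cv2
from torch.utils.data import DataLoader
from torchvision import transforms

def load_image(path):
    # read with OpenCV, convert BGR -> RGB and resize once up front
    img = cv2.imread(str(path))
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    return cv2.resize(img, (128, 128))

if __name__ == "__main__":
    root = Path("/path/to/dataset")       # placeholder
    paths = sorted(root.rglob("*.jpg"))   # placeholder extension
    classes = sorted({p.parent.name for p in paths})
    targets = [classes.index(p.parent.name) for p in paths]

    # decode every image once, in parallel, and keep the uint8 arrays in RAM
    with mp.Pool(processes=mp.cpu_count()) as pool:
        data = pool.map(load_image, paths)

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    train_ds = memory_dataset(data, targets, transform)
    train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)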

If more information is necessary, feel free to ask; I’ve been stuck on this problem for a couple of days now ahaha.