Summary: I am attempting to load 42 GB of images entirely into RAM before training. While loading, the process occupies more than 256 GB of RAM.
I have training, test, and dev data totaling about 42 GB on disk. The folders are organized in the following layout:
/data
    /train
        /video_name
            -frame0001.png
            .
            .
            -frame0132.png
    /test
        /video_name
    /dev
        /video_name
Each video_name and its associated metadata are stored in three CSV files (train, test, dev).
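The folders to load are looked up from these CSVs along these lines (a minimal sketch; the column name is illustrative, not necessarily what my files use):

    import pandas as pd

    train_meta = pd.read_csv('train.csv')            # one row per video
    video_names = train_meta['video_name'].tolist()  # 'video_name' column assumed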
I have a cluster node with 256 GB of physical RAM. I am trying to load all the frames into RAM before training starts, because I plan to run a large number of epochs and thought I had enough RAM to hold everything.
Here's the skeleton ImageDataset class that I use to load all frames from a given video_name folder:
from skimage import io
import glob
import torch


class ImageDataset(torch.utils.data.Dataset):
    def __init__(self, root_dir: str, file_type: str, transform=None):
        """
        :param root_dir: directory containing the frames of one video
        :param file_type: file extension, e.g. 'png'
        :param transform: optional transform to be applied on a sample (frame)
        """
        super(ImageDataset, self).__init__()
        self.root_dir = root_dir
        self.file_type = file_type
        # identify all frames with the provided file extension
        self.frames = sorted(glob.glob(self.root_dir + '/*' + self.file_type))
        self.transform = transform

    def __len__(self):
        return len(self.frames)

    def __getitem__(self, idx):
        img_name = self.frames[idx]
        # read the frame from disk
        image = io.imread(img_name)
        # scale pixel values to [0, 1]
        image = image / 255.0
        sample = {'image': image}
        if self.transform:
            sample = self.transform(sample)
        return sample
I instantiate this class N times, with N being the number of video folders in the train, test, or dev directory, roughly as sketched below.
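For reference, the per-video datasets are created and preloaded roughly like this (a simplified sketch; the directory walk and the caching loop are illustrative, not my exact code):

    import os
    import torch

    root = '/data/train'  # same pattern for /data/test and /data/dev
    video_dirs = sorted(os.listdir(root))

    # one ImageDataset per video folder, concatenated into a single dataset
    datasets = [ImageDataset(os.path.join(root, name), 'png') for name in video_dirs]
    full_dataset = torch.utils.data.ConcatDataset(datasets)

    # preload every sample into RAM before training starts
    cached_samples = [full_dataset[i] for i in range(len(full_dataset))]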
My understanding was that this should take about 42 GB of RAM, the size of the data on disk. However, as the data is loaded, memory usage grows past 256 GB and the python3 process is killed by Linux (presumably the OOM killer). Monitoring with the top command, I could see memory usage climbing steadily until it exceeded 256 GB. This happens before any training occurs.
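One sanity check I can run is comparing a single frame's size on disk with its in-memory size after decoding and after the scaling step (a quick sketch; the path is a placeholder):

    import os
    from skimage import io

    frame_path = '/data/train/video_name/frame0001.png'  # placeholder path
    raw = io.imread(frame_path)   # decoded numpy array
    scaled = raw / 255.0          # same scaling as in __getitem__

    print('on disk :', os.path.getsize(frame_path), 'bytes')
    print('decoded :', raw.dtype, raw.nbytes, 'bytes')
    print('scaled  :', scaled.dtype, scaled.nbytes, 'bytes')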
Any ideas on how to debug this are appreciated.