My model trains with input/target data stored in NumPy arrays. I am working on scaling up to larger training sets and have read that a memory-mapped NumPy array allows working with arrays larger than available RAM. My end goal is to move this into a TPU environment, where I will have even less RAM per core than I have on my local machine's CPU.
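For reference, this is the kind of memory-mapped access I have in mind; a minimal sketch, with the file name as a placeholder:

```python
import numpy as np

# Open the .npy file as a read-only memory map: pages are faulted in on
# access instead of the whole array being read into RAM up front.
arr = np.load('x_data.npy', mmap_mode='r')

# Slicing touches only the needed pages; np.array() makes an in-memory copy
# of just that slice.
row = np.array(arr[0])
```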
What I'm finding in my initial testing is that each DataLoader worker process seems to end up with a RAM footprint the full size of the training dataset (~7 GB). I thought this might be due to an inefficient shuffling approach (forcing all pages to be accessed, or something similar), but I see the same behavior with shuffle turned off in the DataLoader. (Note: there are about 28,500 samples/rows in the NumPy arrays.)
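The footprint I'm describing is the per-process memory I see while the loaders run; one way to log it from inside the workers would be something like the sketch below (psutil and the helper name are illustrative, not part of my setup):

```python
import os
import psutil

def log_rss(tag=""):
    # Resident set size of the current process, in MB. Note that RSS also
    # counts file-backed pages from the memory map, not just private memory.
    rss_mb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2
    print(f"[pid {os.getpid()}] {tag} RSS: {rss_mb:.0f} MB")
```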
Would appreciate any pointers on where I went wrong with the setup below, thanks. If the answer differs between this local/CPU setup and what would be needed for multi-processing on a Colab/GCP TPU setup, please let me know. I'm also open to other approaches besides memory-mapped arrays; I tried WebDataset, but the tar file sizes become unworkable fast.
```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader, random_split

class El80Dataset(Dataset):
    def __init__(self, x_data_path, y_data_path):
        # define constant(s) for raw-to-model transforms
        self.my_mult = 1.17
        # the memory-mapped arrays act like regular numpy arrays
        self.x_data = np.load(x_data_path, mmap_mode='r+')
        # loading the labels
        self.y_data = np.load(y_data_path, mmap_mode='r+')

    def __len__(self):
        # number of samples (rows), not the full shape tuple
        return self.x_data.shape[0]

    def __getitem__(self, idx):
        x_samp = self.my_mult * torch.tensor(self.x_data[idx, None, :])
        y_samp = self.my_mult * torch.tensor(self.y_data[idx])
        # log transform data
        x_samp = torch.log(x_samp + 1)
        y_samp = torch.log(y_samp + 1)
        sample = (x_samp, y_samp)
        return sample
```
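A single-sample fetch against this dataset looks like the following; just a usage sketch, with the index chosen arbitrarily:

```python
ds = El80Dataset(x_data_path, y_data_path)
x0, y0 = ds[0]            # one (input, target) pair, log-transformed
print(len(ds), x0.shape)  # ~28,500 samples; x has a singleton channel dim
```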
```python
# Create train and test data loaders with memory-mapped npy files
x_data_path = datapath / 'training/x_data.npy'
y_data_path = datapath / 'training/y_data.npy'
spec_ds = El80Dataset(x_data_path, y_data_path)

# batch sizes
train_bs = 100
test_bs = 200

# create random split for training and validation
train_len = int(0.8 * len(spec_ds))
test_len = len(spec_ds) - train_len
train_ds, test_ds = random_split(spec_ds, [train_len, test_len])

train_dl = DataLoader(train_ds, batch_size=train_bs, shuffle=True, num_workers=5)
test_dl = DataLoader(test_ds, batch_size=test_bs, num_workers=5)

spec_ds = None
```
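For completeness, a single batch can be pulled from the loaders to sanity-check shapes; a minimal sketch:

```python
xb, yb = next(iter(train_dl))  # spins up the worker processes
print(xb.shape, yb.shape)      # xb: (train_bs, 1, <feature dim>)
```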