My model trains with input/target data stored in NumPy arrays. I am working on scaling up to larger training sets and have read that a memory-mapped NumPy array allows working with arrays larger than available RAM. My end goal is to move this into a TPU environment, where I will have even less RAM per core than I have on my local machine's CPU.
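For reference, this is the kind of memory-mapped access I have in mind; a minimal sketch, with the file name as a placeholder:

```python
import numpy as np

# Open the .npy file as a read-only memory map: pages are faulted in on
# access instead of the whole array being read into RAM up front.
arr = np.load('x_data.npy', mmap_mode='r')

# Slicing touches only the needed pages; np.array() makes an in-memory copy
# of just that slice.
row = np.array(arr[0])
```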
What I'm finding in my initial testing is that each DataLoader worker process seems to end up with a RAM footprint the full size of the training dataset (~7 GB). I thought this might be due to an inefficient shuffling approach (forcing all pages to be accessed, or something similar), but I see the same behavior with shuffle turned off in the DataLoader. (Note: there are about 28,500 samples/rows in the NumPy arrays.)
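The footprint I'm describing is the per-process memory I see while the loaders run; one way to log it from inside the workers would be something like the sketch below (psutil and the helper name are illustrative, not part of my setup):

```python
import os
import psutil

def log_rss(tag=""):
    # Resident set size of the current process, in MB. Note that RSS also
    # counts file-backed pages from the memory map, not just private memory.
    rss_mb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2
    print(f"[pid {os.getpid()}] {tag} RSS: {rss_mb:.0f} MB")
```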
Would appreciate any pointers on where I went wrong with the setup below, thanks. If the answer differs between this local/CPU setup and what would be needed for multi-processing on a Colab/GCP TPU setup, please let me know. I'm also open to other approaches besides memory-mapped arrays; I tried WebDataset, but the tar file sizes become unworkable fast.
```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader, random_split

class El80Dataset(Dataset):
    def __init__(self, x_data_path, y_data_path):
        # define constant(s) for raw-to-model transforms
        self.my_mult = 1.17
        # the memory-mapped arrays act like regular numpy arrays
        self.x_data = np.load(x_data_path, mmap_mode='r+')
        # loading the labels
        self.y_data = np.load(y_data_path, mmap_mode='r+')

    def __len__(self):
        # number of samples (rows), not the full shape tuple
        return self.x_data.shape[0]

    def __getitem__(self, idx):
        x_samp = self.my_mult * torch.tensor(self.x_data[idx, None, :])
        y_samp = self.my_mult * torch.tensor(self.y_data[idx])
        # log transform data
        x_samp = torch.log(x_samp + 1)
        y_samp = torch.log(y_samp + 1)
        sample = (x_samp, y_samp)
        return sample
```
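A single-sample fetch against this dataset looks like the following; just a usage sketch, with the index chosen arbitrarily:

```python
ds = El80Dataset(x_data_path, y_data_path)
x0, y0 = ds[0]            # one (input, target) pair, log-transformed
print(len(ds), x0.shape)  # ~28,500 samples; x has a singleton channel dim
```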
```python
# Create train and test data loaders with memory-mapped npy files
x_data_path = datapath / 'training/x_data.npy'
y_data_path = datapath / 'training/y_data.npy'
spec_ds = El80Dataset(x_data_path, y_data_path)

# batch sizes
train_bs = 100
test_bs = 200

# create random split for training and validation
train_len = int(0.8 * len(spec_ds))
test_len = len(spec_ds) - train_len
train_ds, test_ds = random_split(spec_ds, [train_len, test_len])

train_dl = DataLoader(train_ds, batch_size=train_bs, shuffle=True, num_workers=5)
test_dl = DataLoader(test_ds, batch_size=test_bs, num_workers=5)

spec_ds = None
```
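For completeness, a single batch can be pulled from the loaders to sanity-check shapes; a minimal sketch:

```python
xb, yb = next(iter(train_dl))  # spins up the worker processes
print(xb.shape, yb.shape)      # xb: (train_bs, 1, <feature dim>)
```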