DataLoader worker RAM at full dataset size with memory-mapped .npy

My model trains on input/target data stored in NumPy arrays. I am scaling up to larger training sets and have read that a memory-mapped NumPy array allows working with arrays larger than available RAM. My end goal is to move this into a TPU environment, where I will have less RAM per core than I have even on my local machine.
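For context, this is the pattern I understand memory mapping to enable: write the array to disk once, then reopen it read-only so that only the pages actually accessed get loaded. A minimal sketch (the `demo.npy` path is just for illustration):

```python
import numpy as np
import os
import tempfile

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "demo.npy")

# open_memmap writes a valid .npy header and maps the data region,
# so the array can be filled without ever materializing it in RAM
arr = np.lib.format.open_memmap(path, mode="w+", dtype=np.float32, shape=(1000, 64))
arr[:] = 1.0
arr.flush()
del arr

# Reopen read-only; pages should be loaded lazily on access
mm = np.load(path, mmap_mode="r")
print(type(mm).__name__)  # memmap
```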

What I’m finding in initial testing is that each DataLoader worker process ends up with a RAM footprint the size of the full training dataset (~7 GB). I thought this might be due to an inefficient shuffling approach (forcing every page to be touched, or something similar), but I see the same behavior with shuffling turned off in the DataLoader. (Note there are about 28,500 samples/rows in the NumPy arrays.)
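My expectation, which doesn’t match what I observe, was that indexing a single row of a memmap only copies that row’s worth of data into RAM. A small sanity check of that expectation:

```python
import numpy as np
import os
import tempfile

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "x.npy")
np.save(path, np.arange(20, dtype=np.float32).reshape(4, 5))

mm = np.load(path, mmap_mode="r")
row = np.array(mm[1])  # explicit copy of one row into RAM

print(isinstance(mm, np.memmap))  # True
print(row.nbytes)                 # 20 bytes: one 5-element float32 row, not the whole file
```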

Would appreciate any pointers on where I went wrong with the setup below, thanks. And if the answer differs between this local/CPU setup and what would be needed for multi-process loading on a Colab/GCP TPU setup, please let me know. I’m also open to approaches other than memory-mapped arrays; I tried WebDataset, but the tar file size becomes unworkable fast.

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader, random_split

class El80Dataset(Dataset):
    def __init__(self, x_data_path, y_data_path):
        # constant(s) for raw-to-model transforms
        self.my_mult = 1.17
        # memory-mapped arrays act like numpy arrays; 'r' since we only read
        self.x_data = np.load(x_data_path, mmap_mode='r')
        # the labels, also memory-mapped
        self.y_data = np.load(y_data_path, mmap_mode='r')

    def __len__(self):
        return self.x_data.shape[0]

    def __getitem__(self, idx):
        # torch.tensor() copies the indexed row out of the memmap
        x_samp = self.my_mult * torch.tensor(self.x_data[idx, None, :])
        y_samp = self.my_mult * torch.tensor(self.y_data[idx])
        # log-transform the data
        x_samp = torch.log(x_samp + 1)
        y_samp = torch.log(y_samp + 1)
        return (x_samp, y_samp)
# Create train and test data loaders from the memory-mapped .npy files
x_data_path = datapath / 'training/x_data.npy'
y_data_path = datapath / 'training/y_data.npy'
spec_ds = El80Dataset(x_data_path, y_data_path)

# batch sizes
train_bs = 100
test_bs = 200

# random split into training and validation sets
train_len = int(0.8 * len(spec_ds))
test_len = len(spec_ds) - train_len
train_ds, test_ds = random_split(spec_ds, [train_len, test_len])
train_dl = DataLoader(train_ds, batch_size=train_bs, shuffle=True, num_workers=5)
test_dl = DataLoader(test_ds, batch_size=test_bs, num_workers=5)
spec_ds = None  # note: train_ds/test_ds still hold a reference to the dataset
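For reference, this is roughly how I’m checking the per-worker footprint. `log_worker_rss` is a name I made up; it uses the stdlib `resource` module (Unix-only, and `ru_maxrss` is reported in KiB on Linux). Calling it from inside `__getitem__` every few thousand samples is what shows the growth; passing it as `worker_init_fn` only captures the footprint at worker startup:

```python
import resource

def log_worker_rss(worker_id):
    # On Linux, ru_maxrss is the process's peak resident set size in KiB
    rss_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"worker {worker_id}: peak RSS {rss_kib / 1024:.0f} MiB")

# Hypothetical hookup to the loaders above:
# train_dl = DataLoader(train_ds, batch_size=train_bs, shuffle=True,
#                       num_workers=5, worker_init_fn=log_worker_rss)
```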