DataLoader worker RAM at full dataset size with memory-mapped .npy

My model trains with input/target data stored in numpy arrays. I am working on scaling up to larger training sets and have read that a memory-mapped numpy array allows working with arrays larger than available RAM. My end goal is to move this to a TPU environment, where I will have even less RAM available per core than I have on my local machine.
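(For reference, the pattern I'm relying on is roughly the sketch below: write the array into a .npy on disk in chunks, then open it memory-mapped so only the pages that are actually indexed get read. The shapes, chunk size, and file names are placeholders, not my real data.)

import numpy as np

# Placeholder shape/chunking for illustration only
n_rows, n_cols, chunk = 28_500, 8_000, 1_000

# Write directly into a memory-mapped .npy so the full array never has to sit in RAM
x_out = np.lib.format.open_memmap('x_data.npy', mode='w+',
                                  dtype=np.float32, shape=(n_rows, n_cols))
for start in range(0, n_rows, chunk):
    x_out[start:start + chunk] = np.random.rand(min(chunk, n_rows - start), n_cols)
x_out.flush()

# Reopen memory-mapped; this returns a numpy.memmap, not an in-RAM array
x_in = np.load('x_data.npy', mmap_mode='r')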

What I'm finding in my initial testing is that each DataLoader worker process seems to end up with a RAM footprint that is the full size of the training dataset (~7 GB). I thought this might be due to an inefficient shuffling approach (requiring all pages to be accessed, or something along those lines), but I see the same behavior with shuffle turned off in the DataLoader. (Note: there are about 28,500 samples/rows in the numpy arrays.)
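(For what it's worth, a minimal way to check the per-worker footprint is something like the sketch below; psutil and RSS are my choices for the measurement, and the setup code further down doesn't include this hook.)

import os
import psutil  # assumed installed; used only to measure per-process memory

def log_worker_rss(worker_id):
    # Resident set size of the current DataLoader worker process, in GB
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1e9
    print(f"worker {worker_id}: RSS = {rss_gb:.2f} GB")

# Pass as worker_init_fn= to the DataLoader so each worker reports its RSS at startup;
# watching the workers in top/htop during iteration shows the growth as pages are touched.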

Would appreciate any pointers on where I went wrong with the setup below, thanks. And if the answer differs between this local/CPU setup and what would be needed for multi-processing on a Colab/GCP TPU setup, please let me know. I'm also open to other approaches besides memory-mapped arrays; I tried webdataset, but the tar file size gets unworkable fast.

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader, random_split


class El80Dataset(Dataset):
    def __init__(self, x_data_path, y_data_path):
        # define constant(s) for raw-to-model transforms
        self.my_mult = 1.17
        # np.load with mmap_mode returns a numpy.memmap, which acts like a numpy array
        self.x_data = np.load(x_data_path, mmap_mode='r+')
        # loading the labels (also memory-mapped)
        self.y_data = np.load(y_data_path, mmap_mode='r+')

    def __len__(self):
        return self.x_data.shape[0]

    def __getitem__(self, idx):
        x_samp = self.my_mult * torch.tensor(self.x_data[idx, None, :])
        y_samp = self.my_mult * torch.tensor(self.y_data[idx])
        # log-transform the data
        x_samp = torch.log(x_samp + 1)
        y_samp = torch.log(y_samp + 1)
        sample = (x_samp, y_samp)
        return sample
# Create train and test data loaders with memory-mapped .npy files
x_data_path = datapath / 'training/x_data.npy'
y_data_path = datapath / 'training/y_data.npy'
spec_ds = El80Dataset(x_data_path, y_data_path)

# batch sizes
train_bs = 100
test_bs = 200

# create random split for training and validation
train_len = int(0.8 * len(spec_ds))
test_len = len(spec_ds) - train_len
train_ds, test_ds = random_split(spec_ds, [train_len, test_len])
train_dl = DataLoader(train_ds, batch_size=train_bs, shuffle=True, num_workers=5)
test_dl = DataLoader(test_ds, batch_size=test_bs, num_workers=5)
spec_ds = None

I have exactly the same issue.

Ok, I found the issue and solution. Basically, when the DataLoader samples elements from the dataset, it samples without replacement (because when you train for a full epoch you don't want to take the same element twice, for example; you want to pass through the entirety of your data).
In the sampler code used by the DataLoader there is:

if self.replacement:
    for _ in range(self.num_samples // 32):
        yield from torch.randint(high=n, size=(32,), dtype=torch.int64, generator=generator).tolist()
    yield from torch.randint(high=n, size=(self.num_samples % 32,), dtype=torch.int64, generator=generator).tolist()
else:
    for _ in range(self.num_samples // n):
        yield from torch.randperm(n, generator=generator).tolist()
    yield from torch.randperm(n, generator=generator).tolist()[:self.num_samples % n]

The issue is that you have too many elements to shuffle, so there is a memory explosion.
What you need to do is this:

import torch
from torch.utils.data import BatchSampler, DataLoader, RandomSampler

rsampler = RandomSampler(dataset, replacement=True, generator=torch.Generator())
batch_sampler = BatchSampler(rsampler, batch_size=4, drop_last=True)
dataloader = DataLoader(dataset, num_workers=0, batch_sampler=batch_sampler)
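Adapted to the setup from the question above, a rough sketch might look like this (the batch size and the train_ds split are just carried over from that code; num_workers and whether drop_last is acceptable are things you'd want to tune yourself):

import torch
from torch.utils.data import BatchSampler, DataLoader, RandomSampler

# Sampling indices with replacement avoids building a full permutation of the dataset
train_sampler = RandomSampler(train_ds, replacement=True, generator=torch.Generator())
train_batches = BatchSampler(train_sampler, batch_size=100, drop_last=True)
train_dl = DataLoader(train_ds, num_workers=5, batch_sampler=train_batches)

for x_batch, y_batch in train_dl:
    pass  # training step goes here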