NumPy memmap throttles with DataLoader when available RAM is less than file size

I’m working on a dataset that is too big to fit into RAM. The solution I’m trying currently is to use numpy memmap to load one sample/row at a time using Dataloader. The solution looks something like this:

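import numpy as np
import torch
from torch.utils.data import DataLoader


class MMDataset(torch.utils.data.Dataset):
    def __init__(self, path):
        self.file_path = path
        self.dataset_len = 44000000
        self.bytes_per_value = 32 // 8  # float32 = 4 bytes (integer, so offsets stay ints)
        self.num_cols = 512
        self.num_rows = 1

    def __getitem__(self, index):
        # Map a single row: the byte offset selects row `index` in the file.
        x = np.memmap(self.file_path, dtype='float32', mode='r',
                      shape=(self.num_rows, self.num_cols),
                      offset=index * self.num_cols * self.bytes_per_value)
        # Copy the row out of the mapping into a regular in-memory array.
        return np.array(x)

    def __len__(self):
        return self.dataset_len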



dataset = MMDataset('./data/emb.memmap')

data_loader = DataLoader(
    dataset,
    batch_size=4096,
    shuffle=True,
    num_workers=20
)

When the amount of RAM available is greater than the size of the memmap file, the data loading is fast. I get around 60 batches/second. However, when the RAM available is less than the size of the memmap file, I get around 3 batches/second.

I discovered this when trying various sizes for the memmap file.

Why is this the case? If DataLoader + memmap is going to throttle when available RAM < memmap file size, this defeats the point of the solution.

I’ve observed that disk I/O is constantly at 500MB/s read when available RAM < memmap file size. This is much higher than the theoretical amount of reading required to load batches of 4096 samples (about 8MB per batch, so roughly 24MB/s at 3 batches/second).
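For reference, the figures in this post can be checked with a little arithmetic. The sketch below uses only the sizes from the dataset definition and the reported throughput. The 4096-byte page size is an assumption about a typical Linux system, not something measured here.

```python
# Sizes taken from the MMDataset definition and DataLoader call above.
NUM_ROWS = 44_000_000
NUM_COLS = 512
BYTES_PER_VALUE = 4          # float32
BATCH_SIZE = 4096
PAGE_SIZE = 4096             # assumption: typical OS page size in bytes

row_bytes = NUM_COLS * BYTES_PER_VALUE    # bytes in one sample
batch_bytes = BATCH_SIZE * row_bytes      # useful bytes in one batch
file_bytes = NUM_ROWS * row_bytes         # total size of the memmap file

# With random indices every row can sit on a different page, so at least
# one full page has to be fetched per row even though a row is smaller.
min_random_batch_bytes = BATCH_SIZE * PAGE_SIZE

# Reported numbers: about 500 MB/s of reads at about 3 batches/second.
observed_bytes_per_row = 500e6 / 3 / BATCH_SIZE

print(f"row: {row_bytes} B, batch: {batch_bytes / 2**20:.0f} MiB, "
      f"file: {file_bytes / 1e9:.1f} GB")
print(f"minimum read per random batch: {min_random_batch_bytes / 2**20:.0f} MiB")
print(f"observed read per row: {observed_bytes_per_row:.0f} B")
```

So the disk appears to be reading far more per row than the row itself occupies. One possible explanation is that page granularity and OS readahead pull in neighbouring data that is evicted before it is ever used, but that is a guess rather than something verified in this thread.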

Can you check everything with shuffling off?
Consider that you have 44M samples. With shuffling on, a vector of 44M indices has to be created and shuffled.
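To put a rough number on that index vector, here is a small estimate. It assumes the indices are stored as 64-bit integers, and, as far as I know, the default random sampler also converts the permutation to a Python list, which is considerably larger. The per-item cost of the list is measured on a small sample and extrapolated, so treat it as an approximation.

```python
import sys

import numpy as np

n = 44_000_000  # number of samples in the dataset

# Size of the permutation if stored as a compact int64 array.
int64_bytes = n * np.dtype(np.int64).itemsize

# Size of the same permutation as a Python list of ints, estimated from a
# small sample: each element costs a list slot plus an int object.
small = np.random.permutation(100_000).tolist()
per_item = (sys.getsizeof(small) + sum(sys.getsizeof(i) for i in small)) / len(small)
list_bytes_estimate = n * per_item

print(f"int64 array: {int64_bytes / 1e6:.0f} MB")
print(f"Python list (estimated): {list_bytes_estimate / 1e9:.2f} GB")
```

Either way this is a one-off cost per epoch in the main process, so on its own it does not explain a sustained drop in batches/second.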

This seems to solve it! Do you mind explaining a little why shuffle=True made it slower?

Don’t I still need to shuffle the dataset to train my model? How would I work around this?

Actually I think the problem still isn’t solved. I think the reading is quick because reading in order allows us to cache blocks of contiguous rows. If I manually reindex to random indices, it’s slow again.

Hi, sorry for the late reply.
You are not using mmap “properly”.
The mmap should be created in the __init__ function (not in __getitem__).
Once you create it, you can treat it as a normal array.
When you try to retrieve the data, it will be read from disk.
Creating the mmap instance in __getitem__ makes no sense to me.
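A minimal sketch of what that could look like, under a few assumptions: the class name RowMemmapDataset is made up for illustration, the number of rows is inferred from the file size rather than hard-coded, and the class is kept free of torch so the snippet runs on its own (in real use it would subclass torch.utils.data.Dataset). The demo file is a tiny stand-in for the real emb.memmap.

```python
import os
import tempfile

import numpy as np


class RowMemmapDataset:
    """Map-style dataset backed by a single memmap created once in __init__."""

    def __init__(self, path, num_cols, dtype=np.float32):
        itemsize = np.dtype(dtype).itemsize
        # Assumption: the file is a dense (rows x num_cols) array with no header.
        num_rows = os.path.getsize(path) // (num_cols * itemsize)
        # Creating the memmap does not read the data; pages load on access.
        self.data = np.memmap(path, dtype=dtype, mode="r",
                              shape=(num_rows, num_cols))

    def __getitem__(self, index):
        # Index like a normal array, then copy the row into regular memory.
        return np.array(self.data[index])

    def __len__(self):
        return self.data.shape[0]


# Tiny demo file standing in for the real embedding file.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.memmap")
demo = np.arange(8 * 4, dtype=np.float32).reshape(8, 4)
demo.tofile(demo_path)

dataset = RowMemmapDataset(demo_path, num_cols=4)
print(len(dataset), dataset[3])
```

Note that this returns rows of shape (num_cols,) rather than (1, num_cols) as in the original code. It removes the cost of creating a new mapping for every sample, but random indices still touch scattered pages on disk.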

The problem with shuffling is that it takes a lot of memory to generate the indices. Besides, you will run into this issue of contiguous versus scattered reads: reading contiguous blocks is indeed way faster than reading separate blocks.
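One common workaround, which is not something proposed earlier in this thread, is to shuffle at the block level: pick contiguous blocks of rows in random order, read each block in one go, and shuffle the rows inside the block in memory. A minimal sketch, where the function name and the block_rows parameter are made up for illustration and the demo file is a tiny stand-in:

```python
import os
import tempfile

import numpy as np


def block_shuffled_batches(data, batch_size, block_rows, rng):
    """Yield batches from contiguous blocks that are visited in random order."""
    block_starts = np.arange(0, data.shape[0], block_rows)
    rng.shuffle(block_starts)                 # random order of blocks
    for start in block_starts:
        # One contiguous read from the memmap into regular memory.
        block = np.array(data[start:start + block_rows])
        rng.shuffle(block)                    # shuffle rows inside the block
        for i in range(0, len(block), batch_size):
            yield block[i:i + batch_size]


# Tiny demo: 10 rows of 4 floats, where row i is filled with the value i.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.memmap")
rows = np.repeat(np.arange(10, dtype=np.float32)[:, None], 4, axis=1)
rows.tofile(demo_path)

data = np.memmap(demo_path, dtype=np.float32, mode="r", shape=(10, 4))
rng = np.random.default_rng(0)
batches = list(block_shuffled_batches(data, batch_size=2, block_rows=4, rng=rng))
seen = np.concatenate(batches)
print(len(batches), sorted(seen[:, 0].tolist()))
```

This is not a uniform shuffle: rows from the same block always end up in nearby batches, so block_rows trades randomness against read efficiency. Whether that is acceptable depends on how the rows are ordered in the file.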