I’m working on a dataset that is too big to fit into RAM. My current approach is to use a numpy memmap to load one sample/row at a time via a PyTorch DataLoader. The solution looks something like this:
import numpy as np
import torch
from torch.utils.data import DataLoader

class MMDataset(torch.utils.data.Dataset):
    def __init__(self, path):
        self.file_path = path
        self.dataset_len = 44000000      # total number of rows in the file
        self.bytes_per_value = 32 // 8   # float32 -> 4 bytes
        self.num_cols = 512              # values per row
        self.num_rows = 1                # rows read per __getitem__

    def __getitem__(self, index):
        # Re-open the memmap at the byte offset of this row
        # and copy the row out into a regular ndarray.
        x = np.memmap(self.file_path, dtype='float32', mode='r',
                      shape=(self.num_rows, self.num_cols),
                      offset=int(index * self.num_cols * self.bytes_per_value))
        return np.array(x)

    def __len__(self):
        return self.dataset_len

dataset = MMDataset('./data/emb.memmap')
data_loader = DataLoader(
    dataset,
    batch_size=4096,
    shuffle=True,
    num_workers=20,
)
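Each row is only 512 × 4 = 2048 bytes, so __getitem__(i) should touch exactly the 2 KB at byte offset 2048 * i. A quick sanity check of that layout (a minimal sketch, assuming ./data/emb.memmap really is 44M rows of 512 float32 values; it compares the memmap read against a plain seek-and-read):

import numpy as np

ROW_BYTES = 512 * 4  # 512 float32 values per row

def read_row_direct(path, index):
    # Read one row with a plain seek+read, no memmap involved.
    with open(path, 'rb') as f:
        f.seek(index * ROW_BYTES)
        return np.frombuffer(f.read(ROW_BYTES), dtype='float32')

# Should match what the Dataset returns for the same index.
assert np.array_equal(read_row_direct('./data/emb.memmap', 123),
                      dataset[123].ravel())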
When the available RAM is larger than the memmap file, data loading is fast: I get around 60 batches/second. But when the available RAM is smaller than the memmap file, throughput drops to around 3 batches/second. I discovered this by trying various sizes for the memmap file.
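For reference, the batches/second figures come from a crude timing loop along these lines (a sketch of my measurement; the batch count and warm-up are arbitrary):

import time

def batches_per_second(loader, num_batches=100):
    it = iter(loader)
    next(it)  # warm up: let the workers spin up before timing
    start = time.time()
    for _ in range(num_batches):
        next(it)
    return num_batches / (time.time() - start)

print(f'{batches_per_second(data_loader):.1f} batches/s')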
Why is this the case? If DataLoader + memmap is going to throttle whenever available RAM < memmap file size, that defeats the point of the solution.
I’ve observed that disk I/O holds at a constant ~500 MB/s of reads while available RAM < memmap file size. That is far more than the data actually required: a batch of 4096 samples is only 4096 × 512 × 4 bytes ≈ 8 MB, so even at 3 batches/second the useful read rate is roughly 24 MB/s.
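To check whether this is DataLoader overhead or purely the random access pattern, the same workload can be reproduced with plain seeks and reads. A minimal sketch (the helper name and read count are my own; it just times random single-row reads against the raw file):

import os
import random
import time

def random_read_rate(path, row_bytes=512 * 4, num_reads=10000):
    # Time random single-row reads, mimicking what shuffle=True
    # does to the file, with no DataLoader or memmap involved.
    num_rows = os.path.getsize(path) // row_bytes
    with open(path, 'rb') as f:
        start = time.time()
        for _ in range(num_reads):
            f.seek(random.randrange(num_rows) * row_bytes)
            f.read(row_bytes)
        elapsed = time.time() - start
    # MB/s of data actually delivered to the application.
    return num_reads * row_bytes / elapsed / 1e6

print(f'{random_read_rate("./data/emb.memmap"):.1f} MB/s useful')

Comparing that figure against what iostat reports as physical reads should show how much extra data the kernel pulls in around each 2 KB row.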