How to share a Dataset's large data across processes?

I have a Dataset:

from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

The data is a large NumPy array (almost 27 GB), so a larger num_workers on the DataLoader causes out-of-memory errors, since each worker process ends up holding its own copy of the data.
It is worth noting that data.dtype is object, so NumPy's memmap mode cannot be used directly.
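
For reference, here is a minimal sketch of why memmap does not apply here (the file names are placeholders): memory-mapping works for fixed-size numeric dtypes, but an object array only stores pointers to Python objects, so np.load refuses to memory-map it.

import numpy as np

# A fixed-size numeric dtype can be memory-mapped and shared between processes:
numeric = np.zeros((1000, 128), dtype=np.float32)
np.save("numeric.npy", numeric)
mapped = np.load("numeric.npy", mmap_mode="r")  # pages are loaded lazily from disk

# An object-dtype array only holds pointers to Python objects, so it has to be
# pickled and cannot be memory-mapped:
ragged = np.empty(2, dtype=object)
ragged[0] = np.arange(3)
ragged[1] = "some string"
np.save("ragged.npy", ragged, allow_pickle=True)
# np.load("ragged.npy", mmap_mode="r")  # raises ValueError for object dtypes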

To share the data across processes, I modified the Dataset as follows:

import numpy as np
from multiprocessing import shared_memory
from filelock import FileLock
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, data):
        data_size = data.nbytes
        # Use a file lock so that only one process creates the shared block.
        lock = FileLock("./create_shared_mem.lock")
        with lock:
            try:
                # First process to get here creates the block and copies the data in.
                shm = shared_memory.SharedMemory(name="data", create=True, size=data_size)
                shm_data = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)
                shm_data[:] = data[:]
            except FileExistsError:
                # The block already exists, so attach to it instead of creating it.
                shm = shared_memory.SharedMemory(name="data")
                shm_data = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)

        self.data = shm_data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]
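
For context, this is roughly how the dataset is driven; the file path, batch size, and worker count below are placeholders:

import numpy as np
from torch.utils.data import DataLoader

data = np.load("big_data.npy", allow_pickle=True)  # ~27 GB object array (placeholder path)
dataset = MyDataset(data)
# With num_workers > 0, each worker process calls __getitem__ on its own copy of the dataset object.
loader = DataLoader(dataset, batch_size=32, num_workers=4)

for batch in loader:
    ...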

But this implementation runs into some strange problems.
I want to know what the best practice is for sharing a large tensor across multiple processes in torch.