Memory buildup or crash after many batches

Another day, another thread :slight_smile:

I’m still working on getting distributed evaluation running.

This is the code:

Now I’m facing problems that seem to be related to the pin_memory option of the DataLoader; persistent_workers also seems to have an influence. The data loaded by the DataLoader consists only of randomly generated arrays, so no files are opened.
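A minimal sketch of the kind of setup described above (the dataset stand-in, batch size, and worker count here are assumptions, not the original code):

import torch
from torch.utils.data import DataLoader, Dataset


class RandomVolumes(Dataset):
    """Hypothetical stand-in: randomly generated volumes, no files opened."""

    def __len__(self):
        return 100_000

    def __getitem__(self, idx):
        return torch.randn(1, 37, 37, 37), idx


loader = DataLoader(
    RandomVolumes(),
    batch_size=8,              # assumption
    num_workers=4,             # assumption
    pin_memory=True,           # toggled in the cases below
    persistent_workers=False,  # toggled in the cases below
)

for volumes, indices in loader:
    pass  # evaluation step would go here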

Here is what is happening:


Case 1: pin_memory=True, persistent_workers=False

The system runs fine until it has processed roughly 4000 batches. Then the system memory suddenly starts to accumulate until the system crashes.

Case 2: pin_memory=False, persistent_workers=False

After 400-1000 batches I get the error:

RuntimeError: Too many open files. Communication with the workers is no longer possible. Please increase the limit using `ulimit -n` in the shell or change the sharing strategy by calling `torch.multiprocessing.set_sharing_strategy('file_system')` at the beginning of your code

Case 3: pin_memory=False, persistent_workers=True

Same as in case 2

Case 4: pin_memory=False, persistent_workers=True + set_sharing_strategy('file_system')

Same as in case 2
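For reference, the set_sharing_strategy call in case 4 is the one suggested by the error message and has to run at the very beginning of the script, before any DataLoader workers are started. A sketch of that placement:

import torch.multiprocessing

# Set before any DataLoader is created, as suggested by the RuntimeError above.
torch.multiprocessing.set_sharing_strategy('file_system')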


Any idea what might be happening?

I’m using PyTorch 2.1.0.dev20230719 because of the problems described here.

Best,
Thorsten

I would like to tag someone who is familiar with TorchVolumeDataset - is that a torch component or something you added? (I’m not familiar with it and didn’t find it via a quick search.) I’d reroute the question to someone familiar with it if possible.

It’s something I’ve added, but for this test I simplified it:

import numpy as np
import torch
from torch.utils.data import Dataset


class TorchVolumeDataset(Dataset):
    """Implementation for the volume dataset."""

    def __init__(self, volumes: "VolumeDataset"):  # VolumeDataset is project-specific
        self.volumes = volumes

    def __getitem__(self, item_index):
        # Original preprocessing, disabled for this test:
        # vol = self.volumes[item_index]
        # vol = vol.astype(np.float32)
        # vol = pp.norm(vol)
        # vol = vol[np.newaxis]

        # For the test, each item is just a freshly generated random volume.
        vol = np.random.randn(1, 37, 37, 37).astype(np.float32)
        torch_vol = torch.from_numpy(vol)
        input_triplet = {"volume": torch_vol}

        return input_triplet, item_index

    def __len__(self):
        return len(self.volumes)
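Since VolumeDataset is project-specific, anything with a __len__ works as a stand-in for this simplified test (a hypothetical example, not part of the original code):

class DummyVolumes:
    """Hypothetical stand-in for VolumeDataset: only its length is used above."""

    def __len__(self):
        return 10_000


dataset = TorchVolumeDataset(DummyVolumes())
triplet, index = dataset[0]
print(len(dataset), triplet["volume"].shape)  # 10000 torch.Size([1, 37, 37, 37])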

Setting `ulimit -n 65000` solves the problem for case 2.
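As an aside (not what was done here), the same open-file limit can also be raised from inside Python via the resource module:

import resource

# Raise the soft open-file limit of this process to the hard limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))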

Case 1 behaves the same as before, except that the memory accumulation now starts from the first batch.

But since case 2 is now working and case 1 is not, it seems clear that it must be a bug with pin_memory?

EDIT: As case 2 is now working well enough, I would consider that the solution. However, something still seems to be buggy with pin_memory.

@ptrblck could you comment on whether pin_memory is being used correctly in case 2? I think on some systems there is a limit on how much CPU memory you can pin, i.e. not the whole system RAM. I’m not sure whether some DataLoader configuration is needed to ensure that too much pinned memory is not used, or whether explicit freeing is needed.

@wconstab I assume you meant case 1, as the others don’t use pinned memory.
Generally, pinned memory should be used carefully, as it prevents the OS from migrating these pages, and you might run into memory thrashing or even instability issues when the OS needs to kill applications before running out of memory.
The description in case 1 sounds a bit weird, though. It seems that pin_memory=True works for some time until the host RAM usage suddenly spikes and the system crashes?
The DataLoader should only pin the memory for the loaded batches (i.e. num_workers, batch_size, and prefetch_factor would be important here).
@thorstenwagner is this issue reproducible with any dataset (e.g. just random tensors in TensorDataset)?
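A minimal self-contained check along those lines might look like this (dataset size, batch size, and worker count are assumptions):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Random tensors with the same shape as the volumes above (~200 KB each).
data = torch.randn(2_000, 1, 37, 37, 37)
loader = DataLoader(
    TensorDataset(data),
    batch_size=8,              # assumption
    num_workers=4,             # assumption
    prefetch_factor=2,         # default
    pin_memory=True,           # case 1 configuration
    persistent_workers=False,
)

# Only about num_workers * prefetch_factor * batch_size samples should be
# pinned at any time (4 * 2 * 8 * ~200 KB ≈ 13 MB here), so host RAM usage
# should stay flat while iterating.
for epoch in range(50):
    for (volumes,) in loader:
        pass  # watch host RAM here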

@ptrblck

So far it has happened for random tensors and for my data, so I guess yes, it’s reproducible.

BTW: When the memory accumulation starts seems to be somewhat random. When I opened the thread it was around batch 4000; now it starts basically from the beginning.

I also updated to the latest nightly build, which fixed other problems for me, but not this one.