I’m still working on getting distributed evaluation running.
This is the code:
Now I’m facing problems which seem to be related to the pin_memory option of the DataLoader. persistent_workers also seems to have an influence. The data loaded by the DataLoader is just randomly generated arrays; no files are opened.
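To make this concrete, here is a minimal sketch of the kind of setup I’m describing; the class name, shapes, and worker count are placeholders, not my actual code:

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class RandomArrayDataset(Dataset):
    """Produces randomly generated arrays on the fly; no files are opened."""

    def __init__(self, length=100_000, shape=(64, 64, 64)):
        self.length = length
        self.shape = shape

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        return torch.from_numpy(np.random.rand(*self.shape).astype(np.float32))


loader = DataLoader(
    RandomArrayDataset(),
    batch_size=8,
    num_workers=4,
    pin_memory=True,           # toggled between the cases below
    persistent_workers=False,  # toggled between the cases below
)

for i, batch in enumerate(loader):
    pass  # distributed evaluation step would go here
```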
Here is what is happening:
Case 1: pin_memory: True, persistent_workers: False
The system runs fine until it has processed roughly 4000 batches. Then system memory suddenly starts to accumulate until the system crashes.
Case 2: pin_memory: False, persistent_workers: False
After 400-1000 batches I get the error:
RuntimeError: Too many open files. Communication with the workers is no longer possible. Please increase the limit using `ulimit -n` in the shell or change the sharing strategy by calling `torch.multiprocessing.set_sharing_strategy('file_system')` at the beginning of your code
Case 3: pin_memory: False, persistent_workers: True
Same as in case 2
Case 4: pin_memory: False, persistent_workers: True + set_sharing_strategy('file_system')
Same as in case 2
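For completeness, the sharing-strategy change in case 4 is just the standard call suggested by the error message, placed at the top of the script before any DataLoader is created; this is a sketch of that call plus a check of the file-descriptor limit, not my exact code:

```python
import resource

import torch.multiprocessing as mp

# Switch worker communication to the file_system strategy, as suggested
# by the error message; this must run before the DataLoader is created.
mp.set_sharing_strategy("file_system")

# Current soft/hard limits on open file descriptors (what `ulimit -n` reports)
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open file limit: soft={soft}, hard={hard}")
```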
Any idea what might be happening?
I’m using PyTorch 2.1.0.dev20230719 because of the problems described here.
I would like to tag someone who is familiar with TorchVolumeDataset - is that a torch component or something you added? (I’m not familiar with it and didn’t find it in a quick search.) I’d reroute the question to someone familiar with it if possible.
@ptrblck could you comment on whether pin_memory is being used correctly in case 2? I think on some systems there is a limit on how much CPU memory you can pin, i.e. not the whole system RAM. I’m not sure if some configuration is needed in the DataLoader to ensure that not too much memory gets pinned, or if explicit freeing is needed.
@wconstab I assume you meant case 1 as the others don’t use pinned memory.
Generally, pinned memory should be used carefully, as it prevents the OS from migrating these pages, and you might run into memory thrashing or even instability issues when the OS needs to kill applications before running out of memory.
The description in case 1 sounds a bit weird, though. It seems pin_memory=True works for some time until the host RAM usage suddenly spikes and the system crashes?
The DataLoader should only pin the memory for the loaded batches (i.e. num_workers, batch_size, and prefetch_factor would be important here). @thorstenwagner is this issue reproducible with any dataset (e.g. just random tensors in TensorDataset)?
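A minimal sketch of what I mean by such a check (the shapes, sizes, and loader settings here are arbitrary):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Purely random tensors, to rule out anything dataset-specific
data = torch.randn(1_000, 3, 64, 64)
targets = torch.randint(0, 10, (1_000,))

loader = DataLoader(
    TensorDataset(data, targets),
    batch_size=64,
    num_workers=4,
    pin_memory=True,
    prefetch_factor=2,  # roughly num_workers * prefetch_factor batches are in flight at once
)

# Host memory usage should stay flat while iterating repeatedly
for epoch in range(10):
    for batch, target in loader:
        pass
```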
So far it has happened for random tensors and for my data, so I guess yes, it’s reproducible.
BTW: when the memory accumulation starts seems to be somewhat random. When I opened the thread it was around batch 4000; now it starts basically from the beginning.
I also updated to the latest nightly build, which fixed other problems for me, but not this one.