Training crashes due to insufficient shared memory (shm) - nn.DataParallel

I have also been getting this bus error, even though df -h shows only 1% of shared memory in use.
It goes away when I change torch.multiprocessing.set_sharing_strategy to 'file_system' instead of the default ('file_descriptor'). I would have expected the file_descriptor strategy to fail with a "too many open files" error rather than a bus error.
Is it possible that running out of file descriptors shows up as a bus error?
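
For reference, the strategy switch is a one-liner that has to run before any DataLoader workers are spawned; a minimal sketch of the workaround described above:

```python
import torch.multiprocessing as mp

# Workaround: share tensors through files (backed by /dev/shm) instead of
# passing open file descriptors between worker processes.
mp.set_sharing_strategy('file_system')

print(mp.get_sharing_strategy())        # 'file_system'
print(mp.get_all_sharing_strategies())  # strategies supported on this platform
```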

This solved my problem: have the dataset return tensors / numpy arrays instead of Python lists.
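
In case it helps, here is a rough sketch of what "tensors / numpy arrays instead of lists" can look like in a Dataset's `__getitem__` (the dataset class and field names are made up for illustration):

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class MyDataset(Dataset):  # hypothetical dataset, for illustration only
    def __init__(self, samples):
        self.samples = samples  # e.g. a list of (features, label) pairs

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        features, label = self.samples[idx]
        # Return one contiguous tensor per sample instead of a nested Python list.
        return torch.as_tensor(np.asarray(features), dtype=torch.float32), int(label)
```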

Hi,

In my case, I had a (shared) memory leak due to torch.multiprocessing.set_sharing_strategy('file_system').

After setting the sharing strategy to file_descriptor via
torch.multiprocessing.set_sharing_strategy('file_descriptor'), the problem went away.
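
For completeness, the switch back, plus a check of the open-file limit (the file_descriptor strategy keeps one descriptor open per shared tensor, so a low limit can matter), might look like this; a sketch, not a guaranteed fix:

```python
import resource  # Unix only
import torch.multiprocessing as mp

mp.set_sharing_strategy('file_descriptor')

# A low `ulimit -n` can surface as "too many open files" with this strategy.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")
```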

I got a hint here: Multiprocessing package - torch.multiprocessing — PyTorch 2.0 documentation

Hope this helps anyone.

Can someone explain this? Yes, I am converting numpy arrays to tensors in my collate_fn, but I am not sure whether that is the cause, and if it is, I would appreciate an explanation. I am facing the same problem as mentioned in this discussion: my shm size was 8 GB, and I hit the error in the validation loop of the very first epoch, while the training loop ran fine with num_workers = 16 in both the train and val DataLoaders.
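
Since the question mentions converting numpy to tensors in collate_fn, here is roughly what such a collate_fn looks like (names are illustrative, not from the original post); whether this is the actual cause depends on how much data each batch places in shared memory:

```python
import numpy as np
import torch

def collate_fn(batch):
    # batch: list of (numpy array, label) pairs returned by the dataset
    images = np.stack([sample[0] for sample in batch])
    labels = np.asarray([sample[1] for sample in batch])
    # torch.from_numpy shares memory with the numpy array; with num_workers > 0,
    # the worker then moves the batch into shared memory to hand it to the main process.
    return torch.from_numpy(images), torch.from_numpy(labels)
```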