I have also been getting this bus error even though df -h shows only 1% of shared memory in use.
It seems to work when I change torch.multiprocessing.set_sharing_strategy to 'file_system' instead of the default ('file_descriptor'). I would have expected the file_descriptor strategy to cause a "too many open files" error rather than a bus error.
Is it possible that too many open files causes a bus error?
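For context, the per-process open-file limit can be inspected from Python; under the file_descriptor strategy, each tensor shared with a worker keeps a descriptor open, so a low soft limit can be exhausted quickly with many workers. A minimal standard-library sketch (the limit values are platform-dependent):

```python
import resource

# Soft and hard limits on open file descriptors for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")

# The soft limit can usually be raised up to the hard limit without
# root, which sometimes postpones fd exhaustion with many workers.
if hard != resource.RLIM_INFINITY:
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
print("soft limit now:", resource.getrlimit(resource.RLIMIT_NOFILE)[0])
```

Whether fd exhaustion surfaces as a bus error or a "too many open files" error is not something this snippet can decide; it only shows where the ceiling is.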
This solved my problem: use tensors / numpy arrays instead of Python lists.
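To illustrate the suggestion above, here is a hedged sketch (the __getitem__-style functions are hypothetical, not from the original posts): returning one numpy array per sample gives the default collate a single contiguous buffer to share, instead of many small Python objects that each end up as a separately shared tensor.

```python
import numpy as np

# Hypothetical sample functions: both return the same data, but the
# list version is collated element by element, while the array version
# is one contiguous block of memory.
def sample_as_list(i):
    return [float(i), float(i) * 2.0]

def sample_as_array(i):
    return np.asarray([i, i * 2], dtype=np.float32)

print(sample_as_list(3))         # [3.0, 6.0]
print(sample_as_array(3).shape)  # (2,)
```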
Hi,
In my case, I had a (shared) memory leak due to torch.multiprocessing.set_sharing_strategy('file_system'). After setting the sharing strategy back to file_descriptor via torch.multiprocessing.set_sharing_strategy('file_descriptor'), the problem went away.
I got a hint here: Multiprocessing package - torch.multiprocessing — PyTorch 2.0 documentation
Hope this helps anyone.
Can someone explain this? Yes, I am converting numpy arrays to tensors in my collate_fn, but I am not sure whether that is the cause, and if it is, I would appreciate an explanation of why. I am facing the same problem as described in this discussion. My shm size was 8 GB, and the problem appeared in the validation loop of the very first epoch; the training loop ran fine with num_workers = 16 in both the train and val dataloaders.