Training crashes due to insufficient shared memory (shm) - nn.DataParallel

I have also been getting this bus error, even though df -h shows only 1% of shared memory in use.
It goes away when I change torch.multiprocessing.set_sharing_strategy to 'file_system' instead of the default ('file_descriptor'). I would have expected the file_descriptor strategy to fail with a "too many open files" error rather than a bus error.
Is it possible that running out of file descriptors shows up as a bus error?
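
For reference, the strategy switch is a one-liner that has to run before any DataLoader workers are spawned; a minimal sketch of the workaround described above:

```python
import torch.multiprocessing as mp

# Workaround: share tensors through files (backed by /dev/shm) instead of
# passing open file descriptors between worker processes.
mp.set_sharing_strategy('file_system')

print(mp.get_sharing_strategy())        # 'file_system'
print(mp.get_all_sharing_strategies())  # strategies supported on this platform
```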

This solved my problem: have the dataset return tensors / numpy arrays instead of Python lists.
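
In case it helps, here is a rough sketch of what "tensors / numpy arrays instead of lists" can look like in a Dataset's `__getitem__` (the dataset class and field names are made up for illustration):

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class MyDataset(Dataset):  # hypothetical dataset, for illustration only
    def __init__(self, samples):
        self.samples = samples  # e.g. a list of (features, label) pairs

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        features, label = self.samples[idx]
        # Return one contiguous tensor per sample instead of a nested Python list.
        return torch.as_tensor(np.asarray(features), dtype=torch.float32), int(label)
```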

Hi,

In my case, I had a (shared) memory leak due to torch.multiprocessing.set_sharing_strategy('file_system').

After setting the sharing strategy to file_descriptor via
torch.multiprocessing.set_sharing_strategy('file_descriptor'), the problem went away.
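
For completeness, the switch back, plus a check of the open-file limit (the file_descriptor strategy keeps one descriptor open per shared tensor, so a low limit can matter), might look like this; a sketch, not a guaranteed fix:

```python
import resource  # Unix only
import torch.multiprocessing as mp

mp.set_sharing_strategy('file_descriptor')

# A low `ulimit -n` can surface as "too many open files" with this strategy.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")
```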

I got a hint here: Multiprocessing package - torch.multiprocessing — PyTorch 2.0 documentation

Hope this helps anyone.

Can someone explain this? Yes, I am converting numpy arrays to tensors in my collate_fn, but I am not sure whether that is the cause, and if it is, I would appreciate an explanation. I am facing the same problem as mentioned in this discussion: my shm size was 8 GB, and I hit the error in the validation loop of the very first epoch, while the training loop ran fine with num_workers = 16 in both the train and val DataLoaders.
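
Since the question mentions converting numpy to tensors in collate_fn, here is roughly what such a collate_fn looks like (names are illustrative, not from the original post); whether this is the actual cause depends on how much data each batch places in shared memory:

```python
import numpy as np
import torch

def collate_fn(batch):
    # batch: list of (numpy array, label) pairs returned by the dataset
    images = np.stack([sample[0] for sample in batch])
    labels = np.asarray([sample[1] for sample in batch])
    # torch.from_numpy shares memory with the numpy array; with num_workers > 0,
    # the worker then moves the batch into shared memory to hand it to the main process.
    return torch.from_numpy(images), torch.from_numpy(labels)
```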