I am training an image recognition model on a dataset of 4M training images (200x200 each).
Here is the configuration of the training setup:
PyTorch v0.4.1
multi-GPU: 4
num_workers of my DataLoader = 16
tried pin_memory=True / pin_memory=False
system configuration: 4 Tesla GPUs (6 GB each), RAM: 128 GB
My training crashes after a few epochs with error messages like:
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm)
Here are my system’s shared memory limits:
$ ipcs -lm
------ Shared Memory Limits --------
max number of segments = 8192
max seg size (kbytes) = 18014398509465599
max total shared memory (kbytes) = 18446744073642442748
min seg size (bytes) = 1
Are you using a Docker container?
If so, you should increase the shared memory for the container, as it might be too low.
Have a look at the notes here.
I noticed that this behaviour is related to using nn.DataParallel() (multi-GPU) together with num_workers > 1 in torch.utils.data.DataLoader().
After a few epochs, the training crashes with errors like:
"ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm)"
Are there any conflicts / problems when using multiple workers in DataLoader() together with nn.DataParallel()?
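For reference, a stripped-down version of my setup looks roughly like this (the dataset and model below are dummy stand-ins, and the batch size is just an example):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-ins for the real 4M-image dataset and the real model,
# just to show the structure that triggers the crash.
dataset = TensorDataset(torch.randn(1024, 3, 200, 200),
                        torch.randint(0, 10, (1024,)))
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1)).cuda()

# The combination in question: DataParallel over the 4 GPUs
# plus a DataLoader with num_workers > 1.
model = nn.DataParallel(model, device_ids=[0, 1, 2, 3])
loader = DataLoader(dataset, batch_size=64, shuffle=True,
                    num_workers=16, pin_memory=True)

for epoch in range(3):
    for images, targets in loader:
        images = images.cuda(non_blocking=True)
        output = model(images)
        # ... loss / backward / optimizer step ...
```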
We fixed some errors related to shared memory and the DataLoader on the master branch. Maybe try out the PyTorch “Preview” build from our website and see if that fixes it.
Thanks! When I execute the command:
sysctl kernel.shmmax
The result is:
18446744073692774399
Does that mean the value of shmmax in my system is big enough?
I get the same situation with next(iter(data_loader)) (my /dev/shm is 256G). Setting num_workers=0 does fix this, but with num_workers=0 loading the data takes more time. There is an issue tracking this situation, https://github.com/pytorch/pytorch/issues/13246, but can we have a better solution?
For me the issue was that I was already converting numpy arrays to torch tensors in the DataLoader's __getitem__.
Numpy arrays should only be converted to torch tensors in the training loop, just before being sent to the model. Otherwise the tensors will make the shared memory grow out of bounds.
You can monitor the shared memory by running the command watch -n .3 df -h
The shared memory corresponds to the /dev/shm line.
The used amount should not increase after each epoch.
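To make that concrete, here is a minimal sketch of the pattern being described, with a toy dataset standing in for the real one (shapes and sizes are placeholders): __getitem__ returns plain numpy arrays, and the explicit conversion to tensors only happens in the training loop. Note that with the default collate_fn the batch is already stacked into a torch tensor by the time it leaves the loader, so torch.as_tensor is a no-op there.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class NumpyImageDataset(Dataset):
    """Toy stand-in for the real dataset: __getitem__ returns numpy arrays, not tensors."""
    def __init__(self, n=1024):
        self.images = np.random.rand(n, 3, 200, 200).astype(np.float32)
        self.labels = np.random.randint(0, 10, size=n).astype(np.int64)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        # no torch.from_numpy() / torch.tensor() here
        return self.images[idx], self.labels[idx]

loader = DataLoader(NumpyImageDataset(), batch_size=32, num_workers=4)

for images, labels in loader:
    # Convert just before sending to the model. With the default collate_fn the
    # batch is already a torch tensor here, so as_tensor does not copy anything.
    images = torch.as_tensor(images)
    labels = torch.as_tensor(labels)
    # ... forward / backward pass ...
```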
That would make some sense, since some kinds of data cannot be gathered into an array until the collate_fn, e.g. text data. But why would they make the memory grow out of bounds? I thought that CPU tensors are just wrappers around ndarrays.
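For example, torch.from_numpy() wraps the existing buffer instead of copying it, so a write to the ndarray is visible through the tensor:

```python
import numpy as np
import torch

a = np.zeros(3, dtype=np.float32)
t = torch.from_numpy(a)   # shares memory with `a`, no copy
a[0] = 5.0
print(t)                  # tensor([5., 0., 0.]) -- the change shows up in the tensor
```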
@ptrblck I am still facing this error that the shared memory is not large enough. I run into it when I use large models. For example, if I use four resnet50 sub-models in a single large model, I hit this issue; however, if I change the four resnet50 to four resnet18 in that same model, I don't face the shared memory issue. Is there any way to increase the shared memory in PyTorch, or do I need to modify the UNIX system? Thanks in advance.
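For context, the structure of the large model is roughly as follows; how the four backbones are combined here is a simplified assumption for illustration, only the "four resnet50 inside one model" part reflects my actual setup:

```python
import torch
import torch.nn as nn
from torchvision import models

class FourBackboneModel(nn.Module):
    """Simplified sketch: four resnet50 sub-models inside one large model.
    Replacing models.resnet50 with models.resnet18 makes the shm error go away."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.backbones = nn.ModuleList(
            [models.resnet50(pretrained=False) for _ in range(4)])
        # How the four outputs are combined is an assumption for illustration.
        self.head = nn.Linear(4 * 1000, num_classes)

    def forward(self, x):
        feats = [backbone(x) for backbone in self.backbones]
        return self.head(torch.cat(feats, dim=1))

model = nn.DataParallel(FourBackboneModel().cuda())
```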