Training crashes due to insufficient shared memory (shm) with nn.DataParallel

Hi all,

I am training an image recognition model on a large dataset (4M training images, 200x200 pixels each).

Here are the configurations of the training setup (a minimal code sketch follows after the list):

  1. PyTorch v0.4.1
  2. multi-GPU - 4
  3. num_workers of my DataLoader = 16
  4. tried both pin_memory=True and pin_memory=False
  5. system configuration: 4 Tesla GPUs (6 GB each), 128 GB RAM
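
For reference, the setup looks roughly like the sketch below (not my exact code: it uses a small random in-memory dataset and a torchvision resnet18 just to make the pattern concrete, and it needs CUDA GPUs to run):

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset
    from torchvision.models import resnet18

    # Stand-in for the real 4M-image dataset: random 200x200 RGB images and labels.
    images = torch.randn(1024, 3, 200, 200)
    labels = torch.randint(0, 10, (1024,))
    dataset = TensorDataset(images, labels)

    loader = DataLoader(
        dataset,
        batch_size=64,
        shuffle=True,
        num_workers=16,   # the setting that seems to trigger the shm error
        pin_memory=True,  # tried both True and False
    )

    # nn.DataParallel replicates the model across the 4 GPUs.
    model = nn.DataParallel(resnet18(num_classes=10)).cuda()
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(2):
        for batch, targets in loader:
            batch = batch.cuda(non_blocking=True)
            targets = targets.cuda(non_blocking=True)
            optimizer.zero_grad()
            loss = criterion(model(batch), targets)
            loss.backward()
            optimizer.step()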

My training crashes after a few epochs with error messages like:

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm)

Here are my system’s shared memory limits:
$ ipcs -lm

------ Shared Memory Limits --------
max number of segments = 8192
max seg size (kbytes) = 18014398509465599
max total shared memory (kbytes) = 18446744073642442748
min seg size (bytes) = 1

Any ideas why I am getting shared memory crashes?

Any help is highly appreciated.

Thanks

2 Likes

Are you using a docker container?
If so, you should increase the shared memory for the container as it might be too low.
Have a look at the notes here.

11 Likes

@ptrblck: No I am not using a docker container. I am using a conda installation.

Hi @ptrblck, pytorch users,

I noticed that this behaviour is related to using nn.DataParallel() (multi-GPU) together with num_workers > 1 in torch.utils.data.DataLoader().

After a few epochs, the training crashes with errors like:
" ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm)"

Are there any conflicts / problems when using multiple workers in DataLoader() together with nn.DataParallel()?

Thanks

1 Like

@apaszke, @smth

Hi guys,
Please help with this.

Thanks.

We fixed some errors w.r.t. shared memory and the DataLoader on the master branch. Maybe try out the PyTorch “Preview” build from our website and see if that fixes it.

Hi @smth,
I tried the PyTorch “Preview” build, but the issue remains the same. Could you please investigate this for the convenience of users?

1 Like

Is this issue totally fixed? I still have it on PyTorch 1.0.0.

I have this issue when I set the DataLoader num_workers to 32 or more.
I use PyTorch 1.1.0.

More workers might use more shared memory, so you would need to increase the current limit.

I have this issue on PyTorch 1.1.0 too. Is there any example for increasing the size of shared memory? Thanks!

I have this issue on PyTorch 1.1.0 too. Is there any way for users to set the size of shared memory? Thanks!

If you are using ubuntu, you could check the max shared memory size via:

sysctl kernel.shmmax

and set a new value in /etc/sysctl.conf as:

kernel.shmmax=6400000
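
For reference, both values can also be checked from Python (a sketch, assuming a typical Linux setup where /dev/shm is a tmpfs; the DataLoader workers usually back their shared tensors with files there, so the size of /dev/shm often matters more than kernel.shmmax):

    import shutil

    # /dev/shm is a tmpfs on most Linux systems; this is the space the
    # DataLoader workers typically allocate shared tensors from.
    total, used, free = shutil.disk_usage("/dev/shm")
    print(f"/dev/shm: {used / 2**30:.1f} GiB used of {total / 2**30:.1f} GiB")

    # The System V limit reported by `sysctl kernel.shmmax` can be read from procfs.
    with open("/proc/sys/kernel/shmmax") as f:
        print("kernel.shmmax:", f.read().strip())

If /dev/shm itself is small (the tmpfs default is typically half of the physical RAM, and Docker containers default to only 64 MB), raising kernel.shmmax alone will not help.
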
1 Like

Thanks! When I execute the command:
sysctl kernel.shmmax
The result is:
18446744073692774399
Does that mean the value of shmmax in my system is big enough?

1 Like

It might be big enough.
Which errors are you seeing that make you assume your shared memory is not large enough?

The error message is the same: “ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm)”.
When I set num_workers=0, the code runs normally.

1 Like

I get the same situation with next(iter(data_loader)) (my /dev/shm is 256G). Setting num_workers=0 does indeed fix this, but then loading the data takes much more time. There is an issue for this situation, https://github.com/pytorch/pytorch/issues/13246, but can we have a better solution?

2 Likes

For me the issue was that I was already converting numpy arrays to torch tensors in the dataset's __getitem__.

Numpy arrays should only be converted to torch tensors in the training loop, just before being sent to the model. Otherwise the tensors will make the shared memory grow out of bounds.
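
To make that concrete, the pattern looks roughly like this (a sketch with a made-up toy NumpyDataset, not my actual code):

    import numpy as np
    import torch
    from torch.utils.data import Dataset, DataLoader

    class NumpyDataset(Dataset):
        """Toy stand-in for the real dataset: __getitem__ returns plain numpy arrays."""
        def __init__(self, n=256):
            self.images = np.random.rand(n, 3, 200, 200).astype(np.float32)
            self.labels = np.random.randint(0, 10, size=n).astype(np.int64)

        def __len__(self):
            return len(self.images)

        def __getitem__(self, idx):
            # No torch.from_numpy() here -- keep returning numpy arrays.
            return self.images[idx], self.labels[idx]

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    loader = DataLoader(NumpyDataset(), batch_size=32, num_workers=4)

    for images, labels in loader:
        # The default collate_fn batches the numpy arrays into CPU tensors;
        # the conversion/move to the GPU happens here, just before the model call.
        images, labels = images.to(device), labels.to(device)
        # ... forward pass, loss, backward, etc.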

You can monitor the shared memory by running the command watch -n .3 df -h
The shared memory corresponds to the /dev/shm line.
The used amount should not increase after each epoch.

5 Likes

I was always under the impression that arrays should be converted to tensors in __getitem__. It’s shown in the tutorial: https://pytorch.org/tutorials/beginner/data_loading_tutorial.html

That would make some sense, since some kinds of data cannot be gathered into an array until the collate_fn, e.g. text data. But why would they make the memory grow out of bounds? I thought that CPU tensors are just wrappers around ndarrays.

1 Like

@ptrblck I am still facing this error about the shared memory not being large enough. I hit it when I use large models. For example, if I use four resnet50 sub-models in a single large model, I run into this issue; however, if I change the four resnet50 sub-models to four resnet18 sub-models in the same model, I don't. Is there any way I can increase the shared memory from within PyTorch, or do I need to modify the UNIX system? Thanks in advance.