I am training an image recognition model on a dataset of 4M training images (200x200 each).
Here is the configuration of the training setup:
PyTorch v0.4.1
multi-GPU: 4
num_workers of my DataLoader = 16
tried pin_memory=True / pin_memory=False
system configuration: 4 Tesla GPUs (6 GB each), RAM: 128 GB
My training crashes after a few epochs with error messages like:
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm)
Here are my system’s shared memory limits:
$ ipcs -lm
------ Shared Memory Limits --------
max number of segments = 8192
max seg size (kbytes) = 18014398509465599
max total shared memory (kbytes) = 18446744073642442748
min seg size (bytes) = 1
Are you using a Docker container?
If so, you should increase the shared memory for the container, as it might be too low.
Have a look at the notes here.
I noticed that this behaviour is related to using nn.DataParallel() (multi-GPU) together with num_workers > 1 in torch.utils.data.DataLoader().
After a few epochs, the training crashes with errors like:
"ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm)"
Are there any conflicts / problems when using multiple workers in DataLoader() together with nn.DataParallel()?
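For reference, a stripped-down version of my setup looks roughly like this (the dataset and model below are dummy stand-ins, and the batch size is just an example):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-ins for the real 4M-image dataset and the real model,
# just to show the structure that triggers the crash.
dataset = TensorDataset(torch.randn(1024, 3, 200, 200),
                        torch.randint(0, 10, (1024,)))
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1)).cuda()

# The combination in question: DataParallel over the 4 GPUs
# plus a DataLoader with num_workers > 1.
model = nn.DataParallel(model, device_ids=[0, 1, 2, 3])
loader = DataLoader(dataset, batch_size=64, shuffle=True,
                    num_workers=16, pin_memory=True)

for epoch in range(3):
    for images, targets in loader:
        images = images.cuda(non_blocking=True)
        output = model(images)
        # ... loss / backward / optimizer step ...
```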
We fixed some errors related to shared memory and the DataLoader on the master branch. Maybe try out the PyTorch “Preview” build from our website and see if that fixes it.
Thanks! When I execute the command:
sysctl kernel.shmmax
The result is:
18446744073692774399
Does that mean the value of shmmax in my system is big enough?
I get the same situation with next(iter(data_loader)) (my /dev/shm is 256G). Setting num_workers=0 does fix this, but with num_workers=0 loading the data takes more time. There is an issue tracking this situation, https://github.com/pytorch/pytorch/issues/13246, but can we have a better solution?
For me the issue was that I was already converting numpy arrays to torch tensors in the DataLoader's __getitem__.
Numpy arrays should only be converted to torch tensors in the training loop, just before being sent to the model. Otherwise the tensors will make the shared memory grow out of bounds.
You can monitor the shared memory by running the command watch -n .3 df -h
The shared memory corresponds to the /dev/shm line.
The used amount should not increase after each epoch.
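To make that concrete, here is a minimal sketch of the pattern being described, with a toy dataset standing in for the real one (shapes and sizes are placeholders): __getitem__ returns plain numpy arrays, and the explicit conversion to tensors only happens in the training loop. Note that with the default collate_fn the batch is already stacked into a torch tensor by the time it leaves the loader, so torch.as_tensor is a no-op there.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class NumpyImageDataset(Dataset):
    """Toy stand-in for the real dataset: __getitem__ returns numpy arrays, not tensors."""
    def __init__(self, n=1024):
        self.images = np.random.rand(n, 3, 200, 200).astype(np.float32)
        self.labels = np.random.randint(0, 10, size=n).astype(np.int64)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        # no torch.from_numpy() / torch.tensor() here
        return self.images[idx], self.labels[idx]

loader = DataLoader(NumpyImageDataset(), batch_size=32, num_workers=4)

for images, labels in loader:
    # Convert just before sending to the model. With the default collate_fn the
    # batch is already a torch tensor here, so as_tensor does not copy anything.
    images = torch.as_tensor(images)
    labels = torch.as_tensor(labels)
    # ... forward / backward pass ...
```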
That would make some sense, since some kinds of data cannot be gathered into an array until the collate_fn, e.g. text data. But why would they make the memory grow out of bounds? I thought that CPU tensors are just wrappers around ndarrays.
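For example, torch.from_numpy() wraps the existing buffer instead of copying it, so a write to the ndarray is visible through the tensor:

```python
import numpy as np
import torch

a = np.zeros(3, dtype=np.float32)
t = torch.from_numpy(a)   # shares memory with `a`, no copy
a[0] = 5.0
print(t)                  # tensor([5., 0., 0.]) -- the change shows up in the tensor
```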
@ptrblck I am still facing this error that the shared memory is not large enough. I run into it when I use large models. For example, if I use four resnet50 sub-models in a single large model, I hit this issue; however, if I change the four resnet50 to four resnet18 in that same model, I don't face the shared memory issue. Is there any way to increase the shared memory in PyTorch, or do I need to modify the UNIX system? Thanks in advance.
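For context, the structure of the large model is roughly as follows; how the four backbones are combined here is a simplified assumption for illustration, only the "four resnet50 inside one model" part reflects my actual setup:

```python
import torch
import torch.nn as nn
from torchvision import models

class FourBackboneModel(nn.Module):
    """Simplified sketch: four resnet50 sub-models inside one large model.
    Replacing models.resnet50 with models.resnet18 makes the shm error go away."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.backbones = nn.ModuleList(
            [models.resnet50(pretrained=False) for _ in range(4)])
        # How the four outputs are combined is an assumption for illustration.
        self.head = nn.Linear(4 * 1000, num_classes)

    def forward(self, x):
        feats = [backbone(x) for backbone in self.backbones]
        return self.head(torch.cat(feats, dim=1))

model = nn.DataParallel(FourBackboneModel().cuda())
```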