I was wondering: is there any way to use a DataLoader with multiple workers without running Docker with --ipc=host (i.e. nvidia-docker run --rm -ti)? I wasn’t able to use num_workers > 0 when not using --ipc=host.
Asking because, on GPU cluster machines that launch jobs via Docker, all processes run inside the same container and share the same IPC namespace. Since the PyTorch containers will be running on a host alongside other jobs, say TensorFlow jobs, this might cause a problem [?]
Hello, is this issue resolved by now? I have the same problem. I have a machine learning project using PyTorch that is trained remotely. However, the docker container is not started with the --ipc=host or --shm-size flags. The PyTorch documentation (https://github.com/pytorch/pytorch#docker-image) says this is required to run multiple workers in a docker container. Setting the number of workers to 0 is not an adequate solution, because then training takes 10 times longer. Is there any way to get PyTorch dataloaders working in a docker container? One thing, though: we are able to create the Dockerfile ourselves. Can we define it there? Can we apply a different method?
What issue are you seeing with --ipc=host or setting the shared memory size via e.g. --shm-size 8g?
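For context, both are runtime options passed to docker run when the container is started, not Dockerfile instructions; a hedged example (the image name is just a placeholder):

```shell
# Option 1: share the host's IPC namespace (what the PyTorch README suggests)
docker run --rm -ti --ipc=host pytorch/pytorch

# Option 2: keep an isolated IPC namespace but enlarge /dev/shm
# (Docker's default is only 64 MB, too small for multi-worker DataLoaders)
docker run --rm -ti --shm-size 8g pytorch/pytorch
```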
I do not see any issues with the flags themselves, but I am not able to set them myself. It is up to the admin, and he does not want to change the current settings for compatibility reasons (other users, other frameworks). The only things I can provide are a Dockerfile and the Python code itself.
I’m not a docker expert, but to my understanding without these flags docker would only use a tiny amount of shared memory and thus (some) multiprocessing applications wouldn’t work, as they are unable to share data.
If your admin cannot use these flags, you would either have to use num_workers=0 (or a lower number of workers) or maybe swap Docker for another container solution (I’m unfortunately not familiar with other approaches).
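One code-level workaround that is sometimes suggested (this is an assumption about your setup, not a guaranteed fix): PyTorch’s multiprocessing can be switched to the file_system sharing strategy, which passes tensors between workers via files on disk rather than shared-memory file descriptors, sidestepping a tiny /dev/shm. A sketch, with a quick check of how much shared memory the container actually has:

```python
import shutil

# Docker caps /dev/shm at 64 MB by default unless --shm-size or
# --ipc=host is used; check what this container was given.
total, _, _ = shutil.disk_usage("/dev/shm")
print(f"/dev/shm size: {total / 2**20:.0f} MiB")

try:
    import torch.multiprocessing as mp
    # 'file_system' shares tensors through files on disk instead of
    # shared-memory file descriptors. Caveat: it can leak file handles
    # if workers crash, so it is a workaround, not a clean solution.
    mp.set_sharing_strategy("file_system")
except ImportError:
    pass  # torch not installed in this environment
```

Call set_sharing_strategy at the top of your training script, before the DataLoader workers are spawned.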
Hi @hanshans, I am faced with the same issue: I am provided with a docker container without --ipc=host, and any PyTorch DataLoader with more than 0 workers is getting killed. Were you able to solve this issue? Thanks.