I was wondering: is there any way to use a DataLoader with multiple workers without running Docker with --ipc=host (i.e. nvidia-docker run --rm -ti)? I wasn’t able to use num_workers > 0 when not using --ipc=host.
Asking because, on GPU cluster machines that launch jobs via Docker, all processes run inside the same container and share the same IPC namespace. Since the PyTorch containers will be running on a host alongside other jobs, say TensorFlow jobs, this might cause a problem [?]
Hello, is this issue resolved by now? I have the same problem. I have a machine learning project using PyTorch that is trained remotely. However, the docker container is not started with the --ipc=host or --shm-size flags. The PyTorch documentation (https://github.com/pytorch/pytorch#docker-image) says this is required to run multiple workers in a docker container. Setting the number of workers to 0 is not an adequate solution, because then training takes 10 times longer. Is there any way to get PyTorch dataloaders working in a docker container? One thing, though: we are able to create the Dockerfile ourselves. Can we define it there? Can we apply a different method?
What issue are you seeing with --ipc=host or setting the shared memory size via e.g. --shm-size 8g?
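For context, both are runtime options passed to docker run when the container is started, not Dockerfile instructions; a hedged example (the image name is just a placeholder):

```shell
# Option 1: share the host's IPC namespace (what the PyTorch README suggests)
docker run --rm -ti --ipc=host pytorch/pytorch

# Option 2: keep an isolated IPC namespace but enlarge /dev/shm
# (Docker's default is only 64 MB, too small for multi-worker DataLoaders)
docker run --rm -ti --shm-size 8g pytorch/pytorch
```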
I do not see any issues with the flags themselves, but I am not able to set them myself. It is up to the admin, and he does not want to change the current settings for compatibility reasons (other users, other frameworks). The only things I can provide are a Dockerfile and the Python code itself.
I’m not a docker expert, but to my understanding without these flags docker would only use a tiny amount of shared memory and thus (some) multiprocessing applications wouldn’t work, as they are unable to share data.
If your admin cannot use these flags, you would either have to use num_workers=0 (or a lower number of workers) or maybe swap Docker for another container solution (I’m unfortunately not familiar with other approaches).
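One code-level workaround that is sometimes suggested (this is an assumption about your setup, not a guaranteed fix): PyTorch’s multiprocessing can be switched to the file_system sharing strategy, which passes tensors between workers via files on disk rather than shared-memory file descriptors, sidestepping a tiny /dev/shm. A sketch, with a quick check of how much shared memory the container actually has:

```python
import shutil

# Docker caps /dev/shm at 64 MB by default unless --shm-size or
# --ipc=host is used; check what this container was given.
total, _, _ = shutil.disk_usage("/dev/shm")
print(f"/dev/shm size: {total / 2**20:.0f} MiB")

try:
    import torch.multiprocessing as mp
    # 'file_system' shares tensors through files on disk instead of
    # shared-memory file descriptors. Caveat: it can leak file handles
    # if workers crash, so it is a workaround, not a clean solution.
    mp.set_sharing_strategy("file_system")
except ImportError:
    pass  # torch not installed in this environment
```

Call set_sharing_strategy at the top of your training script, before the DataLoader workers are spawned.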
Hi @hanshans, I am faced with the same issue: I am provided with a docker container without --ipc=host, and any PyTorch DataLoader with more than 0 workers is getting killed. Were you able to solve this issue? Thanks.