Spawned Processes with DDP

When I run a model with DDP with 4 spawned processes, each process taking one GPU, nvidia-smi shows a lot of additional processes on the GPU used by the main process. Why are so many additional processes spawned? I am using torch=1.4.0 and CUDA=10.1.243, both installed through my conda environment.
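For reference, the launch is roughly like the following minimal sketch (simplified, with hypothetical names; the actual script is more involved):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank, world_size):
    # One process per GPU; the rank doubles as the device index here.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    torch.cuda.set_device(rank)
    model = nn.Linear(10, 10).cuda(rank)
    ddp_model = DDP(model, device_ids=[rank])

    # ... training loop ...

    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(4,), nprocs=4)
```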


See the discussion in the thread "`torch.distributed.barrier` used in multi-node distributed data-parallel training".

For a more detailed explanation, cc @mrshenli

I took a look at that, but I am still a little confused. Is it a backend issue, or is it normal behaviour?

It should be normal, depending on the backend.

It seems to happen with both nccl and gloo backends.

Maybe ask @mrshenli; I am not fully clear about their internal implementations.

So I played around a little more by breaking my pipeline down to a very simple training loop, just to ensure I wasn’t doing anything wrong. I then cloned my conda environment and updated torch and CUDA to the versions listed on the Getting Started page (the most up-to-date). That seems to fix the issue. I’m not sure whether it was CUDA, NCCL (in the cudatoolkit), or the updated torch, but updating does fix it.

Those (~500MB of CUDA memory consumption) look like CUDA contexts. It seems all processes somehow used cuda:0 somewhere. It could be caused by third-party libraries or calls like torch.cuda.empty_cache(). If you don’t want to debug the root cause, you can avoid it by setting the CUDA_VISIBLE_DEVICES env var so that each process only sees one GPU.
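As a rough sketch (assuming an mp.spawn-style launch; the worker names are hypothetical), setting the env var at the very top of each spawned worker, before any CUDA call, keeps each process pinned to a single device:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    # Restrict this process to a single physical GPU *before* any CUDA call,
    # so a CUDA context is only ever created on that one device.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(rank)

    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    # With CUDA_VISIBLE_DEVICES set, the only visible GPU is cuda:0.
    device = torch.device("cuda:0")
    # ... build the model on `device` and wrap it in DDP as usual ...

    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(4,), nprocs=4)
```

Because mp.spawn starts fresh child processes, CUDA has not been initialized in them yet, so the env var takes effect for everything that follows in the worker.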

I updated both torch and cudatoolkit (with conda) to the newest versions, which seemed to fix the problem. I’m not sure if it was CUDA, torch, or how torch interacted with CUDA, but updating seems to have fixed the bug. I’m not sure whether it improves anything else, but the extra processes don’t show up anymore.