Distributed Data Parallel training: (n-1) extra GPU processes on an n-GPU job

I’m using DDP for training (the DDP wrapper around the model). When I spawn n jobs, I see the n expected processes in nvidia-smi/gpustat, but (n-1) additional GPU processes are also created. I’m guessing these are for communication between gpu0 and the other processes, but I can’t find anything about this in the documentation or on this forum. Below is output from gpustat for a 2 GPU job. The two GPU processes using 9763M of memory each are the model, and then there is a 565M process on gpu0. That extra process seems to be about the same size for any model.

So is this normal or is it an incorrect setup?
Thanks.

Could you please check the process ids? Each DDP process is supposed to work on only one device. Suppose the expected case is process rank 0 on cuda:0 and process rank 1 on cuda:1. It’s likely that process rank 1 somehow also created a CUDA context on cuda:0, in which case the same process id will show up on both GPUs. 565M is also about the size of a bare CUDA context.
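One way to check the process ids is to ask nvidia-smi for its per-process table and group the rows by pid. The snippet below is only a sketch: the helper names (`query_compute_apps`, `parse_compute_apps`, `pids_on_multiple_gpus`) are made up for illustration, and it assumes your nvidia-smi supports the `pid`, `gpu_uuid` and `used_memory` fields of `--query-compute-apps`.

```python
import subprocess
from collections import defaultdict

# Assumed nvidia-smi invocation: one CSV row per (process, GPU) pair.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,gpu_uuid,used_memory",
    "--format=csv,noheader,nounits",
]


def query_compute_apps():
    """Run nvidia-smi and return its raw CSV output (needs an NVIDIA driver)."""
    return subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout


def parse_compute_apps(csv_text):
    """Map each pid to the list of (gpu_uuid, used_memory) rows it appears in."""
    gpus_by_pid = defaultdict(list)
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        pid, gpu_uuid, used_memory = [field.strip() for field in line.split(",")]
        gpus_by_pid[int(pid)].append((gpu_uuid, used_memory))
    return dict(gpus_by_pid)


def pids_on_multiple_gpus(gpus_by_pid):
    """Keep only the pids that hold a context on more than one GPU."""
    return {
        pid: rows
        for pid, rows in gpus_by_pid.items()
        if len({gpu_uuid for gpu_uuid, _ in rows}) > 1
    }
```

Calling `pids_on_multiple_gpus(parse_compute_apps(query_compute_apps()))` on the training machine should return an empty dict if every rank stays on its own GPU. If the pid of rank 1 comes back with two rows, that rank has touched cuda:0 as well.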

If that’s indeed the case, a few actions can help avoid the problem:

  1. Run torch.cuda.set_device(rank) on each process to properly set the current device before running any CUDA ops.
  2. If 1 still does not solve the problem, you can set the CUDA_VISIBLE_DEVICES env var so that process 0 only sees gpu0 and process 1 only sees gpu1.
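The two options above could look roughly like this in a single-node job started with `torch.multiprocessing.spawn`. This is a sketch under assumptions, not your actual setup: `setup_device`, `worker` and `launch` are hypothetical names, the `Linear` layer stands in for the real model, and the rendezvous address and port are placeholders.

```python
import os


def setup_device(rank, mask_other_gpus=False):
    """Return the device index this process should use.

    Option 1 (default): keep every GPU visible and use index `rank`.
    Option 2: hide all other GPUs via CUDA_VISIBLE_DEVICES. This has to
    happen before CUDA is initialized in this process; the one GPU that
    remains visible is then renumbered to index 0.
    """
    if mask_other_gpus:
        os.environ["CUDA_VISIBLE_DEVICES"] = str(rank)
        return 0
    return rank


def worker(rank, world_size, mask_other_gpus=False):
    device = setup_device(rank, mask_other_gpus)

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    # Option 1: set the current device before running any CUDA op.
    torch.cuda.set_device(device)
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

    model = torch.nn.Linear(8, 8).to(device)  # placeholder model
    ddp_model = DDP(model, device_ids=[device])
    # ... training loop using ddp_model goes here ...

    dist.destroy_process_group()


def launch(mask_other_gpus=False):
    import torch
    import torch.multiprocessing as mp

    # Placeholder rendezvous settings for a single-node job.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size, mask_other_gpus), nprocs=world_size)
```

Call `launch()` from under an `if __name__ == "__main__":` guard in your script, since `mp.spawn` starts fresh Python processes that re-import the module. With `mask_other_gpus=True`, rank 1 cannot create a context on gpu0 at all, which makes it a useful check when option 1 alone does not remove the extra process.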