Strange number of processes per GPU

I have a problem very similar to the one in "A strange problem about the gpu processes and num_workers".

In particular, the following happens, and I don’t understand why there are extra processes on each GPU that consume a lot of GPU RAM:

Could you please advise what could be the problem here?

I’m using PyTorch-Lightning for training using multiple GPUs. I have 4 V100 GPUs and the following dependencies:

  1. torch==1.7.1
  2. torchvision==0.8.2
  3. CUDA 11.0

I guess that you are using a distributed training setup (via DDP).
If that’s the case, the setup seems to be incorrect, since each process apparently creates a new CUDA context on every device. I’m not familiar with Lightning, but you could check whether your script uses only the specified GPU and doesn’t allocate tensors on all visible devices.
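As a starting point for that check, here is a minimal diagnostic sketch that each training process could call once at startup. The helper name `describe_process_devices` is hypothetical, and it assumes the launcher exposes the process's local rank in a `LOCAL_RANK` environment variable, which may not hold for every setup. It only reports what the process can see and does not allocate anything on a GPU.

```python
import os


def describe_process_devices(env=None):
    """Collect what this process can see; call once per DDP process."""
    env = os.environ if env is None else env
    info = {
        "pid": os.getpid(),
        # Assumption: the launcher sets LOCAL_RANK; it is None otherwise.
        "local_rank": env.get("LOCAL_RANK"),
        # None means no restriction, so every GPU on the node is visible.
        "cuda_visible_devices": env.get("CUDA_VISIBLE_DEVICES"),
    }
    try:
        import torch
    except ImportError:
        # Without PyTorch, only the environment can be reported.
        return info
    if torch.cuda.is_available():
        # Number of GPUs this process is able to address.
        info["device_count"] = torch.cuda.device_count()
    return info


if __name__ == "__main__":
    print(describe_process_devices())
```

If every process reports all four GPUs as visible, any allocation that does not name a device explicitly can land on a GPU that belongs to another rank, and the PIDs printed here can be matched against the ones listed by nvidia-smi.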

How can I check this specifically? Shall I check whether the tensors are moved to specific devices? Any suggestions?

You could check whether all devices are visible in each process, and then look for to('cuda') or cuda() calls.
If these operations are used, the tensor or module would be moved to the default device (GPU:0) in each process and could thus create multiple CUDA contexts.
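One way to look for such calls is a small text scan over the training script. The sketch below is only a heuristic: the helper name `find_unpinned_cuda_calls` is hypothetical, and the regular expression catches just the two literal patterns mentioned above (`.cuda()` with no argument and `.to('cuda')` with no device index), so it can miss devices passed through variables.

```python
import re

# Matches .cuda() with no argument and .to('cuda') / .to("cuda") with no index.
UNPINNED = re.compile(r"""\.cuda\(\s*\)|\.to\(\s*['"]cuda['"]\s*[,)]""")


def find_unpinned_cuda_calls(source):
    """Return (line_number, line) pairs that move data to the default GPU."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if UNPINNED.search(line):
            hits.append((number, line.strip()))
    return hits


# Example usage on a script (the file name is a placeholder):
# with open("train.py") as handle:
#     for number, line in find_unpinned_cuda_calls(handle.read()):
#         print(f"{number}: {line}")
```

For each hit, a common remedy in plain DDP scripts is to call `torch.cuda.set_device(local_rank)` at the start of each process, or to pass an explicit device such as `f"cuda:{local_rank}"`. Whether that applies unchanged under Lightning depends on how the script is structured.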

I have run into a similar problem while learning how to use 'DistributedDataParallel'. What is really strange is that there are extra processes on each GPU that consume 0 GPU RAM.

To reproduce: distributed_tutorial/ at master · yangkky/distributed_tutorial · GitHub
I followed the code exactly and made only three modifications.
`parser.add_argument('-g', '--gpus', default=3, type=int,
                    help='number of gpus per node')

os.environ['MASTER_ADDR'] = ''
os.environ['MASTER_PORT'] = '23556'`

My command-line arguments were 'python -n 1 -g 3 -nr 0'.
PIDs 6283, 6284, and 6285 are my processes, and GPU 3 is being used by others.
My dependencies: pytorch == 1.6.0, torchvision == 0.7.0, CUDA 10.2, 3 x RTX 2080Ti