I have a question about a basic concept: when we launch a distributed data parallel (DDP) job over n GPUs, how many processes should each GPU have?
I ask because I notice different patterns when running nvidia-smi. Sometimes each GPU has a single PID, and in total there are n different PIDs. That’s what I would expect.
However, sometimes each GPU has n PIDs, with one PID using actual memory and the other (n-1) using 0 memory. Moreover, all GPUs share the same set of PIDs. That’s hard for me to understand. Are the PIDs using zero memory due to communication?
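For what it’s worth, a common cause of this second pattern is that each worker process initializes a CUDA context on every GPU it can see (for example, by touching a CUDA API before pinning its device); the near-zero-memory PIDs are those idle contexts rather than communication traffic. One usual remedy is to restrict each process’s visible devices before any CUDA library is initialized. A minimal sketch, where the helper name and GPU list are hypothetical:

```python
import os

def pin_process_to_gpu(rank, gpu_ids):
    """Restrict this process to one physical GPU by setting
    CUDA_VISIBLE_DEVICES before any CUDA library initializes.
    Inside the process, the pinned GPU then appears as device 0."""
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_ids[rank])
    return os.environ["CUDA_VISIBLE_DEVICES"]

# Example: a world_size=4 job on physical GPUs 0, 3, 4, 5.
# Rank 2 would see only physical GPU 4 (as its local device 0).
print(pin_process_to_gpu(2, [0, 3, 4, 5]))  # -> 4
```

With this in place, each process can only ever create a context on its own GPU, so nvidia-smi should show one PID per GPU.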
You’re right that DDP training should use one GPU per process.
Could you share an example nvidia-smi output showing the second case you’ve mentioned? In general, this can happen if memory is allocated or operations are run on a GPU device that is not “assigned” to that process; if DDP itself, rather than the application, is doing this, it is likely an issue we should fix. A script to reproduce would also be helpful.
This is a screenshot of nvidia-smi’s output for the second case. Let me explain a bit: there are 10 GPUs on this node. Someone else is using GPUs 1 and 2, and GPUs 6-9 are not used. I’m using GPUs 0, 3, 4, and 5, running a DDP job with world_size=4, i.e., n=4.
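In case the screenshot is hard to read, the pattern can also be captured in text form by parsing the output of `nvidia-smi --query-compute-apps=gpu_uuid,pid,used_memory --format=csv,noheader`. A rough sketch with made-up sample data (two GPUs shown for brevity) illustrating the second case, where every GPU lists all the job’s PIDs but each PID only uses real memory on one GPU:

```python
from collections import defaultdict

# Hypothetical output of:
#   nvidia-smi --query-compute-apps=gpu_uuid,pid,used_memory --format=csv,noheader
# The '0 MiB' rows correspond to idle CUDA contexts, not active workers.
sample = """\
GPU-aaaa, 1001, 9500 MiB
GPU-aaaa, 1002, 0 MiB
GPU-aaaa, 1003, 0 MiB
GPU-aaaa, 1004, 0 MiB
GPU-bbbb, 1001, 0 MiB
GPU-bbbb, 1002, 9500 MiB
GPU-bbbb, 1003, 0 MiB
GPU-bbbb, 1004, 0 MiB
"""

def pids_per_gpu(text):
    """Map each GPU UUID to the set of PIDs holding a context on it,
    and to the subset of PIDs actually using memory there."""
    contexts, active = defaultdict(set), defaultdict(set)
    for line in text.strip().splitlines():
        uuid, pid, mem = [field.strip() for field in line.split(",")]
        contexts[uuid].add(pid)
        if not mem.startswith("0 "):
            active[uuid].add(pid)
    return contexts, active

ctx, act = pids_per_gpu(sample)
print({g: len(p) for g, p in ctx.items()})       # 4 context PIDs per GPU
print({g: sorted(p) for g, p in act.items()})    # 1 active PID per GPU
```

This matches what I see: each of my four GPUs shows all four PIDs, but only one PID per GPU uses real memory.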