I was running ImageNet training using DDP.
On the DGX Station (4× V100), it creates many processes.
But when I run the same program with the same options on an 8× RTX 2080 Ti server, it creates only 9 processes.
It is the same code and the same program. Going by what I see on the 8× RTX 2080 Ti server, I would expect the DGX to create only 5 processes, but it does not.
Why does this happen? I used the same PyTorch Docker image on both systems.
And if I want to control the number of processes, how do I do that?
What are these processes assigned to?
Could you expand the drop-down menu and compare both machines?
Also, how many CPUs does the RTX2080Ti workstation have? Note that your DGX station reports 40, which might also be the reason why it’s able to use more processes.
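If the extra processes turn out to be data loading workers (an assumption, since the process list alone does not confirm it), one rough way to keep the count predictable on both machines is to derive the per-GPU worker budget from the CPU count instead of relying on defaults. The helper name `workers_per_rank` and the `reserved_per_rank` parameter below are hypothetical and only meant as a sketch:

```python
import os


def workers_per_rank(cpu_count, gpus_per_node, reserved_per_rank=1):
    """Split the CPU cores evenly across DDP ranks (one rank per GPU).

    reserved_per_rank cores are kept free for each rank's main training
    process; the remainder is the suggested number of loader workers.
    Hypothetical helper for illustration, not a PyTorch API.
    """
    if gpus_per_node < 1:
        raise ValueError("gpus_per_node must be at least 1")
    per_rank = cpu_count // gpus_per_node - reserved_per_rank
    # Never return a negative worker count; 0 means "load in the main process".
    return max(per_rank, 0)


if __name__ == "__main__":
    cpus = os.cpu_count() or 1
    for gpus in (4, 8):
        n = workers_per_rank(cpus, gpus)
        print(f"{cpus} CPUs, {gpus} GPUs -> {n} loader workers per rank")
```

The result could then be passed as `num_workers` to `torch.utils.data.DataLoader` in each rank, and CPU threads per process can additionally be limited with `torch.set_num_threads` or the `OMP_NUM_THREADS` environment variable.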
Now I guess the cause is Xorg or some other default process that was running at the same time. Thanks for helping me.
Can I ask another question?
I now suspect that the very long AllReduce and Broadcast calls are caused by a late Memcpy H2D. In the figure above, GPUs 1, 3, and 4 call H2D (the green bar) at a similar time, but GPU 2 calls H2D late. This unsynchronized behavior makes multi-GPU training inefficient.
So all GPUs end up waiting on GPU 2. Can I solve this problem? For example, could I issue the Memcpy H2D on a new stream, or is there another good approach?
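The idea of moving the H2D copy to its own stream can be sketched as a small prefetcher that copies the next batch on a side stream while the current batch is being processed. This is only a minimal sketch: the class name `CudaPrefetcher` is hypothetical, it assumes the loader yields `(inputs, targets)` tensor pairs, and the copy is only truly asynchronous when the source tensors are in pinned memory (for example `DataLoader(..., pin_memory=True)`). It may not help if GPU 2 is late because its batch arrives late from the CPU side.

```python
import torch


class CudaPrefetcher:
    """Copy the next (inputs, targets) batch to the GPU on a side stream.

    Hypothetical helper for illustration. On a CPU-only device it simply
    passes batches through unchanged.
    """

    def __init__(self, loader, device):
        self.loader = iter(loader)
        self.device = torch.device(device)
        self.use_cuda = self.device.type == "cuda"
        # Dedicated stream for host-to-device copies (CUDA only).
        self.stream = torch.cuda.Stream(device=self.device) if self.use_cuda else None
        self.next_batch = None
        self._preload()

    def _preload(self):
        try:
            inputs, targets = next(self.loader)
        except StopIteration:
            self.next_batch = None
            return
        if self.use_cuda:
            # Issue the copies on the side stream so they can overlap with compute.
            with torch.cuda.stream(self.stream):
                inputs = inputs.to(self.device, non_blocking=True)
                targets = targets.to(self.device, non_blocking=True)
        self.next_batch = (inputs, targets)

    def __iter__(self):
        return self

    def __next__(self):
        if self.next_batch is None:
            raise StopIteration
        inputs, targets = self.next_batch
        if self.use_cuda:
            current = torch.cuda.current_stream(self.device)
            # Make the compute stream wait until the copy has finished.
            current.wait_stream(self.stream)
            # Tell the allocator these tensors are now used on the compute stream.
            inputs.record_stream(current)
            targets.record_stream(current)
        # Start copying the following batch before returning this one.
        self._preload()
        return inputs, targets
```

In the training loop this would be used as `for inputs, targets in CudaPrefetcher(train_loader, device): ...`. Whether it removes the long AllReduce wait should be checked in the profiler again, since AllReduce still has to wait for the slowest rank.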