Distributed training creates multiple processes on GPU 0


I’ve recently started using PyTorch’s distributed training framework, following the imagenet example, for multi-node multi-GPU training. While the code runs, during the first epoch itself I see multiple extra processes starting on GPU 0 of both servers. They are not present when training starts. Judging from the GPU memory usage, the extra processes appear to be copies of the model (each has a fixed usage, e.g. 571M). Since one epoch takes ~12 hours for my use case, debugging step by step is not a feasible solution. I’ve made sure to pass args.gpu as the argument whenever I make a .cuda() call, and the model loading/saving is done as suggested in the imagenet example.

Are there any pointers to the probable cause of the issue (or some intelligent ways to debug the code)? Thanks in advance.

Could you please share the cmd you used to launch the processes?

Does the problem disappear if you set CUDA_VISIBLE_DEVICES when launching the process and do not pass --gpu (letting it use the default, and only visible, one)?

Hi Shen,

I always set the CUDA_VISIBLE_DEVICES for each run using export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 for example. The code runs on all the 8 GPUs with full utilization, so multiprocessing is surely working. The command I use is as follows on the two servers I’m using (with appropriate IP and port set):

python train_distributed.py --dist-url 'tcp://ip:port' --dist-backend 'gloo' --multiprocessing-distributed --world-size 2 --rank 0
python train_distributed.py --dist-url 'tcp://ip:port' --dist-backend 'gloo' --multiprocessing-distributed --world-size 2 --rank 1

Found the bug: we need to be careful to set the right GPU context when calling the clear_cache() function; otherwise it allocates a fixed chunk of memory on GPU 0 for each of the other GPUs. Relevant issue here.
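For reference, a minimal sketch of the fix, assuming clear_cache() wraps torch.cuda.empty_cache() (the helper name and its return value are my own for illustration; only empty_cache() and the torch.cuda.device context manager are real PyTorch APIs):

```python
import torch

def clear_cache(gpu: int) -> str:
    # Hypothetical helper standing in for the clear_cache() mentioned above.
    # Without a device context, torch.cuda.empty_cache() initializes a CUDA
    # context on GPU 0 in every process -- the fixed ~571M allocations seen
    # in nvidia-smi.
    if torch.cuda.is_available():
        with torch.cuda.device(gpu):  # pin the CUDA context to this worker's GPU
            torch.cuda.empty_cache()
    return f"cuda:{gpu}"  # device string this worker targets (for the demo below)

print(clear_cache(3))  # → cuda:3
```

Equivalently, calling torch.cuda.set_device(gpu) once at the start of each worker makes every subsequent cache or allocation call default to that GPU.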


I’m having the same problem now, but the sample code doesn’t call clear_cache().

Hi, what is your solution to this problem?

I basically had to check every function to see whether it was passed the right GPU as its context, so you should check any function that creates tensors, etc. In my case, the issue was that I was not passing the GPU context correctly in the clear_cache() function. I think any torch function that accepts a device parameter needs to be called in the right context; otherwise it may default to GPU 0.
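One way to audit this is to funnel every device choice through a single explicit helper instead of relying on a bare .cuda() or the default device. A sketch (pick_device and setup_worker are illustrative names, not from the imagenet example):

```python
import torch

def pick_device(gpu: int, cuda_available: bool) -> str:
    # A bare "cuda" (or an implicit default) aliases cuda:0, so every
    # worker must spell out its own device index explicitly.
    return f"cuda:{gpu}" if cuda_available else "cpu"

def setup_worker(gpu: int) -> torch.Tensor:
    device = torch.device(pick_device(gpu, torch.cuda.is_available()))
    if device.type == "cuda":
        torch.cuda.set_device(device)  # make bare .cuda() calls land here too
    # Allocate with an explicit device= so nothing defaults to GPU 0.
    return torch.zeros(4, device=device)

print(pick_device(3, True))   # → cuda:3
print(pick_device(3, False))  # → cpu
```

With this pattern, any remaining allocation on GPU 0 in nvidia-smi points at a call that bypassed the helper, which narrows the search without stepping through a 12-hour epoch.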

I also ran the example you mentioned (https://github.com/pytorch/examples/blob/main/imagenet/main.py), but I checked it carefully and found nothing suspicious.