Problem in deploying DDP

leonleeldc · February 7, 2021, 12:01am

Hey, I am using DDP for model training. In the script, I add a line as,
dist.init_process_group(backend=“nccl”) besides using model = DDP(model)

Then, I ran the script in command line as,
python -m torch.distributed.launch --nproc_per_node=8 --nnode=1 --node_rank=0 paraphrase_simpletransformers.py

But then it will say OOM after it is running. I checked nvidia-smi, I found that each pid must have one gpu and also occupy the 0th gpu. That is why OOM happens. Not sure why this happens? May you give help here, thanks in advance.

±------------------------------±---------------------±---------------------+

±----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 82610 C …e/anaconda3/envs/pytorch_p36/bin/python 3071MiB |
| 0 82611 C …e/anaconda3/envs/pytorch_p36/bin/python 409MiB |
| 0 82612 C …e/anaconda3/envs/pytorch_p36/bin/python 409MiB |
| 0 82614 C …e/anaconda3/envs/pytorch_p36/bin/python 403MiB |
| 0 82615 C …e/anaconda3/envs/pytorch_p36/bin/python 409MiB |
| 0 82616 C …e/anaconda3/envs/pytorch_p36/bin/python 411MiB |
| 0 82617 C …e/anaconda3/envs/pytorch_p36/bin/python 405MiB |
| 0 82618 C …e/anaconda3/envs/pytorch_p36/bin/python 409MiB |
| 1 82611 C …e/anaconda3/envs/pytorch_p36/bin/python 1517MiB |
| 2 82612 C …e/anaconda3/envs/pytorch_p36/bin/python 1517MiB |
| 3 82614 C …e/anaconda3/envs/pytorch_p36/bin/python 1497MiB |
| 4 82615 C …e/anaconda3/envs/pytorch_p36/bin/python 1497MiB |
| 5 82616 C …e/anaconda3/envs/pytorch_p36/bin/python 1537MiB |
| 6 82617 C …e/anaconda3/envs/pytorch_p36/bin/python 1537MiB |
| 7 82618 C …e/anaconda3/envs/pytorch_p36/bin/python 1537MiB |
±----------------------------------------------------------------------------+

leonleeldc · February 7, 2021, 12:04am

I also tried to change the number of npro_per-node=4 as following,
python -m torch.distributed.launch --nproc_per_node=4 --nnode=1 --node_rank=0 paraphrase_simpletransformers.py

Then, I did nvidia-smi, it becomes,
±----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 85910 C …e/anaconda3/envs/pytorch_p36/bin/python 4925MiB |
| 0 85911 C …e/anaconda3/envs/pytorch_p36/bin/python 3175MiB |
| 0 85912 C …e/anaconda3/envs/pytorch_p36/bin/python 3175MiB |
| 0 85913 C …e/anaconda3/envs/pytorch_p36/bin/python 3175MiB |
| 1 85910 C …e/anaconda3/envs/pytorch_p36/bin/python 1421MiB |
| 1 85911 C …e/anaconda3/envs/pytorch_p36/bin/python 1393MiB |
| 2 85910 C …e/anaconda3/envs/pytorch_p36/bin/python 1431MiB |
| 2 85912 C …e/anaconda3/envs/pytorch_p36/bin/python 1465MiB |
| 3 85910 C …e/anaconda3/envs/pytorch_p36/bin/python 1433MiB |
| 3 85913 C …e/anaconda3/envs/pytorch_p36/bin/python 1465MiB |
| 4 85910 C …e/anaconda3/envs/pytorch_p36/bin/python 1431MiB |
| 5 85910 C …e/anaconda3/envs/pytorch_p36/bin/python 1433MiB |
| 6 85910 C …e/anaconda3/envs/pytorch_p36/bin/python 1433MiB |
| 7 85910 C …e/anaconda3/envs/pytorch_p36/bin/python 1433MiB |
±----------------------------------------------------------------------------+

osalpekar · February 7, 2021, 7:15am

Can you set the env var CUDA_VISIBLE_DEVICES and specify which GPUs to use before launching training? This will at least clarify whether this is a problem with processes binding to GPUs or something else.

Here’s a good explanation of the env var: cuda - How do I select which GPU to run a job on? - Stack Overflow