Hey, I am using DDP for model training. In the script, I added the line
dist.init_process_group(backend="nccl") in addition to wrapping the model with model = DDP(model).
Then I launched the script from the command line as:
python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 paraphrase_simpletransformers.py
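For context, the relevant part of the script looks roughly like this (a minimal sketch, not the full training code; `parse_local_rank`, `setup_ddp`, and the `Linear` model are placeholders I am using here for illustration):

```python
import argparse

def parse_local_rank(argv=None):
    # torch.distributed.launch passes --local_rank=<n> to each worker process
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args, _ = parser.parse_known_args(argv)
    return args.local_rank

def setup_ddp(local_rank):
    # Imports kept local so the sketch stays self-contained
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    dist.init_process_group(backend="nccl")  # as in the script above
    model = torch.nn.Linear(10, 10).cuda()   # placeholder for the real model
    # Wrapped exactly as described above; note that nothing in the script
    # calls torch.cuda.set_device(local_rank) or passes device_ids, so
    # .cuda() resolves to the default device (cuda:0) in every process.
    return DDP(model)
```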
But it hits OOM shortly after it starts running. Checking nvidia-smi, I found that each PID, in addition to its own GPU, also allocates memory on GPU 0, which is why the OOM happens. I am not sure why this is; could you help? Thanks in advance.
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     82610      C   …e/anaconda3/envs/pytorch_p36/bin/python    3071MiB |
|    0     82611      C   …e/anaconda3/envs/pytorch_p36/bin/python     409MiB |
|    0     82612      C   …e/anaconda3/envs/pytorch_p36/bin/python     409MiB |
|    0     82614      C   …e/anaconda3/envs/pytorch_p36/bin/python     403MiB |
|    0     82615      C   …e/anaconda3/envs/pytorch_p36/bin/python     409MiB |
|    0     82616      C   …e/anaconda3/envs/pytorch_p36/bin/python     411MiB |
|    0     82617      C   …e/anaconda3/envs/pytorch_p36/bin/python     405MiB |
|    0     82618      C   …e/anaconda3/envs/pytorch_p36/bin/python     409MiB |
|    1     82611      C   …e/anaconda3/envs/pytorch_p36/bin/python    1517MiB |
|    2     82612      C   …e/anaconda3/envs/pytorch_p36/bin/python    1517MiB |
|    3     82614      C   …e/anaconda3/envs/pytorch_p36/bin/python    1497MiB |
|    4     82615      C   …e/anaconda3/envs/pytorch_p36/bin/python    1497MiB |
|    5     82616      C   …e/anaconda3/envs/pytorch_p36/bin/python    1537MiB |
|    6     82617      C   …e/anaconda3/envs/pytorch_p36/bin/python    1537MiB |
|    7     82618      C   …e/anaconda3/envs/pytorch_p36/bin/python    1537MiB |
+-----------------------------------------------------------------------------+