Dist.barrier got out of memory

Hi Yuhongze,
Short answer: no.

Both errors I mentioned here were raised with the PyTorch that was built on server A (multi-node).
On a different server B (a single machine with multiple GPUs), PyTorch works fine (installed using conda).

To make the code work on server A, I had to install the upstream version of PyTorch/torchvision instead of the built one; this seems to have solved all the issues. I think the built version has issues with DDP.

pip install torch==1.10.0 -f https://download.pytorch.org/whl/cu111/torch-1.10.0%2Bcu111-cp37-cp37m-linux_x86_64.whl

pip install torchvision==0.11.1 -f https://download.pytorch.org/whl/cu111/torchvision-0.11.1%2Bcu111-cp37-cp37m-linux_x86_64.whl

torch: 1.10.0
torchvision: 0.11.1
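
For a quick sanity check after installing, these standard torch / torch.distributed calls show what the build actually supports (nothing here is specific to my setup):

```python
import torch
import torch.distributed as dist

print(torch.__version__)          # expect 1.10.0+cu111
print(torch.version.cuda)         # expect 11.1
print(torch.cuda.is_available())  # should be True on a GPU node
print(dist.is_available())        # distributed package compiled in
print(dist.is_nccl_available())   # NCCL backend usable
print(dist.is_gloo_available())   # Gloo backend usable
print(dist.is_mpi_available())    # MPI backend (only if torch was built with MPI)
```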

Also, this may depend on the server: the distributed backend could be the cause as well. In the second error above, the failure happened during the TCP call.
On one server, MPI works better (i.e., no error) than Gloo; on others, NCCL does.
So, try changing the backend, and check which backend is recommended for DDP on your server.
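
As a rough sketch of what I mean by switching the backend (the `backend` argument and `env://` init are the standard `torch.distributed` API; `DIST_BACKEND` is just a variable name I made up for the example, and `LOCAL_RANK` is what `torchrun` / `torch.distributed.launch --use_env` sets):

```python
import os
import torch
import torch.distributed as dist

# pick the backend here: "nccl" (GPU), "gloo" (CPU/GPU), or "mpi" (if torch was built with MPI)
backend = os.environ.get("DIST_BACKEND", "nccl")

# reads MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE from the environment
dist.init_process_group(backend=backend, init_method="env://")

# with nccl, each process must own exactly one GPU
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

dist.barrier()  # the call that was failing for me
```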
I didn't test this, but Horovod could be another option for the backend: Overview — Horovod documentation. You may need to slightly change the code for this backend: Horovod with PyTorch — Horovod documentation.
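
Again untested on my side, but based on the Horovod docs linked above, the changes usually look roughly like this (`build_model()` is just a placeholder for your own model code):

```python
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())  # one GPU per process

model = build_model().cuda()             # placeholder: your model here
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# wrap the optimizer so gradients are averaged across workers,
# and sync the initial model/optimizer state from rank 0
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
```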

Also, you could downgrade to torch 1.9.x.

Let me know how it goes.

Thanks