dist.barrier() raises a CUDA out of memory error

I get this error randomly on the cluster, and I am clueless about what the problem could be. One possibility is that multiple jobs share the same machine and interfere with each other, but it's unlikely the GPUs are already occupied: I print nvidia-smi in the log and they look fine right before the code runs.

 File "/home/code/utils/distributed.py", line 128, in synchronize
   dist.barrier()
 File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1710, in barrier
   work = _default_pg.barrier()
RuntimeError: CUDA error: out of memory

If someone has seen this error and found a solution, or has a hint about why it happens, please let me know.
One more thing: I am running someone else's code, and I don't know why they put this barrier after the DataLoader initialization and before wrapping the model in DistributedDataParallel (see the sketch below). Is using a barrier there recommended?
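For reference, the ordering in that code looks roughly like this (a minimal sketch with a toy model and data, not the actual project code; it assumes the script is launched via torchrun so the rank/world-size environment variables are set):

import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset

dist.init_process_group(backend="nccl", init_method="env://")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# the DataLoader is created first ...
dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
loader = DataLoader(dataset, batch_size=8)
model = nn.Linear(10, 1).cuda(local_rank)

# ... then the barrier in question, then the DDP wrap
dist.barrier()
model = DDP(model, device_ids=[local_rank])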

I don't think the barrier causes the OOM, but it might surface the error, since it synchronizes the code.
To avoid it, you would have to lower the batch size or slim down the model or inputs.

Also, make sure that the GPU memory is not increasing in each iteration, which can happen if you are (unintentionally) keeping the complete computation graph alive by storing tensors that are still attached to it.
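As an illustration of that last point (a minimal toy sketch, not related to the code above): accumulating the raw loss tensor keeps every iteration's graph alive, while .item() stores only a Python float.

import torch
from torch import nn

model = nn.Linear(10, 1).cuda()
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

running_loss = 0.0
for _ in range(100):
    x = torch.randn(32, 10, device="cuda")
    y = torch.randn(32, 1, device="cuda")

    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

    # running_loss += loss        # would keep each iteration's graph alive
    running_loss += loss.item()   # detaches: only a Python float is stored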

It actually happens before the first iteration, which is very weird. Currently I just resubmit and hope it will work this time. My guess is that it's an issue with some particular machine, but I don't know where to start diagnosing it.

Are your GPUs empty before you start the training?
If that’s the case, you can check the memory after creating the model, data etc. via print(torch.cuda.memory_allocated()).
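Something along these lines (a minimal sketch; the model and data here are placeholders, not the actual training code):

import torch
from torch import nn

print(torch.cuda.memory_allocated() / 1024**2, "MB allocated at startup")

model = nn.Linear(1024, 1024).cuda()
print(torch.cuda.memory_allocated() / 1024**2, "MB allocated after creating the model")

data = torch.randn(256, 1024, device="cuda")
print(torch.cuda.memory_allocated() / 1024**2, "MB allocated after creating the data")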

I have also seen similar issues. After init_process_group, I run dist.barrier() before creating the data loader, the model, etc. It crashes with OOM at dist.barrier(), which is quite strange. Normally I resubmit the job, the new job may be allocated to different nodes, and then it sometimes runs smoothly. But this problem happens almost every day. GPU memory is at 0 before it hits init_process_group(). PyTorch 1.6.

I am hitting the same error: I got OOM before the model was even created, with 30+ GB of GPU memory available.
@amsword Is your issue resolved?

It was a hardware issue on my side. Try using different nodes, or check whether the node has a hardware problem.

Any updates on this?
I am hitting the same issue.
A GPU with 32 GB runs into "CUDA error: out of memory" while I am still parsing the args… it is so weird.
The code runs fine on a 16 GB GPU and uses about 11 GB on a local machine.

It happened at the barrier. I am using DDP but with only one GPU.

The call is simply:

import torch.distributed as dist
dist.barrier()

and it fails with:

  File "python3.7/site-packages/torch/distributed/distributed_c10d.py", line 2524, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: CUDA error: out of memory

Any idea? This happened on a cluster with multiple nodes; I am using only one node with a single GPU.
PyTorch 1.9.0
CUDA 11.1.1
GPU type: NVIDIA V100 (Volta)

thanks

Could you update to the latest stable release (or the nightly) and check if you are hitting the same issue? If so, could you post a minimal executable code snippet to reproduce the issue, please?

I have to ask the admins to upgrade to torch 1.10.0 and, if possible, to the nightly.
They rebuild PyTorch to optimize it for the cluster's hardware.
I'll let you know.
I already asked them to upgrade to 1.10.0 (+ torchvision) a couple of days ago; it takes time…
I don't want to install from PyPI, as it may not reproduce the same error,
and the admins recommend installing the packages prebuilt on the cluster.

thanks

I upgraded to torch 1.10.0 and torchvision 0.11.1, built on the server with CUDA 11.4.
Now I am getting a different error at the very first call (probably while loading the torch libraries). The error does not point to a line in my code, but it seems related to loading distributed…
Any idea what could be the cause of this error?

terminate called after throwing an instance of 'c10::Error'
  what():  Socket Timeout
Exception raised from recvBytes at /tmp-folder/pytorch_build_2021-11-09_14-57-01/moduleX/python3.7/pytorch/torch/csrc/distributed/c10d/Utils.hpp:619 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x55 (0x2b10898cb905 in VENV-NAME/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0xd6 (0x2b10898ad2a9 in VENV-NAME/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0xe8 (0x2b10673a9058 in VENV-NAME/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)

the same versions installed using conda work fine on a local server.

thanks

Hello Soufiane Belharbi,
Did you solve this issue? I encountered exactly the same problem. Thank you!

Hi Yuhongze,
Short answer: no.

Both errors I mentioned here were raised with a PyTorch build compiled on server A (multi-node).
On a different server B (a single machine with multiple GPUs), PyTorch works fine (installed using conda).

To make the code work on server A, I had to install the upstream version of pytorch/torchvision instead of the locally built version. This seems to have solved all the issues; I think the built version has problems with DDP.

pip install torch==1.10.0 -f https://download.pytorch.org/whl/cu111/torch-1.10.0%2Bcu111-cp37-cp37m-linux_x86_64.whl

pip install torchvision==0.11.1 -f https://download.pytorch.org/whl/cu111/torchvision-0.11.1%2Bcu111-cp37-cp37m-linux_x86_64.whl

torch: 1.10.0
torchvision: 0.11.1

Also, this may depend on the server; the distributed backend could also be the cause. In the second error above, the failure happened during a TCP call.
On one server, mpi works better (i.e., no error) than gloo; on others, nccl does.
So try changing the backend, and check which backend is recommended for DDP on your server.
I didn't test this, but Horovod could be another option as a backend: Overview — Horovod documentation. You may need to change the code slightly for this backend: Horovod with PyTorch — Horovod documentation.
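For example, switching the backend only touches the init_process_group call (a minimal sketch; which backends are actually available depends on how PyTorch was built on your cluster, and "mpi" only exists if it was compiled with MPI support):

import torch
import torch.distributed as dist

backend = "nccl"  # try "gloo" or "mpi" if nccl misbehaves on your cluster

dist.init_process_group(backend=backend, init_method="env://")
if backend == "nccl":
    # each rank needs its own GPU for the nccl barrier
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

dist.barrier()
print(f"rank {dist.get_rank()} passed the barrier with backend {backend}")
dist.destroy_process_group()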

Also, you can downgrade to torch 1.9.x.

Let me know how it goes.

thanks

Resetting all GPUs resolved the problem.

sudo nvidia-smi -r

You can check which processes are actively using the GPUs with the command below (you should exit all of them before the reset):

sudo lsof /dev/nvidia*
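To confirm the GPUs are actually free after the reset, a quick check from Python (this assumes a fairly recent PyTorch where torch.cuda.mem_get_info is available):

import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 1024**3:.1f} GiB free of {total / 1024**3:.1f} GiB")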