CUDA OOM in distributed training (without NVLink)

I use dist.all_reduce on the parameters of my model. The parameters are on GPU. I use a GTX 1080, so there is no NVLink.

For some strange reason, I get RuntimeError: CUDA error: out of memory, even though there is still enough memory left.

I use dist.init_process_group(backend=None), so that should initialize both NCCL and Gloo. As the parameters are on GPU, it uses NCCL for them. But as there is no NVLink, my assumption was that it would first copy them to CPU, then do the all-reduce, then move them back to GPU. But maybe that is not the case? Does it collect all parameters from all workers first (CPU → CPU → network → CPU → GPU), so that it has num_workers * parameter_size in GPU memory, and then do the reduce on the GPU?

But even if that were the case, it still would not really explain the OOM: this GPU has 10.9GB of memory, the parameters themselves only take 615.9MB, and during this dist.all_reduce not much other GPU memory should be in use, except maybe a bit that is reserved, or CUDA kernel caches or so.

Does NCCL use the same CUDA memory allocator as PyTorch? If not, maybe that is the problem, that PyTorch actually has reserved most of the memory?
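One way to check the "PyTorch has reserved most of the memory" hypothesis is to compare what the CUDA driver reports as free with what PyTorch's caching allocator holds, right before the dist.all_reduce call. The following is only a minimal sketch: the helper names format_mib and report_gpu_memory are made up for illustration, and it assumes that memory cached by PyTorch is not available to allocations made outside of PyTorch.

```python
import torch


def format_mib(num_bytes: int) -> str:
    """Format a byte count as mebibytes with one decimal place."""
    return f"{num_bytes / 2**20:.1f} MiB"


def report_gpu_memory(device: int = 0) -> dict:
    """Collect driver-level and PyTorch-level memory numbers for one GPU."""
    free, total = torch.cuda.mem_get_info(device)  # as seen by the CUDA driver
    stats = {
        "total": total,
        "free_for_new_cuda_allocs": free,
        "torch_allocated": torch.cuda.memory_allocated(device),  # live tensors
        "torch_reserved": torch.cuda.memory_reserved(device),  # live + cached blocks
    }
    for name, value in stats.items():
        print(f"cuda:{device} {name:>26}: {format_mib(value)}")
    return stats


# Only query the GPU when one is actually present.
if torch.cuda.is_available():
    report_gpu_memory(torch.cuda.current_device())
```

If the free value is small while torch_reserved is much larger than torch_allocated, calling torch.cuda.empty_cache() before the collective returns unused cached blocks to the driver. Whether that is enough in this particular case is untested.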

Using Gloo causes a very similar error:

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: out of memory

However, one workaround I found:

  • Before the all-reduce, move all the params to the CPU ("cpu").
  • Then dist.all_reduce on all the params. (This would then use Gloo on CPU.)
  • Then move them back to the GPU ("cuda") again.
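The steps of this workaround could be sketched roughly as follows. This is not the exact code from the linked issue: the function name all_reduce_params_via_cpu and the average flag are made up for illustration, the copy is done per parameter instead of moving the whole model at once, and it assumes the process group has a backend that handles CPU tensors (Gloo).

```python
import torch
import torch.distributed as dist


def all_reduce_params_via_cpu(module: torch.nn.Module, average: bool = True) -> None:
    """All-reduce every parameter through a CPU copy (hypothetical helper).

    Assumes dist.init_process_group was already called with a backend
    that supports CPU tensors, e.g. Gloo.
    """
    world_size = dist.get_world_size()
    with torch.no_grad():
        for param in module.parameters():
            # Step 1: stage the parameter on the CPU.
            cpu_copy = param.detach().to("cpu")
            # Step 2: the collective now runs on a CPU tensor.
            dist.all_reduce(cpu_copy, op=dist.ReduceOp.SUM)
            if average:
                cpu_copy /= world_size
            # Step 3: write the result back to wherever the parameter lives.
            param.copy_(cpu_copy)
```

Going parameter by parameter keeps the extra host memory small, at the cost of many small collectives. Flattening the parameters into one CPU buffer first would likely be faster, but that is a separate design choice.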

More details here: PyTorch CUDA OOM in distributed training · Issue #1482 · rwth-i6/returnn · GitHub
Maybe related PyTorch issue: OOM error for collection communication primitive provided by torch.distributed · Issue #116177 · pytorch/pytorch · GitHub
Maybe related NCCL issue: ncclInternalError during torch all_gather_object · Issue #962 · NVIDIA/nccl · GitHub

Hmm that is interesting. What is the world size? Is there a chance that tensors are mistakenly being created on a single device instead of per device? Are you calling torch.cuda.set_device for each rank?
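For reference, pinning each rank to its own GPU usually looks something like the sketch below. It assumes the job is started with a launcher such as torchrun, which sets the LOCAL_RANK environment variable. The helper names local_rank_from_env and init_distributed are made up for illustration.

```python
import os

import torch
import torch.distributed as dist


def local_rank_from_env(default: int = 0) -> int:
    """Read the per-node rank that a launcher like torchrun exports."""
    return int(os.environ.get("LOCAL_RANK", default))


def init_distributed() -> int:
    """Select this rank's GPU, then create the default process group."""
    local_rank = local_rank_from_env()
    if torch.cuda.is_available():
        # Without this, every process may create its CUDA context on
        # device 0 and use up memory there.
        torch.cuda.set_device(local_rank)
    # backend=None as in the original post; relies on the env:// settings
    # (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) set by the launcher.
    dist.init_process_group(backend=None)
    return local_rank
```

If several ranks end up on the same device, nvidia-smi would show multiple processes on GPU 0, each holding its own CUDA context.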

Does NCCL use the same CUDA memory allocator as PyTorch?

To my understanding, NCCL doesn’t allocate CUDA memory, only PyTorch does. NCCL’s responsibility is to take a buffer of CUDA memory and communicate it to another device efficiently. If NVLink is not available, it will use PCIe, but that does not require any GPU → CPU copy.

Recently, we released a blog post about debugging OOMs with the memory profiler. Perhaps that could help narrow down the issue: Understanding GPU Memory 1: Visualizing All Allocations over Time | PyTorch

4 workers, all running on the same node.

No. It usually works just fine; the problem only shows up now that I have a somewhat larger, different model.

I think the claim that NCCL doesn’t allocate CUDA memory is wrong. The CUDA OOM error clearly comes from NCCL, which makes its own CUDA allocations. See also the referenced issues, e.g. the NCCL issue, where the NCCL debug output shows that it tried to allocate some CUDA memory.

Can you try running with environment variable NCCL_DEBUG=INFO and get some of the nccl logs to see?
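If it is easier than changing the launch command, the variable can also be set from Python, as long as that happens before NCCL is initialized. A minimal sketch follows: the helper name enable_nccl_debug is made up, and the choice of NCCL_DEBUG_SUBSYS=INIT,ALLOC is only a guess at which subsystems are relevant for an allocation failure.

```python
import os


def enable_nccl_debug(subsystems: str = "INIT,ALLOC") -> None:
    """Turn on NCCL info logging for this process (hypothetical helper).

    Must run before the NCCL communicator is created, i.e. before the
    first collective on a CUDA tensor.
    """
    os.environ["NCCL_DEBUG"] = "INFO"
    # Restrict the log to selected subsystems; the default choice here
    # is an assumption, use "ALL" to see everything.
    os.environ["NCCL_DEBUG_SUBSYS"] = subsystems


enable_nccl_debug()
```

Setting the variables on the command line (NCCL_DEBUG=INFO python ...) has the same effect and avoids any ordering problems.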

After using NCCL_DEBUG=INFO, I don’t get too much output, only this:

cn-246:1131332:1131332 [0] NCCL INFO Bootstrap : Using enp5s0:<0>
cn-246:1131332:1131332 [0] NCCL INFO NET/Plugin : Plugin load ( returned 2 : cannot open shared object file: No such file or directory
cn-246:1131332:1131332 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
cn-246:1131332:1131332 [0] NCCL INFO cudaDriverVersion 12010
NCCL version 2.18.1+cuda12.1
cn-246:1131332:1131332 [0] NCCL INFO comm 0x8e232d40 rank 0 nranks 0 cudaDev 0 busId 0 - Abort COMPLETE
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Note that in the referenced NCCL issue, the NCCL log also contains the following, which I currently don’t see in my NCCL log:

node:3441504:3561297 [0] include/alloc.h:178 NCCL WARN Cuda failure 'out of memory'
node:3441504:3561297 [0] include/alloc.h:185 NCCL WARN Failed to CUDA calloc 10485760 bytes

I also wanted to ask about the point that, without NVLink, NCCL uses PCIe and does not need a GPU → CPU copy. I was not aware that it would do that. Is that always possible between GPUs within a single node? How can I see that it indeed does that?
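One partial check from the PyTorch side is to ask CUDA whether each pair of GPUs can access each other's memory directly. The sketch below uses torch.cuda.can_device_access_peer; the helper name peer_access_matrix is made up. Note that this only shows what the driver allows, not which transport NCCL actually picks.

```python
import torch


def peer_access_matrix() -> list[list[bool]]:
    """Return m[i][j] == True if GPU i can directly access GPU j's memory."""
    n = torch.cuda.device_count()
    return [
        [i == j or torch.cuda.can_device_access_peer(i, j) for j in range(n)]
        for i in range(n)
    ]


for src, row in enumerate(peer_access_matrix()):
    print(f"cuda:{src} ->", ["yes" if ok else "no" for ok in row])
```

As far as I know, the NCCL_DEBUG=INFO output also names the transport chosen for each channel once the communicator is fully set up, and nvidia-smi topo -m prints how the GPUs are connected. Neither was verified on this machine.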