CUDA issue with DDP: CUDA out of memory error (rank 0)

I am experiencing issues when using a Vision Transformer-based model with DistributedDataParallel (DDP). Everything works fine with plain DP, but switching to DDP causes the following problems.

Problem 1: CUDA out of memory on rank 0 GPU

This issue doesn’t occur with DP. Despite setting the device to the rank and calling torch.cuda.empty_cache() before loading the model, memory allocation on rank 0 leads to CUDA out of memory.

    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    torch.cuda.set_device(rank)
    torch.cuda.empty_cache()
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    encoder = encoder.to(rank)
    mlp = mlp.to(rank)
    encoder = DDP(encoder, device_ids=[rank])
    mlp = DDP(mlp, device_ids=[rank])
...
        for bat_idx, (x, label) in enumerate(dataloader):
            x, label = x.to(rank), label.to(rank).long()
            emb = encoder(x)

  File "", line 282, in
    emb = encoder(x)
  File "/home/user/miniconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/miniconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1593, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 430.00 MiB. GPU

Problem 2: CUDA error: invalid device ordinal

When setting `CUDA_VISIBLE_DEVICES="2,3,4,5"`, I encountered 'CUDA error: invalid device ordinal' when calling `torch.cuda.set_device(rank)`. This error does not occur if I don't set `CUDA_VISIBLE_DEVICES`.

 File "", line 212, in
    torch.cuda.set_device(rank)
  File "/home/user/miniconda3/envs/deploy/lib/python3.9/site-packages/torch/cuda/__init__.py", line 399, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Any suggestions to address these issues would be greatly appreciated.
Thank you!

Check the memory usage via nvidia-smi to see if e.g. multiple processes are running on the default device.
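
As a rough illustration (not from the original post), the same check can be done from inside the script by logging this process's allocations on every visible device right after the model is moved to its rank; the helper name below is made up for the example:

    import torch

    def log_gpu_memory(rank):
        # Hypothetical helper: report this process's allocations on each visible GPU.
        # Non-zero values on cuda:0 reported by a rank > 0 process mean tensors or a
        # CUDA context are still being created on the default device.
        for dev in range(torch.cuda.device_count()):
            alloc_mib = torch.cuda.memory_allocated(dev) / 2**20
            reserved_mib = torch.cuda.memory_reserved(dev) / 2**20
            print(f"rank {rank} | cuda:{dev} | "
                  f"allocated {alloc_mib:.0f} MiB | reserved {reserved_mib:.0f} MiB")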

For the second issue: narrow down which line of code fails and make sure each process uses only its corresponding device.
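
A minimal sketch of that per-process device handling, assuming `CUDA_VISIBLE_DEVICES="2,3,4,5"` is exported before the script launches (function and variable names are illustrative, not taken from the original code):

    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp

    def main_worker(rank, world_size):
        # With CUDA_VISIBLE_DEVICES="2,3,4,5" the visible GPUs are renumbered
        # cuda:0 ... cuda:3, so set_device expects rank in [0, world_size - 1],
        # not the physical IDs 2-5.
        os.environ['MASTER_ADDR'] = 'localhost'
        os.environ['MASTER_PORT'] = '12355'
        torch.cuda.set_device(rank)
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
        # ... build the model, wrap it in DDP with device_ids=[rank], train ...
        dist.destroy_process_group()

    if __name__ == "__main__":
        # world_size must match the number of *visible* devices (4 here),
        # otherwise set_device(rank) fails with "invalid device ordinal".
        world_size = torch.cuda.device_count()
        mp.spawn(main_worker, args=(world_size,), nprocs=world_size)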

For the first question, it was due to a misunderstanding on my part.
I believed that if I set batch_size=256 for 4 GPUs, then each GPU would be given 64 samples, similar to how DP scatters the data.
However, in DDP each process runs its own DataLoader, so the full batch size (256) was given to every GPU, which is what exhausted the memory on each device.
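
For anyone hitting the same issue, a hedged sketch of the usual pattern: in DDP the batch_size passed to each process's DataLoader is the per-GPU batch size, and a DistributedSampler gives each rank a disjoint shard of the data (names below are illustrative):

    import torch
    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler

    def build_loader(dataset, rank, world_size, global_batch_size=256):
        # Each DDP process owns its own DataLoader, so divide the global batch
        # size by the number of processes to get the per-GPU batch size.
        per_gpu_batch_size = global_batch_size // world_size   # 256 // 4 = 64
        sampler = DistributedSampler(dataset, num_replicas=world_size,
                                     rank=rank, shuffle=True)
        return DataLoader(dataset, batch_size=per_gpu_batch_size, sampler=sampler,
                          num_workers=4, pin_memory=True)

Calling `loader.sampler.set_epoch(epoch)` at the start of every epoch keeps the shuffling different across epochs.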