I am experiencing issues when using a Vision Transformer-based model with DistributedDataParallel (DDP). Everything works fine with DataParallel (DP), but switching to DDP causes the following problems.
Problem 1: CUDA out of memory on rank 0 GPU
This issue doesn't occur with DP. Despite setting the device to rank and calling torch.cuda.empty_cache() before loading the model, memory allocation on rank 0 leads to CUDA out of memory.
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '12355'
torch.cuda.set_device(rank)
torch.cuda.empty_cache()
dist.init_process_group("nccl", rank=rank, world_size=world_size)
encoder = encoder.to(rank)
mlp = mlp.to(rank)
encoder = DDP(encoder, device_ids=[rank])
mlp = DDP(mlp, device_ids=[rank])
...
for bat_idx, (x, label) in enumerate(dataloader):
    x, label = x.to(rank), label.to(rank).long()
    emb = encoder(x)
File "", line 282, in
emb = encoder(x)
File "/home/user/miniconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/miniconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/miniconda3/envs/deploy/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1593, in forward
else self._run_ddp_forward(*inputs, **kwargs)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 430.00 MiB. GPU
Problem 2: CUDA error: invalid device ordinal
When setting CUDA_VISIBLE_DEVICES="2,3,4,5", I get 'CUDA error: invalid device ordinal' when calling torch.cuda.set_device(rank). This error does not occur if I don't set CUDA_VISIBLE_DEVICES.
File "", line 212, in
torch.cuda.set_device(rank)
File "/home/user/miniconda3/envs/deploy/lib/python3.9/site-packages/torch/cuda/__init__.py", line 399, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
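For what it's worth, CUDA_VISIBLE_DEVICES renumbers the devices a process can see: with "2,3,4,5", PyTorch sees four devices indexed 0..3 (physical GPU 2 becomes cuda:0, GPU 5 becomes cuda:3), so passing a physical GPU id such as 4 or 5 to torch.cuda.set_device raises "invalid device ordinal". A small illustration of the remapping, where local_device_index is a made-up helper:

```python
def local_device_index(physical_gpu: int, visible_devices: str) -> int:
    """Map a physical GPU id to its index under CUDA_VISIBLE_DEVICES.

    Hypothetical helper for illustration only; PyTorch does this
    renumbering internally via the CUDA runtime.
    """
    visible = [int(d) for d in visible_devices.split(",")]
    if physical_gpu not in visible:
        raise ValueError(f"GPU {physical_gpu} is not visible")
    return visible.index(physical_gpu)

# Physical GPU 4 is the third entry in "2,3,4,5" -> local index 2.
print(local_device_index(4, "2,3,4,5"))  # → 2
```

So with CUDA_VISIBLE_DEVICES set, ranks 0..world_size-1 should be passed to set_device directly, not the physical GPU ids.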
Any suggestions to address these issues would be greatly appreciated.
Thank you!