Problem with init_method in DDP

I started learning PyTorch recently and I need to parallelize my training with PyTorch DDP. I followed the tutorials on the official website, but I run into a problem when I move from a single GPU to multiple GPUs. I write my code on a Jupyter Notebook server and have direct access to a physical server with GPUs (not a cloud service). I used torchrun to make launching easier; in the code below I modified things to explicitly run on just one GPU. I am also pasting the original error in case someone can spot the issue. I think the problem is in how I write the init_method part, or there is some detail I am missing.
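
As far as I understand, torchrun launches one process per GPU and hands each process its rendezvous information through environment variables (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT). The snippet below is only a tiny debugging sketch (not part of the training script) showing how those variables can be read to check what each process actually receives:

import os

# torchrun exports these per process; printing them confirms the launch setup
for key in ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT"):
    print(key, "=", os.environ.get(key))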

import os

import torch
from torch.utils.data import DataLoader
from torch.distributed import init_process_group, destroy_process_group
from torch.nn.parallel import DistributedDataParallel as DDP


def setup_ddp():
    init_process_group(backend="nccl", init_method="file:///tmp/myfile")


class Trainer:
    def __init__(
        self,
        model: torch.nn.Module,
        train_data: DataLoader,
        optimizer: torch.optim.Optimizer,
        save_every: int,
    ) -> None:
        self.gpu_id = int(os.environ["LOCAL_RANK"])
        self.model = model.to(self.gpu_id)
        self.train_data = train_data
        self.optimizer = optimizer
        self.save_every = save_every
        self.epochs_run = 0
        self.model = DDP(model, device_ids=[self.gpu_id])

def main():
    os.environ["WORLD_SIZE"] = "1"  # explicitly force just one GPU

    # torch.cuda.set_device(rank)
    # training parameters
    total_epochs = 2
    save_every = 1
    batch_size = 32

    setup_ddp()
    dataset, model, optimizer = load_train_objs()
    train_data = prepare_dataloader(dataset, batch_size)
    trainer = Trainer(model, train_data, optimizer, save_every)
    trainer.train(total_epochs)
    # destroy the process group to clean up
    destroy_process_group()
    print("Checkpoint after Clean up")


if __name__ == "__main__":
    main()
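
For comparison, here is a minimal sketch of the setup I was trying to follow from the tutorial, assuming the script is started with torchrun so the env:// rendezvous variables are already set (the name setup_ddp_env and the toy Linear model are only for this sketch, not my real code). The part I am unsure about is binding each rank to its own GPU via LOCAL_RANK:

import os
import torch
from torch.distributed import init_process_group, destroy_process_group
from torch.nn.parallel import DistributedDataParallel as DDP


def setup_ddp_env():
    # torchrun already exports RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT,
    # so env:// needs no extra arguments here
    init_process_group(backend="nccl", init_method="env://")
    # bind this process to its own GPU so ranks do not all end up on device 0
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return local_rank


if __name__ == "__main__":
    local_rank = setup_ddp_env()
    model = torch.nn.Linear(10, 10).to(local_rank)  # toy model just for the sketch
    model = DDP(model, device_ids=[local_rank])
    destroy_process_group()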


ERROR:

python DDP_torchrun.py
Using device: cuda
Using device: cuda
Running DDP test on rank 0.
Using device: cuda
Running DDP test on rank 2.
Checkpoint after setupddp
Using device: cuda
Running DDP test on rank 1.
Checkpoint after setupddp
Checkpoint after setupddp
Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
Checkpoint load train objs
Checkpoint after train data
Checkpoint load train objs
Checkpoint after train data
Checkpoint load train objs
Checkpoint after train data
d784bed9d024:9720:9720 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
d784bed9d024:9720:9720 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
d784bed9d024:9720:9720 [0] NCCL INFO cudaDriverVersion 12000
NCCL version 2.19.3+cuda12.3
d784bed9d024:9722:9722 [0] NCCL INFO cudaDriverVersion 12000
d784bed9d024:9722:9722 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
d784bed9d024:9722:9722 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
d784bed9d024:9722:9780 [0] NCCL INFO Failed to open libibverbs.so[.1]
d784bed9d024:9722:9780 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
d784bed9d024:9722:9780 [0] NCCL INFO Using non-device net plugin version 0
d784bed9d024:9722:9780 [0] NCCL INFO Using network Socket
d784bed9d024:9720:9781 [0] NCCL INFO Failed to open libibverbs.so[.1]
d784bed9d024:9720:9781 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
d784bed9d024:9720:9781 [0] NCCL INFO Using non-device net plugin version 0
d784bed9d024:9720:9781 [0] NCCL INFO Using network Socket
d784bed9d024:9721:9721 [0] NCCL INFO cudaDriverVersion 12000
d784bed9d024:9721:9721 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
d784bed9d024:9721:9721 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
d784bed9d024:9721:9782 [0] NCCL INFO Failed to open libibverbs.so[.1]
d784bed9d024:9721:9782 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
d784bed9d024:9721:9782 [0] NCCL INFO Using non-device net plugin version 0
d784bed9d024:9721:9782 [0] NCCL INFO Using network Socket
d784bed9d024:9721:9782 [0] NCCL INFO comm 0x563ed4aaac90 rank 1 nranks 3 cudaDev 0 nvmlDev 0 busId 18000 commId 0xe4aaadc79d89d14e - Init START
d784bed9d024:9722:9780 [0] NCCL INFO comm 0x56271462c250 rank 2 nranks 3 cudaDev 0 nvmlDev 0 busId 18000 commId 0xe4aaadc79d89d14e - Init START
d784bed9d024:9720:9781 [0] NCCL INFO comm 0x5569f44a8730 rank 0 nranks 3 cudaDev 0 nvmlDev 0 busId 18000 commId 0xe4aaadc79d89d14e - Init START

d784bed9d024:9722:9780 [0] init.cc:824 NCCL WARN Duplicate GPU detected : rank 2 and rank 0 both on CUDA device 18000

d784bed9d024:9720:9781 [0] init.cc:824 NCCL WARN Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 18000

d784bed9d024:9721:9782 [0] init.cc:824 NCCL WARN Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 18000
d784bed9d024:9720:9781 [0] NCCL INFO init.cc:1396 -> 5
d784bed9d024:9722:9780 [0] NCCL INFO init.cc:1396 -> 5
d784bed9d024:9721:9782 [0] NCCL INFO init.cc:1396 -> 5
d784bed9d024:9722:9780 [0] NCCL INFO group.cc:64 -> 5 [Async thread]
d784bed9d024:9721:9782 [0] NCCL INFO group.cc:64 -> 5 [Async thread]
d784bed9d024:9720:9781 [0] NCCL INFO group.cc:64 -> 5 [Async thread]
d784bed9d024:9721:9721 [0] NCCL INFO group.cc:418 -> 5
d784bed9d024:9720:9720 [0] NCCL INFO group.cc:418 -> 5
d784bed9d024:9721:9721 [0] NCCL INFO group.cc:95 -> 5
d784bed9d024:9720:9720 [0] NCCL INFO group.cc:95 -> 5
d784bed9d024:9722:9722 [0] NCCL INFO group.cc:418 -> 5
d784bed9d024:9722:9722 [0] NCCL INFO group.cc:95 -> 5
Traceback (most recent call last):
  File "/home/omid/DDP_torchrun.py", line 158, in
    mp.spawn(main, args=(world_size,), nprocs=n_gpus)
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 241, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 158, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 68, in _wrap
    fn(i, *args)
  File "/home/omid/DDP_torchrun.py", line 143, in main
    trainer = Trainer(model, train_data, optimizer, rank, world_size, save_every)
  File "/home/omid/DDP_torchrun.py", line 65, in __init__
    self.model = DDP(model.cuda(),device_ids=[rank]) # wrap the model with DDP .cuda()
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 798, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/utils.py", line 263, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
torch.distributed.DistBackendError: NCCL error in: …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.19.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
Last error:
Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 18000