Initialize DDP with torch.distributed.run/torchrun

(Reference: Getting Started with Distributed Data Parallel — PyTorch Tutorials 1.12.0+cu102 documentation, section "Initialize DDP with torch.distributed.run/torchrun")

I am trying to run this DDP example, but I am getting this error:
ValueError: The hostname of the rendezvous endpoint ':29400' must be a dot-separated list of labels, an IPv4 address, or an IPv6 address.

I guess you haven’t set the MASTER_ADDR as described in the tutorial?
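
For reference, a minimal sketch of how the rendezvous address can be derived from the SLURM allocation before init_process_group(init_method="env://") is reached; the helper name is made up here, and 29400 is simply the port from the error message:

import os
import subprocess

def set_rendezvous_env(default_port="29400"):
    # first hostname of the allocation becomes the rendezvous/master host
    node_list = os.environ["SLURM_JOB_NODELIST"]          # e.g. "acidsgcn[001-002]"
    first_host = subprocess.run(
        ["scontrol", "show", "hostnames", node_list],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()[0]
    os.environ.setdefault("MASTER_ADDR", first_host)
    os.environ.setdefault("MASTER_PORT", default_port)

When launching with torchrun, the equivalent is typically passing --rdzv_endpoint=$MASTER_ADDR:29400 on the command line, as the tutorial shows.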

The earlier problem is resolved, but I have run into a new one.

With the gloo backend, training gets stuck in loss.backward() and produces the error below:
RuntimeError: […/third_party/gloo/gloo/transport/tcp/pair.cc:589] Read error [10.245.10.159]:15026: Connection reset by peer

With the nccl backend, it gets stuck at
ddp_model = DistributedDataParallel(model, device_ids=[local_rank])
and produces
RuntimeError: NCCL error in: …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, unhandled system error, NCCL version 21.0.3

ncclSystemError: System call (socket, malloc, munmap, etc) failed.

        dist.init_process_group("nccl", init_method='env://', world_size=size, rank=rank)
        gpu = torch.device("cuda",local_rank)

        if args["model"] == "srcnn":
            model = SRCNN.Net().to(gpu)

        ddp_model = DistributedDataParallel(model, device_ids=[local_rank])
        
        #initializing checkpoint
        best_val_metric = 0
        checkpoint_file = args["srcnn_x2"]    
        
        map_location = {'cuda:%d' % 0: 'cuda:%d' % local_rank}

        state_dict = torch.load(checkpoint_file, map_location=map_location)
        model.load_state_dict(state_dict)

        # initializing loss, optimizer, and scheduler
        batch_size = args["train_batch_size"]
        batch_size_per_gpu = batch_size // size
        criterion = torch.nn.MSELoss(reduction='mean').cuda(gpu)
        optimizer = optim.Adam(ddp_model.parameters(), lr=0.01)
        scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=4)

        traindataset = FitsDataset(args, data_list = trainpathlist)     
        valdataset = FitsDataset(args, data_list = valpathlist)  
        
        train_sampler = torch.utils.data.distributed.DistributedSampler(traindataset,
                                                                        num_replicas= size,
                                                                        rank= rank)
        val_sampler = torch.utils.data.distributed.DistributedSampler(valdataset,
                                                                        num_replicas= size,
                                                                        rank= rank)
        
        # DistributedSampler already shards the dataset, so each process uses the per-GPU batch size
        trainLoader = DataLoader(traindataset, batch_size=batch_size_per_gpu, shuffle=False,
                                     num_workers=0, pin_memory=True, sampler=train_sampler)
        
        valLoader = DataLoader(valdataset, batch_size=args["val_batch_size"], shuffle=False, 
                                   num_workers=0, pin_memory=True, sampler=val_sampler)
        
        best_ckpt = {"epoch":-1, "current_val_metric":0, "model":ddp_model.state_dict()}
        epoch = args["start_iters"]

        if  rank == 0: start = datetime.now()         
        total_step = len(trainLoader)

        for epoch in range(args["max_epochs"]):

            # re-seed the DistributedSampler so each epoch gets a different shuffle
            train_sampler.set_epoch(epoch)

            if  rank == 0: start_dataload = time()
            total_step = len(trainLoader)

            for i, ( image, label) in enumerate(trainLoader):
                image = image.to(torch.float32)
                label = label.to(torch.float32)
                
                image = image.to(gpu, non_blocking=True)
                label = label.to(gpu, non_blocking=True)

                if  rank == 0: stop_dataload = time()

                if  rank == 0: start_training = time()

                optimizer.zero_grad()

                #forward pass
                output = ddp_model(image)
                loss = criterion(output, label)

                #backward and optimize
                loss.backward()
                optimizer.step()

                if  rank == 0: stop_training = time() 
                if (i + 1) % 200 == 0 and  rank == 0:
                    print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}, Time data load: {:.3f}ms, Time training: {:.3f}ms'.format(
                        epoch + 1, args["max_epochs"], i + 1, total_step, loss.item(),
                        (stop_dataload - start_dataload)*1000, (stop_training - start_training)*1000))

                if  rank == 0: start_dataload = time()

            #Save checkpoint at every end of epoch
            if  rank == 0:
                torch.save(ddp_model.state_dict(), './checkpoint/{}GPU_{}epoch.checkpoint'.format( size, epoch+1))

        if  rank == 0:
            print(">>> Training complete in: " + str(datetime.now() - start))

Could you rerun the NCCL use case via export NCCL_DEBUG=INFO and post the logs here?
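
For reference, a sketch of setting the variable from Python instead of the batch script; NCCL picks it up when the communicator is initialized, so it has to be in the environment of every rank before the process group is created (WORLD_SIZE and RANK are assumed to be provided by the launcher here):

import os
import torch.distributed as dist

os.environ["NCCL_DEBUG"] = "INFO"   # verbose NCCL setup/transport logging to stderr

dist.init_process_group(
    backend="nccl",
    init_method="env://",
    world_size=int(os.environ["WORLD_SIZE"]),   # assumed set by torchrun / the job script
    rank=int(os.environ["RANK"]),
)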

The same traceback is printed, interleaved, by each of the four ranks:

Traceback (most recent call last):
  File "arctic_run_folders/basecode_multinode_multigpu.py", line 248, in <module>
    main(args, rank, local_rank, size, cpus_per_task, hostnames, gpu_ids, NODE_ID, MASTER_ADDR)
  File "arctic_run_folders/basecode_multinode_multigpu.py", line 46, in main
    ddp_model = DistributedDataParallel(model, device_ids=[local_rank])
  File "/userapp/virtualenv/SR_ENV/venv/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 578, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, unhandled system error, NCCL version 21.0.3
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
srun: error: acidsgcn001: tasks 0-1: Exited with exit code 1
srun: error: acidsgcn002: tasks 2-3: Exited with exit code 1

My job submission file is:

#!/bin/bash
#SBATCH --job-name=pytorch_multinode
#SBATCH -w acidsgcn001,acidsgcn002
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=4
#SBATCH --partition=qGPU48
#SBATCH --mem=128gb
#SBATCH --gres=gpu:V100:2
#SBATCH --time=01:00:00
#SBATCH --mail-type=END,BEGIN,FAIL
#SBATCH --mail-user=sshrestha8@student.gsu.edu
#SBATCH --account=csc344r73
#SBATCH --output=outputs/output_%j
#SBATCH --error=errors/error_%j

cd /scratch
mkdir $SLURM_JOB_ID
cd $SLURM_JOB_ID

iget -r /arctic/projects/csc344s73/nsightdemo/arctic_run_folders

source /userapp/virtualenv/SR_ENV/venv/bin/activate
export NCCL_DEBUG=INFO
srun python -u arctic_run_folders/basecode_multinode_multigpu.py

cd /scratch
icd /arctic/projects/csc344s73
iput -rf $SLURM_JOB_ID

The nccl backend seems to be working after adding export NCCL_IB_DISABLE=1.
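
A quick sanity check (just a sketch) that the export actually reached every rank through srun, since the variable only helps if it is visible inside each process:

import os

assert os.environ.get("NCCL_IB_DISABLE") == "1", \
    "NCCL_IB_DISABLE not visible to this rank -- export it before srun in the batch script"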

Could you check if adding --ipc=host would also work?

Where do I put this flag?

This flag is used for Docker containers. If you are not using any containers, check whether ulimit is set properly.
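
A sketch of inspecting those limits from inside the job with Python's resource module (assuming a Linux node; locked memory and open files are the two limits most commonly checked in this situation):

import resource

def show_limit(name, limit):
    # getrlimit returns a (soft, hard) pair; RLIM_INFINITY corresponds to "unlimited" in ulimit
    soft, hard = resource.getrlimit(limit)
    fmt = lambda v: "unlimited" if v == resource.RLIM_INFINITY else str(v)
    print(f"{name}: soft={fmt(soft)} hard={fmt(hard)}")

show_limit("max locked memory (memlock)", resource.RLIMIT_MEMLOCK)
show_limit("open files (nofile)", resource.RLIMIT_NOFILE)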