DDP command crashes with 3 or more GPUs while the exact same code works with 2 GPUs

A DDP run crashes with 3 or more GPUs, while the exact same code works fine with 2 GPUs. All the GPUs are in one physical server, and I launch the 2-GPU run with the following command, which works without problems:

The command line: torchrun --standalone --nproc_per_node=2 train.py --run_name test


torchrun --standalone --nproc_per_node=2 train.py --run_name test
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Configuration:
          local_rank: 0
         global_rank: 0
[GPU 0]: Setting up distributed training..
[GPU 1]: Setting up distributed training..
[GPU 0]: initiating distributed process with nccl
[GPU 1]: initiating distributed process with nccl
[GPU 1]: setting device with cuda in local rank
[GPU 1]: Starting training
[GPU 1] is using device: cuda
[GPU 1] is loading dataset
[GPU 0]: setting device with cuda in local rank
[GPU 0]: Starting training
[GPU 0] is using device: cuda
[GPU 0] is loading dataset
[GPU 1]: Training Data Size : 50400
[GPU 1]: Validation Data Size : 5600
[GPU 0]: Training Data Size : 50400
[GPU 0]: Validation Data Size : 5600
GPU 1 epoch: 0   current iteration: 0 / max iteration size: 3150
GPU 1 lossD: 3.53178071975708    lossG: 115.01837158203125 batch_time: 2.9853923320770264s
GPU 1 L_adv: 2.7582998275756836 L_id: 0.9624626636505127 L_attr: 3.145153522491455 L_rec: 0.2950518727302551 

GPU 1 L_cycle: 1.0526740550994873 

GPU 1 L_cycle_identity: 1.5273687839508057 

GPU 1 L_contrastive: 2.3706552982330322 

GPU 1 L_source_unet: 1.0193418264389038 

GPU 1 L_target_unet: 0.6420781016349792 

GPU 1 L_shape: 16.778409957885742 

GPU 1 loss_adv_accumulated: 20.0 

GPU 0 epoch: 0   current iteration: 0 / max iteration size: 3150
GPU 0 lossD: 3.509223461151123    lossG: 110.80746459960938 batch_time: 3.074448823928833s
GPU 0 L_adv: 2.7755603790283203 L_id: 0.9654433131217957 L_attr: 3.1384010314941406 L_rec: 0.41165027022361755

However, when I run the exact same code and only change

--nproc_per_node=2

to

--nproc_per_node=3

the multi-GPU run fails with the error below (full command: torchrun --standalone --nproc_per_node=3 train.py --run_name test):

torchrun --standalone --nproc_per_node=3 train.py --run_name test
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Configuration:
          local_rank: 0
         global_rank: 0
[GPU 0]: Setting up distributed training..
[GPU 1]: Setting up distributed training..
[GPU 2]: Setting up distributed training..
[GPU 0]: initiating distributed process with nccl
[GPU 1]: initiating distributed process with nccl
[GPU 2]: initiating distributed process with nccl
[GPU 1]: setting device with cuda in local rank
[GPU 1]: Starting training
[GPU 1] is using device: cuda
[GPU 1] is loading dataset
[GPU 2]: setting device with cuda in local rank
[GPU 2]: Starting training
[GPU 2] is using device: cuda
[GPU 2] is loading dataset
[GPU 0]: setting device with cuda in local rank
[GPU 0]: Starting training
[GPU 0] is using device: cuda
[GPU 0] is loading dataset
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2710 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 2708) of binary: /opt/conda/bin/python3
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/run.py", line 756, in run
    )(*cmd_args)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 248, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
====================================================
train.py FAILED
----------------------------------------------------
Failures:
[1]:
  time      : 2024-04-03_02:06:21
  host      : a84dc08466fb
  rank      : 1 (local_rank: 1)
  exitcode  : -7 (pid: 2709)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 2709
----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-03_02:06:21
  host      : a84dc08466fb
  rank      : 0 (local_rank: 0)
  exitcode  : -7 (pid: 2708)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 2708
====================================================

I have no idea why this is happening. Could anyone enlighten me on how to fix the issue?
Thank you.

Please show the train.py source code.

@xgbj Here is the train.py code, omitting the parts that are not closely related to multi-GPU training. Much of the original code has already been cut out below.


def train_one_epoch():
    # (per-batch setup and the enclosing loop over train_dataloader are omitted here)
            id_ext_src_input, id_ext_tgt_input, Xt_f, Xt_b, Xs_f, Xs_b, same_person = data

            id_ext_src_input = id_ext_src_input.to(args.device)
            id_ext_tgt_input = id_ext_tgt_input.to(args.device)
            
            Xs_f = Xs_f.to(args.device)
            # Xs.shape
            Xt_f = Xt_f.to(args.device)
            # Xt.shape
            same_person = same_person.to(args.device)
            realtime_batch_size = Xt_f.shape[0] 


            mixed_id_embedding, src_id_emb, tgt_id_emb = id_extractor.module.forward(id_ext_src_input, id_ext_tgt_input) ## id_embedding = [B, 769]

            diff_person = torch.ones_like(same_person)

            if args.diff_eq_same:
                same_person = diff_person

            # generator training
            opt_G.zero_grad()

            swapped_face, recon_f_src, recon_f_tgt = G.module.forward(Xt_f, Xs_f, mixed_id_embedding)
            Xt_f_attrs = G.module.CUMAE_tgt(Xt_f) 

            ....
            ....
        
        batch_time = time.time() - start_time
        
        if iteration % args.show_step == 0:
            images = [Xs_f, Xt_f, swapped_face]
            if args.eye_detector_loss:
                Xt_f_eyes_img = paint_eyes(Xt_f, Xt_f_eyes)
                # print(f'eyes: ', {Xt_f_eyes.shape})
                # break
                Yt_f_eyes_img = paint_eyes(swapped_face, swapped_face_eyes)
                images.extend([Xt_f_eyes_img, Yt_f_eyes_img])
            image = make_image_list(images)
            if args.use_wandb:
                wandb.log({"gen_images":wandb.Image(image, caption=f"{epoch:03}" + '_' + f"{iteration:06}")})
            else:
                cv2.imwrite('./images/generated_image.jpg', image[:,:,::-1])
        
        if iteration % 10 == 0:
            print(f'GPU {config["local_rank"]} epoch: {epoch}   current iteration: {iteration} / max iteration size: {len(train_dataloader)}')
            print(f'GPU {config["local_rank"]} lossD: {lossD.item()}    lossG: {lossG.item()} batch_time: {batch_time}s')
            print(f'GPU {config["local_rank"]} L_adv: {L_adv.item()} L_id: {L_id.item()} L_attr: {L_attr.item()} L_rec: {L_rec.item()} \n')
            if args.eye_detector_loss:
                print(f'GPU {config["local_rank"]} L_l2_eyes: {L_l2_eyes.item()} \n')
            if args.landmark_detector_loss:
                print(f'GPU {config["local_rank"]} L_landmarks: {L_landmarks.item()} \n')
            if args.cycle_loss:
                print(f'GPU {config["local_rank"]} L_cycle: {L_cycle.item()} \n')
            # if args.cycle_identity_loss:
                print(f'GPU {config["local_rank"]} L_cycle_identity: {L_cycle_identity.item()} \n')
            if args.contrastive_loss:
                print(f'GPU {config["local_rank"]} L_contrastive: {L_contrastive.item()} \n')
            if args.unet_loss:
                print(f'GPU {config["local_rank"]} L_source_unet: {L_source_unet.item()} \n')    
                print(f'GPU {config["local_rank"]} L_target_unet: {L_target_unet.item()} \n')
            if args.shape_loss:
                print(f'GPU {config["local_rank"]} L_shape: {L_shape.item()} \n')
                
            print(f'GPU {config["local_rank"]} loss_adv_accumulated: {loss_adv_accumulated} \n')
            if args.scheduler:
                print(f'GPU {config["local_rank"]} scheduler_G lr: {scheduler_G.get_last_lr()} scheduler_D lr: {scheduler_D.get_last_lr()} \n')

        if args.use_wandb:
            if args.eye_detector_loss:
                wandb.log({"loss_eyes": L_l2_eyes.item()}, commit=False)
            if args.landmark_detector_loss:
                wandb.log({"loss_landmarks": L_landmarks.item()}, commit=False)
            if args.cycle_loss:
                wandb.log({"loss_cycle": L_cycle.item()}, commit=False)
            # if args.cycle_identity_loss:
                wandb.log({"loss_cycle_identity": L_cycle_identity.item()}, commit=False)
            if args.contrastive_loss:
                wandb.log({"loss_contrastive": L_contrastive.item()}, commit=False)
            if args.unet_loss:
                wandb.log({"loss_source_unet": L_source_unet.item()}, commit=False) 
                wandb.log({"loss_target_unet": L_target_unet.item()}, commit=False)
            if args.shape_loss:
                wandb.log({"loss_shape": L_shape.item()}, commit=False)


                
            wandb.log({
                    "loss_id": L_id.item(),
                    "lossD": lossD.item(),
                    "lossG": lossG.item(),
                    "loss_adv": L_adv.item(),
                    "loss_attr": L_attr.item(),
                    "loss_rec": L_rec.item(),

        if iteration % 10000 == 0:
            
            if config['global_rank'] == 0:

                torch.save({
                    'epoch': epoch,
                    'iteration': iteration,
                    'batch_size': args.batch_size,
                    'model_state_dict': G.module.state_dict(),
                    'optimizer_state_dict': opt_G.state_dict(),
                    'wandb_project': args.wandb_project,
                    'wandb_entity': args.wandb_entity
                }, f'./saved_models_{args.run_name}/G_latest.pth')
                
                print('Generator model checkpoint saved')

                torch.save({
                    'epoch': epoch,
                    'iteration': iteration,
                    'batch_size': args.batch_size,
                    'model_state_dict': D.module.state_dict(),
                    'optimizer_state_dict': opt_D.state_dict(),
                    'wandb_project': args.wandb_project,
                    'wandb_entity': args.wandb_entity
                }, f'./saved_models_{args.run_name}/D_latest.pth')
                           
                print('Discriminator model checkpoint saved')

                torch.save({
                    'epoch': epoch,
                    'iteration': iteration,
                    'batch_size': args.batch_size,
                    'model_state_dict': G.module.state_dict(),
                    'optimizer_state_dict': opt_G.state_dict(),
                    'wandb_project': args.wandb_project,
                    'wandb_entity': args.wandb_entity
                }, f'./current_models_{args.run_name}/G_' + str(epoch)+ '_' + f"{iteration:06}" + '.pth')

                torch.save({
                    'epoch': epoch,
                    'iteration': iteration,
                    'batch_size': args.batch_size,
                    'model_state_dict': D.module.state_dict(),
                    'optimizer_state_dict': opt_D.state_dict(),
                    'wandb_project': args.wandb_project,
                    'wandb_entity': args.wandb_entity
                }, f'./current_models_{args.run_name}/D_' + str(epoch)+ '_' + f"{iteration:06}" + '.pth')

              

        if (iteration % 100 == 0) and (args.use_wandb) and config['global_rank'] == 0:

            G.eval()

            ....
            
            output = np.concatenate((output1, output2), axis=1)

            wandb.log({"our_images":wandb.Image(output, caption=f"{epoch:03}" + '_' + f"{iteration:06}")})

            G.train()


def train(args, config):
    
    ##Multi GPU setting
    assert torch.cuda.is_available(), "Training on CPU is not supported as Multi-GPU strategy is set"
    device = args.device
    print(f"[GPU {config['local_rank']}] is using device: {args.device}")
    print(f"[GPU {config['local_rank']}] is loading dataset")

    # training params
    batch_size = args.batch_size
    max_epoch = args.max_epoch    

    f_3d_path = "/datasets/pretrained/pretrained_model.pth"
    f_id_path = "/datasets/pretrained/backbone.pth"
    id_extractor = ShapeAwareIdentityExtractor(f_3d_path, f_id_path, args.mixed_precision, args.id_mode).to(args.device)
    id_extractor = DistributedDataParallel(id_extractor, device_ids=[config['local_rank']])
    id_extractor.eval()


    G = CrossUnetAttentionGenerator(backbone='unet', num_adain = args.num_adain).to(args.device)
    opt_G = optim.Adam(G.parameters(), lr=args.lr_G, betas=(0, 0.999), weight_decay=1e-4)
    G = DistributedDataParallel(G, device_ids=[config['local_rank']])
    
    D = MultiscaleDiscriminator(input_nc=3, n_layers=5, norm_layer=torch.nn.InstanceNorm2d).to(args.device)
    opt_D = optim.Adam(D.parameters(), lr=args.lr_D, betas=(0, 0.999), weight_decay=1e-4)
    D = DistributedDataParallel(D, device_ids=[config['local_rank']])
    

    if args.mixed_precision == True:  
        netArc = iresnet100(fp16=True)
    else:
        netArc = iresnet100(fp16=False)
    netArc.load_state_dict(torch.load('/datasets/pretrained/backbone.pth'))
    netArc = netArc.to(args.device)
    netArc = DistributedDataParallel(netArc, device_ids=[config['local_rank']])
    netArc.eval()
    



    if args.eye_detector_loss:
        model_ft = models.FAN(4, "False", "False", 98)
        # checkpoint = torch.load('./AdaptiveWingLoss/AWL_detector/WFLW_4HG.pth')
        checkpoint = torch.load('/datasets/pretrained/WFLW_4HG.pth')
        
        if 'state_dict' not in checkpoint:
            model_ft.load_state_dict(checkpoint)
        else:
            pretrained_weights = checkpoint['state_dict']
            model_weights = model_ft.state_dict()
            pretrained_weights = {k: v for k, v in pretrained_weights.items() \
                                  if k in model_weights}
            model_weights.update(pretrained_weights)
            model_ft.load_state_dict(model_weights)
        model_ft = model_ft.to(args.device)
        model_ft = DistributedDataParallel(model_ft, device_ids=[config['local_rank']])
        model_ft.eval()
    else:
        model_ft=None

    if args.scheduler:
        scheduler_G = scheduler.StepLR(opt_G, step_size=args.scheduler_step, gamma=args.scheduler_gamma)
        scheduler_D = scheduler.StepLR(opt_D, step_size=args.scheduler_step, gamma=args.scheduler_gamma)
    else:
        scheduler_G = None
        scheduler_D = None
        
    starting_epoch = 0
    if args.pretrained:
        try:
            G_state = torch.load(f'./saved_models_{args.run_name}/G_latest.pth')
            D_state = torch.load(f'./saved_models_{args.run_name}/D_latest.pth')
            
            G.load_state_dict(G_state['model_state_dict'])
            starting_epoch = G_state['epoch'] + 1
            starting_iteration = G_state['iteration'] + 1
            opt_G.load_state_dict(G_state['optimizer_state_dict'])
            print(f'GPU {config["local_rank"]} - Preloading model ./saved_models_{args.run_name}/G_latest.pt')
            
            D.load_state_dict(D_state['model_state_dict'])
            starting_epoch = D_state['epoch'] + 1
            starting_iteration = D_state['iteration'] + 1
            opt_D.load_state_dict(D_state['optimizer_state_dict'])
            print(f'GPU {config["local_rank"]} - Preloading model ./saved_models_{args.run_name}/D_latest.pt')
                        
            print(f'[GPU {config["local_rank"]}]: Loaded pretrained weights for G and D')
        except FileNotFoundError as e:
            print(f'[GPU {config["local_rank"]}]: Not found pretrained weights. Continue without any pretrained weights.')
    else:
        starting_iteration = 0

    
    dataset = FaceEmbedCombined(ffhq_data_path = args.ffhq_data_path, same_prob=0.8, same_identity=args.same_identity)
    

    dataset_size = len(dataset)
    train_size = int(dataset_size * args.train_ratio)
    validation_size = int(dataset_size - train_size)
    
    train_dataset, validation_dataset = random_split(dataset, [train_size, validation_size])

    print(f'[GPU {config["local_rank"]}]: Training Data Size : {len(train_dataset)}')
    print(f'[GPU {config["local_rank"]}]: Validation Data Size : {len(validation_dataset)}')
    
    train_dataloader = DataLoader(train_dataset, batch_size=args.batch_size, shuffle=False, drop_last=True, sampler=DistributedSampler(train_dataset, shuffle=True))

    valid_dataloader = DataLoader(validation_dataset, batch_size=args.val_batch_size, shuffle=True, drop_last=True)

    loss_adv_accumulated = 20.
    
    for epoch in range(starting_epoch, max_epoch):
        torch.cuda.empty_cache()
        G.train()
        D.train()
        
        train_one_epoch()
        


def main(args):

    config = dict()
    # config.update(vars(args))
    config['local_rank'] = int(os.environ['LOCAL_RANK'])
    config['global_rank'] = int(os.environ['RANK'])

    assert config['local_rank'] != -1, "LOCAL_RANK environment variable not set"
    assert config['global_rank'] != -1, "RANK environment variable not set"
    
    # Print configuration (only once per server)
    if config['local_rank'] == 0:
        print("Configuration:")
        for key, value in config.items():
            print(f"{key:>20}: {value}")  



    if args.use_wandb==True and config['global_rank'] == 0:
        wandb.init(project=args.wandb_project, 
                   entity=args.wandb_entity, 
                   settings=wandb.Settings(start_method='fork'),
                #    id=args.wandb_id,
                   resume='allow')

        
    # Setup distributed training
    print(f'[GPU {config["local_rank"]}]: Setting up distributed training..')
    
    init_process_group(backend='nccl', timeout=datetime.timedelta(seconds=5400))
    print(f'[GPU {config["local_rank"]}]: initiating distributed process with nccl')
    
    torch.cuda.set_device(config['local_rank'])  
    print(f'[GPU {config["local_rank"]}]: setting device with cuda in local rank')
    
    print(f'[GPU {config["local_rank"]}]: Starting training')\
    # train(args, device=device)
    train(args, config)
    
    # Clean up distributed training
    destroy_process_group()
    print(f'[GPU {config["local_rank"]}]: destroyed distributed process after training')




if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    parser.add_argument('--ffhq_data_path', default='/datasets/FFHQ_parsed_img', type=str,help='path to ffhq dataset in string format')
    args = parser.parse_args()
    
    os.makedirs(f'./saved_models_{args.run_name}', exist_ok=True)
    os.makedirs(f'./current_models_{args.run_name}', exist_ok=True)
    
    main(args)

Hello, I cannot see any obvious issues in your code. The initialization steps with --nproc_per_node=3 seem to proceed normally. To determine whether the error occurs during model initialization or dataset initialization, you can add more detailed logging before the random_split of the data in train(). That should help you pinpoint where the problem occurs.
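For reference, here is a minimal sketch of the kind of step-by-step logging suggested above, assuming the same config dict and train() structure as in the posted code; the log_step helper is hypothetical and only illustrates where the extra prints could go:

def log_step(config, message):
    # Print with an explicit flush so the last message before a crash
    # is not lost in a stdout buffer when the process is killed.
    print(f'[GPU {config["local_rank"]}]: {message}', flush=True)

# Example placement inside train(), around the dataset setup:
# log_step(config, 'building FaceEmbedCombined dataset')
# dataset = FaceEmbedCombined(ffhq_data_path=args.ffhq_data_path, same_prob=0.8, same_identity=args.same_identity)
# log_step(config, f'dataset built, size: {len(dataset)}')
# log_step(config, 'running random_split')
# train_dataset, validation_dataset = random_split(dataset, [train_size, validation_size])
# log_step(config, 'random_split done, creating DataLoaders')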

Thank you. I will try what you suggested. Since the problem is not solved yet, I will keep the question open.