I am trying to train a YOLO model with two NVIDIA GPUs on a Linux server with the following versions:
PyTorch Version: 1.7.0
CUDA: 11.0
For distributed training initialization:
if len(gpu_list) > 1:
    # There is more than one GPU index, so use distributed training
    assert distributed_folder, "The distributed-training folder isn't set"  # check it is not the default 0
    distributed_learning_filename = str(Nnet_name) + "_distlearn_setup"  # remove this file when the program is stopped!
    distributed_init_filepath = os.path.join(distributed_folder, distributed_learning_filename)
    dist.init_process_group(backend='nccl',  # use the 'nccl' distributed backend
                            init_method='file://' + str(distributed_init_filepath),  # file used to set up distributed learning
                            world_size=2,  # number of processes taking part in distributed training
                            rank=0)  # rank of this process
It works fine when I use a single GPU, but I get an unexpected error with 2 GPUs. The error is the following:
File "train.py", line 209, in train
rank = distributed_node_rank
File "/home/miniconda3/envs/pytorchDevelopment1.7/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
barrier()
File "/home/miniconda3/envs/pytorchDevelopment1.7/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
work = _default_pg.barrier()
RuntimeError: flock: Input/output error
terminate called after throwing an instance of 'std::system_error'
what(): flock: Input/output error
Aborted (core dumped)
I set world_size to 2 and rank to 0, and the error comes from init_process_group().
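Since the flock failure comes from the shared rendezvous file, one workaround I have seen suggested is to switch from file-based to TCP-based initialization, which needs no file locking at all. A minimal sketch, assuming rank is passed in per process and the address/port are free to choose (the function name and defaults below are illustrative):

```python
import torch.distributed as dist

def init_distributed_tcp(rank, world_size, addr='127.0.0.1', port=29512, backend='nccl'):
    # TCP rendezvous: rank 0 listens on addr:port and the other ranks connect,
    # so no shared file (and no flock) is involved.
    dist.init_process_group(
        backend=backend,
        init_method='tcp://{}:{}'.format(addr, port),
        world_size=world_size,
        rank=rank,
    )
```

For multi-node training, addr would be the reachable IP of the rank-0 machine rather than localhost.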
Any comments or suggestions would be appreciated.