I am trying to train a YOLO model with two NVIDIA GPUs on a Linux server with the following versions:
PyTorch Version: 1.7.0
CUDA: 11.0
For distributed training initialization:
if len(gpu_list) > 1:
    # There is more than one GPU index, so use distributed training
    assert distributed_folder, "The distributed-training folder isn't set"  # check it is not the default 0
    distributed_learning_filename = str(Nnet_name) + "_distlearn_setup"  # remove this file when the program is stopped!
    distributed_init_filepath = os.path.join(distributed_folder, distributed_learning_filename)
    dist.init_process_group(backend='nccl',  # use the 'nccl' distributed backend
                            init_method='file://' + str(distributed_init_filepath),  # file used to set up distributed learning
                            world_size=2,  # number of processes taking part in distributed training
                            rank=0)  # rank of this process
It works fine when I use a single GPU, but I get an unexpected error with 2 GPUs. The error is the following:
File "train.py", line 209, in train
rank = distributed_node_rank
File "/home/miniconda3/envs/pytorchDevelopment1.7/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
barrier()
File "/home/miniconda3/envs/pytorchDevelopment1.7/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
work = _default_pg.barrier()
RuntimeError: flock: Input/output error
terminate called after throwing an instance of 'std::system_error'
what(): flock: Input/output error
Aborted (core dumped)
I set world_size to 2 and rank to 0, and the error comes from init_process_group().
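Since the flock failure comes from the shared rendezvous file, one workaround I have seen suggested is to switch from file-based to TCP-based initialization, which needs no file locking at all. A minimal sketch, assuming rank is passed in per process and the address/port are free to choose (the function name and defaults below are illustrative):

```python
import torch.distributed as dist

def init_distributed_tcp(rank, world_size, addr='127.0.0.1', port=29512, backend='nccl'):
    # TCP rendezvous: rank 0 listens on addr:port and the other ranks connect,
    # so no shared file (and no flock) is involved.
    dist.init_process_group(
        backend=backend,
        init_method='tcp://{}:{}'.format(addr, port),
        world_size=world_size,
        rank=rank,
    )
```

For multi-node training, addr would be the reachable IP of the rank-0 machine rather than localhost.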
Any comments or suggestions would be appreciated.