Hello, everyone. Our model has been training successfully with DDP on 4 GPUs on one node. Many thanks to the friends on this forum for their help.
Now I am trying to train on multiple nodes (ports), and I have no experience with this. We rent 4 computing nodes from an outside company, and they gave me one server IP, 4 port numbers (one per node, each node with 4 GPUs), and a password. I used authorized_keys to set up password-free login across the nodes.
After reading some instructions online, I used the following code and got the error shown further down. I am not sure whether the problem is caused by the login password (although I have set up authorized_keys) or by something else. Also, do I need to learn another platform or tool such as Slurm?
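import argparse
import torch
import torch.distributed
# Assumed import: the original snippet does not show where set_start_method comes from.
from torch.multiprocessing import set_start_method

set_start_method("forkserver")
parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank: the index of this process (and its GPU) on the current node.
parser.add_argument('--local_rank', default=-1, type=int, help='local process rank on this node for distributed training')
args = parser.parse_args()
torch.cuda.set_device(args.local_rank)
torch.cuda.empty_cache()
# env:// reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE, which the launcher sets.
torch.distributed.init_process_group(backend="nccl", init_method='env://')
device = torch.device('cuda', args.local_rank)
# self.BartNN is the model; it is defined elsewhere in the class.
self.BartNN = self.BartNN.to(device)
self.BartNN = torch.nn.parallel.DistributedDataParallel(
    self.BartNN, device_ids=[args.local_rank], output_device=args.local_rank)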
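The snippet above takes the local rank only from the --local_rank argument. As a minimal sketch, assuming the launcher also exports a LOCAL_RANK environment variable to each worker (which, as far as I understand, the torch.distributed.run launcher seen in the traceback does), the rank could be resolved from either source. The helper name resolve_local_rank is hypothetical and not part of the original code:

```python
import argparse
import os


def resolve_local_rank(argv=None, environ=None):
    """Return the local rank from --local_rank if given, else from LOCAL_RANK.

    Hypothetical helper for illustration; returns -1 if neither source is set.
    """
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", default=-1, type=int,
                        help="local process rank on this node")
    # parse_known_args ignores any other options the script may define.
    args, _ = parser.parse_known_args(argv)
    if args.local_rank >= 0:
        return args.local_rank
    # Fall back to the environment variable (assumed to be set by the launcher).
    return int(environ.get("LOCAL_RANK", -1))
```

Calling resolve_local_rank() with no arguments reads sys.argv and os.environ, so it could stand in for the args.local_rank lookups above. This only changes how the rank is read; it is not expected to affect the rendezvous error reported below.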
I then run one of the following commands on each node (port), changing only --node_rank:
NCCL_IB_DISABLE=1 python -m torch.distributed.launch --nproc_per_node=4 --nnodes=4 --node_rank=0 --master_addr="***.**.**.**" --master_port=37692 main.py
NCCL_IB_DISABLE=1 python -m torch.distributed.launch --nproc_per_node=4 --nnodes=4 --node_rank=1 --master_addr="***.**.**.**" --master_port=37692 main.py
NCCL_IB_DISABLE=1 python -m torch.distributed.launch --nproc_per_node=4 --nnodes=4 --node_rank=2 --master_addr="***.**.**.**" --master_port=37692 main.py
NCCL_IB_DISABLE=1 python -m torch.distributed.launch --nproc_per_node=4 --nnodes=4 --node_rank=3 --master_addr="***.**.**.**" --master_port=37692 main.py
(IB is disabled here because of an earlier problem: Pytorch DDP NCCL Error : Call to ibv_reg_mr failed with error Cannot allocate memory - distributed - PyTorch Forums)
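All four commands point at the same --master_addr and --master_port, so every node has to be able to open a TCP connection to that address and port while node 0 is listening. A minimal sketch of how this could be checked with the standard library only is below. The helper name can_connect is hypothetical, and the host and port passed to it would be the same values given to the launcher:

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, can_connect(master_addr, 37692) could be run on nodes 1 to 3 while the node 0 command is waiting for the others to join. A False result would suggest that the rendezvous port is not reachable from that node (for instance, if only the 4 given ports are forwarded), which is one possible, unconfirmed cause of a failure at the TCPStore step.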
Below is the error information.
For node0:
warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Traceback (most recent call last):
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launch.py", line 195, in <module>
main()
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in main
launch(args)
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launch.py", line 176, in launch
run(args)
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 237, in launch_agent
result = agent.run()
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
result = self._invoke_run(role)
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 844, in _invoke_run
self._initialize_workers(self._worker_group)
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 678, in _initialize_workers
self._rendezvous(worker_group)
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 538, in _rendezvous
store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/static_tcp_rendezvous.py", line 55, in next_rendezvous
self._store = TCPStore( # type: ignore[call-arg]
RuntimeError: Stop_waiting response is expected
For node1:
Traceback (most recent call last):
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launch.py", line 195, in <module>
main()
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in main
launch(args)
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launch.py", line 176, in launch
run(args)
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 237, in launch_agent
result = agent.run()
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
Aborted (core dumped)