Hi! I was wondering if anyone else has hit the following error when doing distributed training. My configuration is 8xA100 on a single node.
One thing to note: I hit this error with as few as 2 GPUs on a single node, but the failure rate increases with the number of GPUs. It only happens during the initialization phase; e.g., once training properly starts, the error never occurs. I'm pretty sure it has something to do with the creation of the C10d store (`TCPStore`).
```
  File "train_mae_2d.py", line 120, in train
    run_trainer(
  File "train_mae_2d.py", line 41, in run_trainer
    trainer = make_trainer(
  File "/home/ubuntu/video-recommendation/trainer/trainer.py", line 78, in make_trainer
    return Trainer(
  File "/home/ubuntu/miniconda/envs/video-rec/lib/python3.8/site-packages/composer/trainer/trainer.py", line 781, in __init__
    dist.initialize_dist(self._device, datetime.timedelta(seconds=dist_timeout))
  File "/home/ubuntu/miniconda/envs/video-rec/lib/python3.8/site-packages/composer/utils/dist.py", line 433, in initialize_dist
    dist.init_process_group(device.dist_backend, timeout=timeout)
  File "/home/ubuntu/miniconda/envs/video-rec/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 595, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/ubuntu/miniconda/envs/video-rec/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 257, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/home/ubuntu/miniconda/envs/video-rec/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 188, in _create_c10d_store
    return TCPStore(
RuntimeError: Interrupted system call
```
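Since the failure surfaces only during rendezvous (the `TCPStore` constructor apparently hits `EINTR` and propagates it as a `RuntimeError`), the workaround I'm experimenting with is simply retrying the initialization step. This is a hypothetical sketch of my own, not anything from the Composer or PyTorch APIs; `retry_on_eintr` and the retry parameters are names I made up:

```python
import time


def retry_on_eintr(fn, retries=5, delay=1.0):
    """Retry fn when it fails with EINTR.

    EINTR can surface either as Python's InterruptedError or, when raised
    from PyTorch's C++ TCPStore, as a RuntimeError whose message contains
    "Interrupted system call". Any other exception propagates immediately.
    """
    for attempt in range(retries):
        try:
            return fn()
        except (InterruptedError, RuntimeError) as exc:
            # Re-raise RuntimeErrors that are not the EINTR case.
            if isinstance(exc, RuntimeError) and "Interrupted system call" not in str(exc):
                raise
            # Out of retries: give up and propagate the last error.
            if attempt == retries - 1:
                raise
            time.sleep(delay)


# Usage sketch, wrapping the call that fails in the traceback above:
# trainer = retry_on_eintr(lambda: make_trainer(...))
```

Whether retrying is actually safe here depends on whether a half-initialized store leaves state behind, so I'd be glad to hear if anyone knows the root cause instead.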
Cross-posted here: RuntimeError: Interrupted system call when doing distributed training · Issue #83824 · pytorch/pytorch · GitHub