Hi! I was wondering if anyone else has hit the following error when doing distributed training. My configuration is 8xA100 on a single node.
One thing to note: I hit this error with as few as 2 GPUs on a single node, but the failure rate increases with the number of GPUs. It only happens during the initialization phase; e.g., once training properly starts, the error never occurs. I'm pretty sure it has something to do with the creation of the C10d store (`TCPStore`).
```
  File "train_mae_2d.py", line 120, in train
    run_trainer(
  File "train_mae_2d.py", line 41, in run_trainer
    trainer = make_trainer(
  File "/home/ubuntu/video-recommendation/trainer/trainer.py", line 78, in make_trainer
    return Trainer(
  File "/home/ubuntu/miniconda/envs/video-rec/lib/python3.8/site-packages/composer/trainer/trainer.py", line 781, in __init__
    dist.initialize_dist(self._device, datetime.timedelta(seconds=dist_timeout))
  File "/home/ubuntu/miniconda/envs/video-rec/lib/python3.8/site-packages/composer/utils/dist.py", line 433, in initialize_dist
    dist.init_process_group(device.dist_backend, timeout=timeout)
  File "/home/ubuntu/miniconda/envs/video-rec/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 595, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/ubuntu/miniconda/envs/video-rec/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 257, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/home/ubuntu/miniconda/envs/video-rec/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 188, in _create_c10d_store
    return TCPStore(
RuntimeError: Interrupted system call
```
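Since the failure surfaces only during rendezvous (the `TCPStore` constructor apparently hits `EINTR` and propagates it as a `RuntimeError`), the workaround I'm experimenting with is simply retrying the initialization step. This is a hypothetical sketch of my own, not anything from the Composer or PyTorch APIs; `retry_on_eintr` and the retry parameters are names I made up:

```python
import time


def retry_on_eintr(fn, retries=5, delay=1.0):
    """Retry fn when it fails with EINTR.

    EINTR can surface either as Python's InterruptedError or, when raised
    from PyTorch's C++ TCPStore, as a RuntimeError whose message contains
    "Interrupted system call". Any other exception propagates immediately.
    """
    for attempt in range(retries):
        try:
            return fn()
        except (InterruptedError, RuntimeError) as exc:
            # Re-raise RuntimeErrors that are not the EINTR case.
            if isinstance(exc, RuntimeError) and "Interrupted system call" not in str(exc):
                raise
            # Out of retries: give up and propagate the last error.
            if attempt == retries - 1:
                raise
            time.sleep(delay)


# Usage sketch, wrapping the call that fails in the traceback above:
# trainer = retry_on_eintr(lambda: make_trainer(...))
```

Whether retrying is actually safe here depends on whether a half-initialized store leaves state behind, so I'd be glad to hear if anyone knows the root cause instead.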
Cross-posted here: RuntimeError: Interrupted system call when doing distributed training · Issue #83824 · pytorch/pytorch · GitHub